Bah; belay that. Chalk it up to spending too long trying to turn the
project names into something human ;). The files are MEANT to include
en.zero et al (I'm not entirely sure why those are being split out -
presumably it was a request at some point).
On 11 March 2015 at 00:50, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
Hey,
This may be a known issue, but just in case it isn't: the pageview dumps at
http://dumps.wikimedia.org/other/pagecounts-all-sites/ are meant to
follow the spec set out at
http://dumps.wikimedia.org/other/pagecounts-all-sites/README.txt
Instead, it appears that for (presumably, zero-rated) requests, we're
ending up with lang_code.zero instead of lang_code.project_variant.
Presumably it's a missed use case in the C/Perl...thing we were
using, which got ported to Hive? Check out pagecounts-20150301-000000
for an example.
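For anyone who wants to check a file themselves, here is a rough sketch of how the .zero rows could be tallied. It assumes each line is whitespace-separated with the project code in the first field and the request count in the third; that layout is my assumption, so check it against the README before relying on it. The function name and the sample rows are made up for illustration and are not taken from a real dump.

```python
from collections import Counter


def count_zero_projects(lines):
    """Sum request counts per project code that ends in '.zero'.

    Assumes (hypothetically) lines of the form:
        project_code page_title count bytes
    """
    totals = Counter()
    for line in lines:
        fields = line.split()
        # Skip lines that do not have at least project, title and count.
        if len(fields) < 3:
            continue
        project, count = fields[0], fields[2]
        if project.endswith(".zero") and count.isdigit():
            totals[project] += int(count)
    return totals


# Made-up sample rows, not real dump contents.
sample = [
    "en Main_Page 42 1000",
    "en.zero Main_Page 7 500",
    "de.zero Wikipedia:Hauptseite 3 200",
    "en.zero Cat 1 50",
]

zero_totals = count_zero_projects(sample)
for project, total in sorted(zero_totals.items()):
    print(project, total)
```

To run it against a real file, pass an open file handle instead of the sample list (an hourly dump would need to be decompressed or opened with gzip.open in text mode first).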
I've opened a phabricator ticket at
https://phabricator.wikimedia.org/T92361 - this is just an advisory to
analytics engineers (there is a bug) and to reusers (there is a bug.
We're aware of the bug).
Have fun,
--
Oliver Keyes
Research Analyst
Wikimedia Foundation