Hi all!
Some of you are probably aware of the pagecounts-raw dataset hosted at
http://dumps.wikimedia.org/other/pagecounts-raw/. This week, we are making a
change to how this dataset is generated. The change should be mostly transparent, but we
are announcing it in case anyone notices any differences.
pagecounts-raw has historically been generated by piping the udp2log webrequest logs into
a C program called webstatscollector[1]. This code is fairly old, and the logic it uses
to generate pagecounts is out of date. However, since this data has been public for so
long, we made an effort to continue to support it as is.
We are still in the process of backfilling, but eventually all pagecounts-raw data after
January 1, 2015 will be generated from webrequest data stored in HDFS. This data is
collected using Kafka, and pagecounts-raw is now generated by Hive.
You may see a slight increase in article counts, because the webrequest data in HDFS is
less lossy than the udp2log data.
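If you want to check how large the increase is for your own use, here is a minimal Python
sketch (not part of our pipeline). It assumes each line of an hourly file has four
space-separated fields: project, page title, request count, and bytes. The function names
and the sample lines are made up for illustration; substitute lines read from an hour you
downloaded before and after the switch.

```python
from collections import Counter


def sum_counts_by_project(lines):
    """Sum request counts per project from pagecounts-style lines.

    Assumed line layout (an assumption, not stated in this announcement):
    "<project> <page_title> <count> <bytes>"
    """
    totals = Counter()
    for line in lines:
        parts = line.strip().split(" ")
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed layout
        project, _title, count, _size = parts
        try:
            totals[project] += int(count)
        except ValueError:
            continue  # skip lines whose count is not an integer
    return totals


def relative_change(old_totals, new_totals):
    """Return (new - old) / old for each project present in both totals."""
    changes = {}
    for project, old in old_totals.items():
        if old > 0 and project in new_totals:
            changes[project] = (new_totals[project] - old) / old
    return changes


# Hypothetical sample lines standing in for the same hour from both pipelines.
old_hour = ["en Main_Page 100 5000", "de Wikipedia 50 2000"]
new_hour = ["en Main_Page 103 5150", "de Wikipedia 51 2040"]

changes = relative_change(
    sum_counts_by_project(old_hour),
    sum_counts_by_project(new_hour),
)
for project, change in sorted(changes.items()):
    print(f"{project}: {change:+.1%}")
```

A small positive change per project would be consistent with the less lossy HDFS data.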
By the way, do you know about the pagecounts-all-sites[2] dataset? pagecounts-all-sites
is in a similar format to pagecounts-raw, but comes with more up-to-date pagecount logic.
Most importantly, it includes mobile site pagecounts. Perhaps you should use
pagecounts-all-sites instead of pagecounts-raw, eh? :)
-Andrew Otto
[1] https://github.com/wikimedia/analytics-webstatscollector
[2] http://dumps.wikimedia.org/other/pagecounts-all-sites/