Hi all!

Some of you are probably aware of the pagecounts-raw dataset hosted at http://dumps.wikimedia.org/other/pagecounts-raw/. This week, we are making a change to how this dataset is generated. This should be mostly transparent, but we are announcing it in case anyone notices any differences.

pagecounts-raw has historically been generated by piping the udp2log webrequest logs into a C program called webstatscollector[1]. This code is fairly old, and the logic it uses to generate pagecounts is out of date. However, since this data has been public for so long, we made an effort to continue to support it as is.

We are still in the process of backfilling, but eventually all pagecounts-raw data after January 1, 2015 will be generated from webrequest data stored in HDFS. This data is collected using Kafka, and pagecounts-raw is now generated by Hive.

You may see a slight increase in article counts, because the webrequest data in HDFS is less lossy than the udp2log data.
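
If you want to check the size of that increase against files you already have, here is a rough Python sketch. It assumes each line has four space-separated fields (project code, page title, request count, bytes transferred), and the function names are made up for illustration. Treat it as a starting point, not as how the dataset itself is produced.

```python
from collections import Counter


def parse_line(line):
    """Return (project, title, count) for one line, or None if malformed.

    Assumes four space-separated fields: project, title, count, bytes.
    """
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None
    project, title, count, _bytes = parts
    try:
        return project, title, int(count)
    except ValueError:
        return None


def totals_by_project(lines):
    """Sum request counts per project, skipping malformed lines."""
    totals = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            project, _title, count = parsed
            totals[project] += count
    return totals


def relative_change(old, new):
    """Per-project relative change from old to new totals.

    Projects whose old total is zero are skipped to avoid dividing by zero.
    """
    return {
        project: (new.get(project, 0) - old_total) / old_total
        for project, old_total in old.items()
        if old_total > 0
    }
```

To use it, open an old and a regenerated copy of the same hour with gzip.open(path, "rt", encoding="utf-8"), pass each file object to totals_by_project, and hand the two results to relative_change.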

By the way, do you know about the pagecounts-all-sites[2] dataset? pagecounts-all-sites is in a similar format to pagecounts-raw, but comes with more up-to-date pagecount logic. Most importantly, it includes mobile site pagecounts. Perhaps you should use pagecounts-all-sites instead of pagecounts-raw, eh? :)

-Andrew Otto

[1] https://github.com/wikimedia/analytics-webstatscollector

[2] http://dumps.wikimedia.org/other/pagecounts-all-sites/