Hi George,

I don't really know about historical numbers :( I am forwarding your message to the Analytics mailing list to get some more help :)

Cheers,
Joseph
---------- Forwarded message ----------
From: George Gkotsis gkotsis@gmail.com
Date: Mon, Sep 14, 2015 at 2:36 PM
Subject: corrupted and missing log files
To: kleduc@wikimedia.org, aotto@wikimedia.org, mforns@wikimedia.org, joal@wikimedia.org
Greetings Wikimedia Analytics team!
First, thanks for your amazing work! It has a huge impact on everyone, including researchers like me.
My name is George Gkotsis and I am a post-doctoral research fellow at King's College London. I have recently finished downloading the massive weblog files dataset and I am trying to "tame" the beast. As part of this process, I am reading all the .gz files that concern Wikimedia page visits (downloaded from http://dumps.wikimedia.org/other/pagecounts-raw/*).
Unless I am mistaken, I have found cases of either missing or corrupt archives. Below are a few examples that I sampled at random:
*Missing:*
http://dumps.wikimedia.org/other/pagecounts-raw/2010/2010-07/pagecounts-2010...
http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-2008...
http://dumps.wikimedia.org/other/pagecounts-raw/2009/2009-09/pagecounts-2009...

*Corrupted:*
pagecounts-20080304-030000.gz
pagecounts-20080304-140000.gz
pagecounts-20080304-150000.gz
pagecounts-20090921-160000.gz
(the list is quite long and I haven't finished processing it, but I can give you a full log file)
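For reference, one way such a corruption check could be done is sketched below. This is only a minimal illustration, not necessarily the procedure used to produce the list above: the function names `is_gzip_intact` and `find_corrupt_archives` and the `pagecounts-*.gz` glob pattern are assumptions made for the example. It relies on the fact that Python's `gzip` module only verifies the trailing CRC32 and length once a file has been read to the end.

```python
import gzip
import zlib
from pathlib import Path


def is_gzip_intact(path, chunk_size=1 << 20):
    """Return True if the whole archive decompresses without error.

    Hypothetical helper: reads the file to the end in chunks so that the
    gzip trailer (CRC32 and length) is checked, and discards the data.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers invalid gzip headers and CRC mismatches,
        # EOFError covers truncated files, zlib.error covers bad deflate data.
        return False
    return True


def find_corrupt_archives(directory, pattern="pagecounts-*.gz"):
    """Return the sorted names of archives under `directory` that fail the check."""
    return sorted(
        p.name for p in Path(directory).rglob(pattern) if not is_gzip_intact(p)
    )
```

Calling `find_corrupt_archives("/path/to/pagecounts-raw")` (the path is a placeholder) would return the names of the files that could not be fully decompressed. A file that fails here may also simply be an incomplete download, so fetching it again before reporting it is a reasonable first step.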
Could you provide some feedback concerning the above cases?
Best regards,
George