Hi George,
I don't really know about historical numbers :(
I'm forwarding your message to the Analytics mailing list to get you some more help :)
Cheers
Joseph

---------- Forwarded message ----------
From: George Gkotsis <gkotsis@gmail.com>
Date: Mon, Sep 14, 2015 at 2:36 PM
Subject: corrupted and missing log files
To: kleduc@wikimedia.org, aotto@wikimedia.org, mforns@wikimedia.org, joal@wikimedia.org


Greetings Wikimedia Analytics team!

First, thanks for your amazing work! It has a huge impact on everyone, including researchers like me.

My name is George Gkotsis and I am a post-doctoral research fellow at King's College London. I have recently finished downloading the massive weblog files dataset and I am trying to "tame" the beast. As part of this process, I am reading all .gz files that concern Wikimedia page visits (downloaded from http://dumps.wikimedia.org/other/pagecounts-raw/*).

Unless I am mistaken, I have found cases of both missing and corrupted archives. Below are a few randomly sampled examples:

Missing:

Corrupted:
pagecounts-20080304-030000.gz
pagecounts-20080304-140000.gz
pagecounts-20080304-150000.gz
pagecounts-20090921-160000.gz
(the list is quite long and I haven't finished processing it, but I can give you a full log file)
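
For anyone who wants to reproduce this kind of check, here is a minimal sketch in Python of one way to flag unreadable archives. It is only an illustration under stated assumptions, not the script that produced the list above: the function names `is_gzip_intact` and `find_corrupted` and the flat directory layout are hypothetical, and "corrupted" is taken to mean that the file cannot be decompressed to the end.

```python
import gzip
import zlib
from pathlib import Path


def is_gzip_intact(path, chunk_size=1 << 20):
    """Return True if the whole archive decompresses without error.

    Reading to the end makes the gzip module verify the CRC32 and length
    stored in the trailer, so truncated or damaged files are detected.
    """
    try:
        with gzip.open(path, "rb") as fh:
            # Discard the data; we only care whether reading succeeds.
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad header or CRC mismatch, EOFError a truncated
        # stream, zlib.error invalid compressed data.
        return False


def find_corrupted(directory):
    """Return the sorted names of pagecounts archives that fail the check.

    Assumes (hypothetically) that all files sit directly in `directory`.
    """
    return sorted(
        p.name
        for p in Path(directory).glob("pagecounts-*.gz")
        if not is_gzip_intact(p)
    )
```

Calling `find_corrupted("/path/to/pagecounts-raw")` on a local copy would return the file names that fail to decompress; the path here is a placeholder.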

Could you provide some feedback concerning the above cases?

Best regards,
George

--
/g



--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal