Hi George,
I don't really know about historical numbers :(
I am forwarding your message to the Analytics mailing list to get some more help
:)
Cheers
Joseph
---------- Forwarded message ----------
From: George Gkotsis <gkotsis(a)gmail.com>
Date: Mon, Sep 14, 2015 at 2:36 PM
Subject: corrupted and missing log files
To: kleduc(a)wikimedia.org, aotto(a)wikimedia.org, mforns(a)wikimedia.org,
joal(a)wikimedia.org
Greetings Wikimedia Analytics team!
First, thanks for your amazing work! It has a real impact on
everyone, including researchers like me.
My name is George Gkotsis and I am a post-doctoral research fellow at
King's College London. I have recently finished downloading the massive
weblog files dataset and I am trying to "tame" the beast. As part of this
process, I am reading all .gz files that concern Wikimedia page visits
(downloaded from
http://dumps.wikimedia.org/other/pagecounts-raw/*).
Unless I am mistaken, I have found cases of either missing or corrupt
archives. Below are a few randomly sampled examples:
*Missing:*
http://dumps.wikimedia.org/other/pagecounts-raw/2010/2010-07/pagecounts-201…
http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-200…
http://dumps.wikimedia.org/other/pagecounts-raw/2009/2009-09/pagecounts-200…
*Corrupted:*
pagecounts-20080304-030000.gz
pagecounts-20080304-140000.gz
pagecounts-20080304-150000.gz
pagecounts-20090921-160000.gz
(the list is quite long and I haven't finished processing it, but I can
give you a full log file)
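For anyone wanting to reproduce the corruption check, here is a minimal sketch. It assumes Python's standard gzip module and a local directory holding the downloaded archives. The function names and the "pagecounts-*.gz" glob pattern are hypothetical, and this is not necessarily how the list above was produced.

```python
import gzip
import zlib
from pathlib import Path


def is_gzip_intact(path, chunk_size=1 << 20):
    """Return True if the whole archive decompresses without error."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read to the end so the trailing CRC and length get verified.
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad gzip header or CRC mismatch,
        # EOFError a truncated file, zlib.error a damaged deflate stream.
        return False


def find_corrupt(directory, pattern="pagecounts-*.gz"):
    """Return the sorted names of archives in directory that fail the check."""
    return sorted(
        p.name for p in Path(directory).glob(pattern) if not is_gzip_intact(p)
    )
```

Calling find_corrupt on the download directory returns the file names that could not be fully decompressed. A file that fails may also be an incomplete download, so re-fetching it once before reporting it is probably worthwhile.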
Could you provide some feedback concerning the above cases?
Best regards,
George
--
/g
--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal