Greetings Wikimedia Analytics team!
First, thank you for your amazing work! It has a great impact on everyone, including researchers like me.
My name is George Gkotsis and I am a post-doctoral research fellow at King's College London. I have recently finished downloading the massive weblog files dataset and I am trying to "tame" the beast. As part of this process, I am reading all the .gz files that concern Wikimedia page visits (downloaded from
http://dumps.wikimedia.org/other/pagecounts-raw/*).
Unless I am mistaken, I have found cases of either missing or corrupt archives. Below are a few examples that I sampled at random:
Missing:
Corrupted:
pagecounts-20080304-030000.gz
pagecounts-20080304-140000.gz
pagecounts-20080304-150000.gz
pagecounts-20090921-160000.gz
(The list is quite long and I have not finished processing it yet, but I can send you a full log file.)
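For reference, a check of this kind can be sketched in Python roughly as follows. This is only a minimal illustration under stated assumptions, not necessarily the exact procedure behind the list above. The function names (is_readable_gzip, expected_names, scan) are hypothetical, and the sketch assumes one file per hour named exactly pagecounts-YYYYMMDD-HH0000.gz, which may not hold for every file in the dump.

```python
import gzip
import zlib
from datetime import datetime, timedelta
from pathlib import Path


def is_readable_gzip(path, chunk_size=1 << 20):
    """Return True if the whole archive decompresses without error."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read to the end so truncation and CRC errors are detected.
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers bad headers and CRC mismatches,
        # EOFError covers truncated archives.
        return False


def expected_names(start, end):
    """Yield one file name per hour in [start, end).

    Assumption: names follow pagecounts-YYYYMMDD-HH0000.gz exactly.
    """
    current = start
    while current < end:
        yield current.strftime("pagecounts-%Y%m%d-%H0000.gz")
        current += timedelta(hours=1)


def scan(directory, start, end):
    """Return (missing, corrupted) file names for the given time range."""
    directory = Path(directory)
    missing, corrupted = [], []
    for name in expected_names(start, end):
        path = directory / name
        if not path.exists():
            missing.append(name)
        elif not is_readable_gzip(path):
            corrupted.append(name)
    return missing, corrupted
```

Calling, for example, scan("pagecounts-raw", datetime(2008, 3, 4), datetime(2008, 3, 5)) on a hypothetical local directory would return the two lists for that single day. Reading each archive to the end is slow but catches truncated files that a header-only check would miss.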
Could you provide some feedback concerning the above cases?
Best regards,
George