Hi George,
Server mishaps often had to do with congestion: in the early years, traffic overload was made worse by non-essential routines running in parallel on the same server.
I can't comment on the precise reason, per occasion, why page view count files went missing or became corrupt. We haven't kept a journal for that.
BTW, I correct for these by extrapolating from the remaining files for that month.
Here are the dates I know of in the last 5+ years with corrupt or incomplete counts that could not be repaired:
next if $file ge "projectcounts-20100611-000000" and $file lt "projectcounts-20100617-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20100627-000000" and $file lt "projectcounts-20100628-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20110908-000000" and $file lt "projectcounts-20110915-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20111223-010000" and $file lt "projectcounts-20111226-160000" ; # bad measurements on these dates
next if $file ge "projectcounts-20120413-000000" and $file lt "projectcounts-20120417-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20121214-000000" and $file lt "projectcounts-20130108-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20130723-000000" and $file lt "projectcounts-20130724-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20140105-000000" and $file lt "projectcounts-20140107-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20140827-000000" and $file lt "projectcounts-20140828-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20150803-180000" and $file lt "projectcounts-20150803-230000" ; # bad measurements on these dates
next if $file ge "projectcounts-20150810-150000" and $file lt "projectcounts-20150810-210000" ; # bad measurements on these dates
next if $file ge "projectcounts-20150811-170000" and $file lt "projectcounts-20150811-180000" ; # bad measurements on these dates
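The extrapolation mentioned above could look roughly like the following. This is a minimal Python sketch, not the actual script: the names BAD_RANGES, is_bad and extrapolate_monthly_total are hypothetical, only two of the ranges are copied in, and plain scaling by the number of surviving hourly files is an assumption (it ignores daily and weekly traffic patterns).

```python
import calendar

# Two of the half-open [start, end) ranges from the filter above.
# File names sort chronologically, so plain string comparison works,
# mirroring the Perl "ge" / "lt" tests.
BAD_RANGES = [
    ("projectcounts-20100611-000000", "projectcounts-20100617-000000"),
    ("projectcounts-20100627-000000", "projectcounts-20100628-000000"),
]


def is_bad(file_name):
    """Return True if the hourly file falls inside a known-bad range."""
    return any(start <= file_name < end for start, end in BAD_RANGES)


def extrapolate_monthly_total(hourly_counts, year, month):
    """Estimate a full-month total from the hourly files that are usable.

    hourly_counts maps an hourly file name to its page view count.
    Assumption: missing or bad hours are filled in with the average of
    the usable hours for that month.
    """
    good = {name: count for name, count in hourly_counts.items()
            if not is_bad(name)}
    if not good:
        raise ValueError("no usable hourly files to extrapolate from")
    hours_in_month = calendar.monthrange(year, month)[1] * 24
    return sum(good.values()) * hours_in_month / len(good)
```

For example, extrapolate_monthly_total(counts, 2010, 6) scales the usable June 2010 hours up to the 720 hours in that month.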
Two or three larger periods of massive undercounting are not listed here, as these could mostly be repaired at the per-wiki aggregation level. [1]
One ran for 7 months, and at its peak we lost 1/3 of messages; see http://infodisiac.com/blog/2010/07/wikimedia-page-views-some-good-and-bad-news/
I hope this helps,
Cheers, Erik
[1] By deducing the hourly loss rate per server from the average gap between sequence numbers (which should average 1000 with the sampled log).
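The idea in [1] can be illustrated with a small Python sketch. It rests on assumptions not spelled out in this thread: that messages are lost before sampling, so an average gap of 1500 instead of 1000 means only 1000/1500 of the messages arrived, and that sequence numbers do not wrap around within the hour. The function names estimate_loss_rate and correct_count are hypothetical.

```python
def estimate_loss_rate(sequence_numbers, sampling_interval=1000):
    """Estimate the fraction of messages lost for one server and hour.

    sequence_numbers: sequence numbers seen in the sampled log for that
    server during the hour. Without loss, consecutive sampled entries
    should be about sampling_interval apart on average.
    """
    if len(sequence_numbers) < 2:
        raise ValueError("need at least two sequence numbers")
    ordered = sorted(sequence_numbers)
    average_gap = (ordered[-1] - ordered[0]) / (len(ordered) - 1)
    if average_gap <= 0:
        raise ValueError("sequence numbers must not all be identical")
    # A gap larger than the sampling interval means messages went missing.
    # Clamp at zero so random jitter below the interval is not read as
    # negative loss.
    return max(0.0, 1.0 - sampling_interval / average_gap)


def correct_count(observed_count, loss_rate):
    """Scale an observed count up to compensate for the estimated loss."""
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    return observed_count / (1.0 - loss_rate)
```

With an average gap of 1500 this gives a loss rate of 1/3, and an observed count of 200 would be corrected to about 300.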
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Joseph Allemandou
Sent: Monday, September 14, 2015 15:05
To: George Gkotsis; A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] corrupted and missing log files
Hi George,
I don't really know about historical numbers :(
I am forwarding your message to the Analytics mailing list to get some more help :)
Cheers
Joseph
---------- Forwarded message ----------
From: George Gkotsis <gkotsis@gmail.com>
Date: Mon, Sep 14, 2015 at 2:36 PM
Subject: corrupted and missing log files
To: kleduc@wikimedia.org, aotto@wikimedia.org, mforns@wikimedia.org, joal@wikimedia.org
Greetings Wikimedia Analytics team!
First, thanks for your amazing work! It has an amazing impact on everyone, including researchers like me.
My name is George Gkotsis and I am a post-doctoral research fellow at King's College London. I have recently finished downloading the massive weblog files dataset and I am trying to "tame" the beast. As part of this process, I am reading all .gz files that concern Wikimedia page visits (downloaded from http://dumps.wikimedia.org/other/pagecounts-raw/*).
Unless I am mistaken, I have found cases of either missing or corrupt archives. Below are a few examples that I randomly sampled:
Missing:
Corrupted:
pagecounts-20080304-030000.gz
pagecounts-20080304-140000.gz
pagecounts-20080304-150000.gz
pagecounts-20090921-160000.gz
(the list is quite long and I haven't finished processing it, but I can give you a full log file)
Could you provide some feedback concerning the above cases?
Best regards,
George
--
/g
--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal