Hi George,
Server mishaps were often due to congestion: in the early years, traffic overload was made
worse by non-essential routines running in parallel on the same server.
I can't comment on the precise reason why page view count files went missing or got
corrupted on each occasion. We haven't kept a journal for that.
Here are the dates I know of in the last 5+ years with corrupt or incomplete counts that
could not be repaired; they are listed below as Perl skip conditions.
BTW, I correct for these by extrapolating from the remaining files for that month.
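To make the extrapolation concrete, here is a minimal sketch in Python (not the Perl of the original script). It assumes simple proportional scaling: the usable hourly files are summed and scaled up to the full number of hours in the month. The exact method used in the real script is not stated in this message, and the names `BAD_RANGES`, `is_bad`, `extrapolated_month_total` and the `counts` input are hypothetical.

```python
from calendar import monthrange

# Two of the half-open ranges from this message's list
# (start inclusive, end exclusive), reused purely for illustration.
BAD_RANGES = [
    ("projectcounts-20100611-000000", "projectcounts-20100617-000000"),
    ("projectcounts-20100627-000000", "projectcounts-20100628-000000"),
]


def is_bad(name):
    # Plain string comparison, like Perl's ge/lt: the fixed-width
    # timestamps in the file names sort chronologically.
    return any(start <= name < end for start, end in BAD_RANGES)


def extrapolated_month_total(counts, year, month):
    """Estimate a monthly total from the usable hourly files.

    counts maps file name -> total views in that hourly file
    (a hypothetical input format). Assumption: usable hours are
    representative of the skipped or missing ones.
    """
    good = {name: c for name, c in counts.items() if not is_bad(name)}
    if not good:
        raise ValueError("no usable files for this month")
    hours_in_month = monthrange(year, month)[1] * 24
    return sum(good.values()) * hours_in_month / len(good)
```

Because the scaling uses the number of usable files, hours whose file is missing altogether are covered the same way as hours that are explicitly skipped.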
next if $file ge "projectcounts-20100611-000000" and $file lt "projectcounts-20100617-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20100627-000000" and $file lt "projectcounts-20100628-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20110908-000000" and $file lt "projectcounts-20110915-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20111223-010000" and $file lt "projectcounts-20111226-160000" ; # bad measurements on these dates
next if $file ge "projectcounts-20120413-000000" and $file lt "projectcounts-20120417-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20121214-000000" and $file lt "projectcounts-20130108-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20130723-000000" and $file lt "projectcounts-20130724-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20140105-000000" and $file lt "projectcounts-20140107-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20140827-000000" and $file lt "projectcounts-20140828-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20150803-180000" and $file lt "projectcounts-20150803-230000" ; # bad measurements on these dates
next if $file ge "projectcounts-20150810-150000" and $file lt "projectcounts-20150810-210000" ; # bad measurements on these dates
next if $file ge "projectcounts-20150811-170000" and $file lt "projectcounts-20150811-180000" ; # bad measurements on these dates
Two or three longer periods of massive undercounting are not listed here, as these could
mostly be repaired at the per-wiki aggregation level. [1]
One of them ran for 7 months, and at its peak we lost 1/3 of messages:
http://infodisiac.com/blog/2010/07/wikimedia-page-views-some-good-and-bad-n…
I hope this helps,
Cheers, Erik
[1] By deducing the hourly loss rate per server from the average gap between sequence
numbers (which should average 1000 with the sampled log).
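The arithmetic behind footnote [1] can be sketched as follows, in Python. This is an illustration under assumptions, not the actual repair code: it takes the sequence numbers seen from one server in one hour, compares their average gap with the expected gap of 1000, and treats the shortfall as loss. The function names and the correction step (dividing the observed count by the surviving fraction) are hypothetical.

```python
def loss_rate(sequence_numbers, expected_gap=1000):
    """Estimate the fraction of messages lost for one server and hour.

    With a sampled log, consecutive sequence numbers should be about
    expected_gap apart. A larger average gap means messages were lost.
    """
    if len(sequence_numbers) < 2:
        raise ValueError("need at least two sequence numbers")
    seqs = sorted(sequence_numbers)
    avg_gap = (seqs[-1] - seqs[0]) / (len(seqs) - 1)
    if avg_gap <= 0:
        raise ValueError("sequence numbers do not advance")
    # An average gap below the expected one is treated as no loss.
    return max(0.0, 1.0 - expected_gap / avg_gap)


def corrected_count(observed, loss):
    """Scale an observed count up by the estimated surviving fraction.

    Assumption: losses hit all pages equally, so a uniform factor applies.
    """
    if not 0.0 <= loss < 1.0:
        raise ValueError("loss must be in [0, 1)")
    return observed / (1.0 - loss)
```

For example, an average gap of 1500 instead of 1000 means only 2/3 of the messages arrived, which is the 1/3 loss mentioned above for the worst period.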
From: analytics-bounces(a)lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org]
On Behalf Of Joseph Allemandou
Sent: Monday, September 14, 2015 15:05
To: George Gkotsis; A mailing list for the Analytics Team at WMF and everybody who has an
interest in Wikipedia and analytics.
Subject: Re: [Analytics] corrupted and missing log files
Hi George,
I don't really know about historical numbers :(
I'm forwarding your message to the Analytics mailing list to get some more help :)
Cheers
Joseph
---------- Forwarded message ----------
From: George Gkotsis <gkotsis(a)gmail.com>
Date: Mon, Sep 14, 2015 at 2:36 PM
Subject: corrupted and missing log files
To: kleduc(a)wikimedia.org, aotto(a)wikimedia.org, mforns(a)wikimedia.org, joal(a)wikimedia.org
Greetings Wikimedia Analytics team!
First, thanks for your amazing work! It has an amazing impact on everyone, including
researchers like me.
My name is George Gkotsis and I am a post-doctoral research fellow at King's College
London. I have recently finished downloading the massive weblog files dataset and I am
trying to "tame" the beast. As part of this process, I am reading all .gz files
that concern Wikimedia page visits (downloaded from
http://dumps.wikimedia.org/other/pagecounts-raw/*).
Unless I am mistaken, I have found cases of either missing or corrupt archives. I paste a
few examples I randomly sampled below:
Missing:
http://dumps.wikimedia.org/other/pagecounts-raw/2010/2010-07/pagecounts-201…
http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-200…
http://dumps.wikimedia.org/other/pagecounts-raw/2009/2009-09/pagecounts-200…
Corrupted:
pagecounts-20080304-030000.gz
pagecounts-20080304-140000.gz
pagecounts-20080304-150000.gz
pagecounts-20090921-160000.gz
(the list is quite long and I haven't finished processing it, but I can give you a
full log file)
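For reference, a corruption check of this kind can be sketched in Python as below. It is a minimal sketch, not the check actually used here: it simply decompresses each archive to the end and reports failure if the gzip header, the compressed data, or the end-of-stream marker is bad. The function name `is_gzip_intact` and the chunk size are hypothetical choices.

```python
import gzip
import zlib


def is_gzip_intact(path, chunk_size=1 << 20):
    """Return True if the file at path decompresses cleanly to the end."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read in chunks so large archives never sit in memory whole.
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad header or CRC (gzip.BadGzipFile),
        # EOFError a truncated stream, zlib.error damaged compressed data.
        return False
```

A missing file also makes this return False (it raises FileNotFoundError, a subclass of OSError), so missing and corrupt archives have to be told apart by checking that the path exists first.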
Could you provide some feedback concerning the above cases?
Best regards,
George
--
/g
--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal