Hi George,

I don't really know about historical numbers :(
I forward your message to the Analytics mailing list to get some more help :)

Cheers
Joseph

---------- Forwarded message ----------
From: George Gkotsis gkotsis@gmail.com
Date: Mon, Sep 14, 2015 at 2:36 PM
Subject: corrupted and missing log files
To: kleduc@wikimedia.org, aotto@wikimedia.org, mforns@wikimedia.org, joal@wikimedia.org
Greetings Wikimedia Analytics team!
First, thanks for your amazing work! It has an amazing impact on everyone, including researchers like me.
My name is George Gkotsis and I am a post-doctoral research fellow at King's College London. I have recently finished downloading the massive weblog files dataset and I am trying to "tame" the beast. As part of this process, I am reading all .gz files that concern Wikimedia page visits (downloaded from http://dumps.wikimedia.org/other/pagecounts-raw/*).
Unless I am mistaken, I have found cases of either missing or corrupt archives. I paste a few examples I randomly sampled below:
*Missing:*
http://dumps.wikimedia.org/other/pagecounts-raw/2010/2010-07/pagecounts-2010...
http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-2008...
http://dumps.wikimedia.org/other/pagecounts-raw/2009/2009-09/pagecounts-2009...

*Corrupted:*
pagecounts-20080304-030000.gz
pagecounts-20080304-140000.gz
pagecounts-20080304-150000.gz
pagecounts-20090921-160000.gz
(the list is quite long and I haven't finished processing it, but I can give you a full log file)
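For anyone wanting to reproduce a check like this, here is a minimal Python sketch of one way to flag corrupt archives: decompress each file to the end and treat any error as corruption. The function names and the local directory layout are assumptions made for illustration; this is not necessarily how the list above was produced.

```python
import gzip
import zlib
from pathlib import Path


def is_readable_gzip(path, chunk_size=1 << 20):
    """Return True if the whole archive decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end: truncation and checksum errors only surface
            # once the damaged part of the stream is reached.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers "not a gzip file" and CRC failures,
        # EOFError covers truncated archives.
        return False
    return True


def find_corrupt(directory):
    """Return sorted names of pagecounts-*.gz files that fail to decompress."""
    return sorted(
        path.name
        for path in Path(directory).glob("pagecounts-*.gz")
        if not is_readable_gzip(path)
    )
```

Calling `find_corrupt` on the download directory (a hypothetical local path) would return a list of file names comparable to the one above. Files that are missing entirely need a separate check against the expected hourly names.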
Could you provide some feedback concerning the above cases?
Best regards, George
Hi George,
In the early years, server mishaps often had to do with congestion: traffic overload was made worse by non-essential routines running in parallel on the same server.
I can't comment on the precise reason, on each occasion, why page view count files went missing or became corrupt. We haven't kept a journal of that.
Here are the dates I know of in the last 5+ years with corrupt or incomplete counts that could not be repaired.
BTW, I correct for these by extrapolating from the remaining files for that month. These are the exclusions in my script:
next if $file ge "projectcounts-20100611-000000" and $file lt "projectcounts-20100617-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20100627-000000" and $file lt "projectcounts-20100628-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20110908-000000" and $file lt "projectcounts-20110915-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20111223-010000" and $file lt "projectcounts-20111226-160000" ; # bad measurements on these dates
next if $file ge "projectcounts-20120413-000000" and $file lt "projectcounts-20120417-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20121214-000000" and $file lt "projectcounts-20130108-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20130723-000000" and $file lt "projectcounts-20130724-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20140105-000000" and $file lt "projectcounts-20140107-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20140827-000000" and $file lt "projectcounts-20140828-000000" ; # bad measurements on these dates
next if $file ge "projectcounts-20150803-180000" and $file lt "projectcounts-20150803-230000" ; # bad measurements on these dates
next if $file ge "projectcounts-20150810-150000" and $file lt "projectcounts-20150810-210000" ; # bad measurements on these dates
next if $file ge "projectcounts-20150811-170000" and $file lt "projectcounts-20150811-180000" ; # bad measurements on these dates
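For readers not working in Perl, the same exclusion logic can be sketched in Python. This is only an illustration under stated assumptions: `BAD_RANGES` copies just the first two ranges from the list above, and `monthly_estimates` shows one simple way to extrapolate (scaling the mean of the remaining hourly counts to the full month), which may differ from the exact correction used in the original script.

```python
from collections import defaultdict

# Half-open [start, end) ranges of file names. Only the first two entries of
# the list above are copied here; extend as needed.
BAD_RANGES = [
    ("projectcounts-20100611-000000", "projectcounts-20100617-000000"),
    ("projectcounts-20100627-000000", "projectcounts-20100628-000000"),
]


def is_bad(name):
    """True if `name` falls in a known-bad range (plain string comparison,
    like Perl's `ge` / `lt`)."""
    return any(start <= name < end for start, end in BAD_RANGES)


def monthly_estimates(hourly_counts, hours_in_month):
    """Estimate monthly totals from the hourly counts that remain.

    hourly_counts: {file name: count for that hour}
    hours_in_month: {"YYYYMM": number of hours in that month}
    Assumption: bad hours are replaced by the mean of the good hours.
    """
    kept = defaultdict(list)
    for name, count in hourly_counts.items():
        if not is_bad(name):
            month = name.split("-")[1][:6]  # "20100611" -> "201006"
            kept[month].append(count)
    return {
        month: sum(counts) / len(counts) * hours_in_month[month]
        for month, counts in kept.items()
    }
```

String comparison works here because the timestamp in the file name is zero-padded, so lexicographic order matches chronological order.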
Two or three larger periods of massive undercounting are not listed here, as these could mostly be repaired at the per-wiki aggregation level. [1]
One ran for 7 months, and at its peak we lost 1/3 of messages; see http://infodisiac.com/blog/2010/07/wikimedia-page-views-some-good-and-bad-ne...
I hope this helps,
Cheers, Erik
[1] By deducing the hourly loss rate per server from the average gap between sequence numbers (which should average 1000 with the sampled log).
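As a rough illustration of footnote [1], the sketch below estimates the loss rate from the sequence numbers seen in the sampled log for one server and one hour. It is a reading of the footnote, not the actual repair code: the function name is made up, and it assumes the sequence numbers are in order and that the counter did not reset within the hour.

```python
def loss_rate(sequence_numbers, sampling=1000):
    """Estimate the fraction of messages lost for one server.

    sequence_numbers: sequence numbers found in the sampled log, in order.
    sampling: expected gap between consecutive sampled entries (1000 here).
    """
    if len(sequence_numbers) < 2:
        raise ValueError("need at least two sequence numbers")
    span = sequence_numbers[-1] - sequence_numbers[0]
    average_gap = span / (len(sequence_numbers) - 1)
    if average_gap <= 0:
        raise ValueError("sequence numbers must be increasing")
    # With no loss the average gap equals `sampling`. A larger gap means only
    # sampling / average_gap of the messages arrived.
    return max(0.0, 1.0 - sampling / average_gap)
```

For example, an average gap of 1500 corresponds to a loss rate of one third, and the counts for that hour would be scaled up by 1500 / 1000 to compensate.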
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Joseph Allemandou Sent: Monday, September 14, 2015 15:05 To: George Gkotsis; A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. Subject: Re: [Analytics] corrupted and missing log files
Users also keep a list in the stats.grok.se FAQ: https://en.wikipedia.org/wiki/User:Killiondude/stats#Are_there_known_dates_f...
Nemo