Hi George,

 

Server mishaps were often caused by congestion: in the early years, traffic overload was made worse by non-essential routines running in parallel on the same server.

I can't comment on the precise reason why page view count files went missing or got corrupted on each occasion. We haven't kept a journal of that.

 

Here are the dates I know of in the last 5+ years with corrupt or incomplete counts that could not be repaired (BTW, I correct for these by extrapolating from the remaining files for that month):

 

      next if $file ge "projectcounts-20100611-000000" and $file lt "projectcounts-20100617-000000" ; # bad measurements on these dates

      next if $file ge "projectcounts-20100627-000000" and $file lt "projectcounts-20100628-000000" ; # bad measurements on these dates

      next if $file ge "projectcounts-20110908-000000" and $file lt "projectcounts-20110915-000000" ; # bad measurements on these dates

      next if $file ge "projectcounts-20111223-010000" and $file lt "projectcounts-20111226-160000" ; # bad measurements on these dates

      next if $file ge "projectcounts-20120413-000000" and $file lt "projectcounts-20120417-000000" ; # bad measurements on these dates

      next if $file ge "projectcounts-20121214-000000" and $file lt "projectcounts-20130108-000000" ; # bad measurements on these dates

      next if $file ge "projectcounts-20130723-000000" and $file lt "projectcounts-20130724-000000" ; # bad measurements on these dates

      next if $file ge "projectcounts-20140105-000000" and $file lt "projectcounts-20140107-000000" ; # bad measurements on these dates

      next if $file ge "projectcounts-20140827-000000" and $file lt "projectcounts-20140828-000000" ; # bad measurements on these dates

      next if $file ge "projectcounts-20150803-180000" and $file lt "projectcounts-20150803-230000" ; # bad measurements on these dates

      next if $file ge "projectcounts-20150810-150000" and $file lt "projectcounts-20150810-210000" ; # bad measurements on these dates

      next if $file ge "projectcounts-20150811-170000" and $file lt "projectcounts-20150811-180000" ; # bad measurements on these dates
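
As a rough illustration of the correction mentioned above (skip the bad ranges, then extrapolate from the remaining files for that month), here is a minimal Python sketch. It is not the actual script, which is written in Perl. The function names, the two-entry `BAD_RANGES` subset, and the simple linear scaling by number of hours are all assumptions made for illustration.

```python
import calendar

# Subset of the bad ranges listed above: (inclusive start, exclusive end).
# The comparison is a plain string comparison, as in the Perl lines.
BAD_RANGES = [
    ("projectcounts-20100611-000000", "projectcounts-20100617-000000"),
    ("projectcounts-20100627-000000", "projectcounts-20100628-000000"),
]


def is_bad(file_name):
    """True if the hourly file falls inside one of the known bad ranges."""
    return any(start <= file_name < end for start, end in BAD_RANGES)


def extrapolated_month_total(year, month, hourly_counts):
    """Estimate a monthly total from the usable hourly files only.

    hourly_counts maps a file name such as 'projectcounts-20100601-000000'
    to the count read from that file. Assumption: hours that are bad or
    missing behave like the average usable hour of the same month.
    """
    prefix = "projectcounts-%04d%02d" % (year, month)
    good = [count for name, count in hourly_counts.items()
            if name.startswith(prefix) and not is_bad(name)]
    if not good:
        raise ValueError("no usable files for this month")
    hours_in_month = calendar.monthrange(year, month)[1] * 24
    # Scale the sum of the usable hours up to the full length of the month.
    return sum(good) * hours_in_month / len(good)
```

Scaling by the hourly average ignores daily and weekly traffic patterns, so this is only a first approximation.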

 

Two or three longer periods of massive undercounting are not listed here, as these could mostly be repaired at the per-wiki aggregation level. [1]

One ran for 7 months, and at its peak we lost 1/3 of messages; see http://infodisiac.com/blog/2010/07/wikimedia-page-views-some-good-and-bad-news/

 

I hope this helps,

 

Cheers, Erik

 

[1] By deducing the hourly loss rate per server from the average gap between sequence numbers (which should be 1000 on average in the sampled log).
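
One way to read this footnote: if the average gap between consecutive sequence numbers from one server in one hour is g rather than the expected 1000, then roughly 1 - 1000/g of the messages were lost, and observed counts can be scaled up by g/1000. The Python sketch below works under that assumption. The function names are hypothetical, sequence-number wraparound is ignored, and this is not the actual repair code.

```python
EXPECTED_GAP = 1000  # average gap expected in the sampled log when nothing is lost


def average_gap(sequence_numbers):
    """Average difference between consecutive sequence numbers of one server."""
    ordered = sorted(sequence_numbers)
    if len(ordered) < 2 or ordered[-1] == ordered[0]:
        raise ValueError("need at least two distinct sequence numbers")
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


def loss_rate(sequence_numbers):
    """Estimated fraction of messages lost (0.0 means no loss)."""
    gap = average_gap(sequence_numbers)
    # A gap at or below the expected value is treated as "no loss".
    return max(0.0, 1.0 - EXPECTED_GAP / gap)


def corrected_count(observed_count, sequence_numbers):
    """Scale an observed count up to compensate for the estimated loss."""
    return observed_count / (1.0 - loss_rate(sequence_numbers))
```

For example, an average gap of 1500 corresponds to an estimated loss of one third, so an observed count of 200 would be corrected to about 300.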

 

From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Joseph Allemandou
Sent: Monday, September 14, 2015 15:05
To: George Gkotsis; A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] corrupted and missing log files

 

Hi George,

I don't really know about historical numbers :(

I'm forwarding your message to the Analytics mailing list to get you some more help :)

Cheers

Joseph

 

---------- Forwarded message ----------
From: George Gkotsis <gkotsis@gmail.com>
Date: Mon, Sep 14, 2015 at 2:36 PM
Subject: corrupted and missing log files
To: kleduc@wikimedia.org, aotto@wikimedia.org, mforns@wikimedia.org, joal@wikimedia.org

Greetings Wikimedia Analytics team!

 

First, thanks for your amazing work! It has an amazing impact on everyone, including researchers like me.

 

My name is George Gkotsis and I am a post-doctoral research fellow at King's College London. I have recently finished downloading the massive weblog files dataset and I am trying to "tame" the beast. As part of this process, I am reading all .gz files that concern Wikimedia page visits (downloaded from http://dumps.wikimedia.org/other/pagecounts-raw/*).

 

Unless I am mistaken, I have found cases of either missing or corrupt archives. Below I paste a few randomly sampled examples:

 

Missing:

 

Corrupted:

pagecounts-20080304-030000.gz

pagecounts-20080304-140000.gz

pagecounts-20080304-150000.gz

pagecounts-20090921-160000.gz

(the list is quite long and I haven't finished processing it, but I can give you a full log file)

 

Could you provide some feedback concerning the above cases?

 

Best regards,

George

 

--

/g



 

--

Joseph Allemandou

Data Engineer @ Wikimedia Foundation

IRC: joal