Thanks, Erik. I also noticed the empty title records in the hourly files recently, but I didn't make the connection that they could have been the culprit. To give an example of one type of output I make, here are the most popular articles for different media types over a 3-day span as of yesterday. Your compressed files will definitely open up some new scenarios though.
https://docs.google.com/spreadsheets/d/19IoFHy-U0JInOzi32_iemTXcEmGudeK-jXUDpp5m0UE/edit?usp=sharing



From: ezachte@wikimedia.org
To: analytics@lists.wikimedia.org
Date: Tue, 24 Feb 2015 23:09:53 +0100
Subject: Re: [Analytics] Monthly compressed traffic delay

Michael, a quick heads-up:

 

So I finally found the time to look into this.

Sorry that it took so long.

https://phabricator.wikimedia.org/T90230

Bug has been analyzed and fixed.

 

The underlying problem is a record in an hourly pageview dump with empty title. My script now patches such records with title '-no-title-'.

I filed a separate bug for that: https://phabricator.wikimedia.org/T90629
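
For illustration only, here is a minimal Python sketch of that kind of patch. It is not the actual Wikistats script. It assumes each hourly record is a space-separated line of the form "project title count bytes", and that an empty title shows up either as an empty second field or as a missing field; the function name patch_empty_title is hypothetical.

```python
PLACEHOLDER = "-no-title-"  # placeholder title mentioned in the message above


def patch_empty_title(line: str) -> str:
    """Return the record with a placeholder title if its title is empty.

    Assumed record layout (an assumption, not confirmed by this thread):
    project, title, view count, bytes, separated by single spaces.
    """
    fields = line.rstrip("\n").split(" ")
    if len(fields) == 4 and fields[1] == "":
        # Title field is present but empty, e.g. "en  5 1234".
        fields[1] = PLACEHOLDER
    elif len(fields) == 3:
        # Title field is missing altogether, e.g. "en 5 1234".
        fields.insert(1, PLACEHOLDER)
    return " ".join(fields)


if __name__ == "__main__":
    for raw in ["en Main_Page 5 1234", "en  5 1234", "en 5 1234"]:
        print(patch_empty_title(raw))
```

Records that already carry a title pass through unchanged, so a filter like this could run over every line of an hourly file before aggregation.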

 

Daily aggregation has been restarted and successfully processed data for Jan 27. Now it will take a day or two to catch up.

 

Cheers,

Erik

 

 

From: Erik Zachte [mailto:ezachte@wikimedia.org]
Sent: Thursday, February 19, 2015 4:13
To: 'A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.'
Subject: RE: [Analytics] Monthly compressed traffic delay

 

Hi Michael,

 

Thanks for your offer, I appreciate it.

I've been quite busy in recent weeks, but haven't forgotten about these compressed dumps, and will look into it soon (less than a week).

 

Cheers,

Erik

 

 

From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Michael Hale
Sent: Wednesday, February 18, 2015 15:24
To: analytics@lists.wikimedia.org
Subject: [Analytics] Monthly compressed traffic delay

 

Hello,

I'm inquiring about the delay for publishing the January compressed Wikistats files that are maintained by Erik Zachte. I'm guessing those processes are given a low priority compared to the content backups that need to run. More generally, I'm interested in finding new ways that I can help out. I'm an ex-Microsoftie who is now on the fraud analytics team at TD Bank. I've been involved with the Wikimedia group in Atlanta. I organize the picnic each summer, and helped get the rest of the historic buildings photographed. I've dabbled in reverting vandalism, and I contribute to articles when I actually have something to contribute. I don't feel like I've settled into a contributor role that really fits me yet though.

I enjoy using a variety of the traffic data sets that Wikimedia publishes. It seems the traffic servers get bogged down sometimes though. Can I help? Should I try to get the Atlanta group to pool our donations this year for an extra computer?

Thanks,
Michael


_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics