Here is a shameless plug for existing data files in a highly condensed format, which could serve as input for whatever database we choose.

 

Daily and monthly aggregated files are available.

http://dumps.wikimedia.org/other/pagecounts-ez/merged/

 

Daily files have all cruft included. Only the monthly files are topped off (and that could be changed, of course).

http://dumps.wikimedia.org/other/pagecounts-ez/merged/2013/2013-08/

 

Monthly files come in two variations,

1 article title and monthly total only

2 same, plus hourly data for the whole month in one long comma-separated string, encoded in a highly compressed manner

               

                Hour: from 0 to 23, written as 0 = A, 1 = B ... 22 = W, 23 = X

                Day: from 1 to 31, written as 1 = A, 2 = B ... 25 = Y, 26 = Z, 27 = [, 28 = \, 29 = ], 30 = ^, 31 = _

 

               so 33 views on day 2, hour 4, and 55 views on the same day, hour 7, become BE33,BH55,

               this obviously needs machine unpacking when used
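As a rough illustration of that unpacking, here is a minimal Python sketch that decodes one such string using the letter scheme above (day 1 = A … day 31 = _, hour 0 = A … hour 23 = X). The function name and input string are just examples, not part of any official tooling:

```python
def decode_hourly(packed):
    """Decode a letter-coded hourly string like 'BE33,BH55,' into
    (day, hour, count) tuples.

    Day:  1 = 'A' ... 31 = '_'  (ASCII run past 'Z')
    Hour: 0 = 'A' ... 23 = 'X'
    """
    views = []
    for token in packed.split(','):
        if not token:
            continue  # the trailing comma leaves an empty token
        day = ord(token[0]) - ord('A') + 1   # 'A' -> day 1
        hour = ord(token[1]) - ord('A')      # 'A' -> hour 0
        views.append((day, hour, int(token[2:])))
    return views

# e.g. decode_hourly("BE33,BH55,") -> [(2, 4, 33), (2, 7, 55)]
```

The day letters simply continue up the ASCII table past 'Z', which is why '[', '\', ']', '^' and '_' appear for days 27 through 31.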

 

One goody of these files: they detect missing input files and correct the monthly counts to compensate.

 

PS

Ah, the dreaded demo effect ;-)  I will see why files for the last 40 days have not yet been generated.

 

Erik

 

 

From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dan Andreescu
Sent: Wednesday, October 02, 2013 5:16 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] Back of the envelope data size for "Queryable public interface for pageview data" [was: Re: Queryable public interface for pageview data]

 

On Wed, Oct 2, 2013 at 5:16 AM, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:

Magnus Manske, 02/10/2013 10:12:

Depending on the absolute value of "all costs", I'd prefer #1, or a
combination of #2&#3.

For GLAM (which is what I am mostly involved in), monthly page views
would suffice, and those should be easily done in MySQL.

Daily views would be nice-to-have, but do not need to be in MySQL. [...]


I'd second this. We have partners (but also, say, internal WikiProjects) working on a long tail of tens or hundreds of thousands of pages with their own project: cutting this long tail, including redlinks, would be a greater loss than a decrease in resolution.

 

 

Thank you both for the response, this is very useful to know.  If I'm hearing people correctly so far:

 

* reduced resolution is OK, handle requests for higher resolution data further down the line.

* hacking the data to reduce size is OK if needed, but preferably the hacks should not be lossy.

* a database is not absolutely 100% necessary but is preferred.

 

If that's right, I have an additional question: would a non-relational database be acceptable?  I'm not saying we're planning this, just wondering what people think.  If, for example, the data were available in a public Cassandra cluster, would people be willing to learn how CQL [1] works?

 

 

[1] - http://cassandra.apache.org/doc/cql/CQL.html