Here is a shameless plug for existing data files in a highly condensed format, which could serve as input for whatever database we choose.
Daily and monthly aggregated files are available.
http://dumps.wikimedia.org/other/pagecounts-ez/merged/
Daily files have all cruft included. Only the monthly files are topped off (and that could, of course, be changed).
http://dumps.wikimedia.org/other/pagecounts-ez/merged/2013/2013-08/
Monthly files come in two variations:
1 article title and monthly total only
2 same, plus hourly data for the whole month packed into one long, comma-separated string in a highly compressed format
Hour: from 0 to 23, written as 0 = A, 1 = B ... 22 = W, 23 = X
Day: from 1 to 31, written as 1 = A, 2 = B ... 25 = Y, 26 = Z, 27 = [, 28 = \, 29 = ], 30 = ^, 31 = _
so 33 views on day 2, hour 3 and 55 views on the same day, hour 6 become BD33,BG55,
This obviously needs machine unpacking when used; see the decoding sketch below.
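For anyone wanting to unpack these, here is a minimal decoding sketch in Python (the function name is mine; the format is as described above):

    def decode_hourly(s):
        # Decode a pagecounts-ez hourly string like 'BD33,BG55,' into a
        # dict mapping (day, hour) -> view count. Day letters start at
        # 'A' = day 1; hour letters start at 'A' = hour 0.
        counts = {}
        for token in s.split(','):
            if not token:
                continue  # skip the empty token after the trailing comma
            day = ord(token[0]) - ord('A') + 1   # 'A'..'_' -> days 1..31
            hour = ord(token[1]) - ord('A')      # 'A'..'X' -> hours 0..23
            counts[(day, hour)] = int(token[2:])
        return counts

    # decode_hourly('BD33,BG55,') -> {(2, 3): 33, (2, 6): 55}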
One nice feature of these files: missing input files are detected, and the monthly counts are corrected to compensate.
PS
Ah, the dreaded demo effect ;-) I will look into why the files for the last 40 days have not been generated yet.
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dan Andreescu
Sent: Wednesday, October 02, 2013 5:16 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] Back of the envelope data size for "Queryable public interface for pageview data" [was: Re: Queryable public interface for pageview data]
On Wed, Oct 2, 2013 at 5:16 AM, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
Magnus Manske, 02/10/2013 10:12:
Depending on the absolute value of "all costs", I'd prefer #1, or a combination of #1 and #2.
For GLAM (which is what I am mostly involved in), monthly page views
would suffice, and those should be easily done in MySQL. Daily views would be nice-to-have, but do not need to be in MySQL. [...]
I'd second this. We have partners (but also, say, internal WikiProjects) working on a long tail of tens or hundreds of thousands of pages with their own projects: cutting off this long tail, including redlinks, would be a bigger loss than a decrease in resolution.
Thank you both for the responses, this is very useful to know. If I'm hearing people correctly so far:
* reduced resolution is OK, handle requests for higher resolution data further down the line.
* hacking the data to reduce size is OK if needed, but preferably the hacks should not be lossy.
* a database is not absolutely 100% necessary but is preferred.
If that's right, I have an additional question: would a non-relational database be acceptable? I'm not saying we're planning this, just wondering what people think. If, for example, the data were available in a public Cassandra cluster, would people be willing to learn how CQL [1] works?
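To make the question concrete, here is a hypothetical sketch of what such a query might look like from Python with the DataStax driver; the cluster address, keyspace, table, and column names are all invented purely for illustration:

    from cassandra.cluster import Cluster

    # Hypothetical public cluster and schema, just to show the flavor
    # of CQL; none of these names are real.
    cluster = Cluster(['pageviews.example.org'])
    session = cluster.connect('pageviews')
    rows = session.execute(
        "SELECT month, views FROM monthly "
        "WHERE project = 'en.wikipedia' AND title = 'Main_Page'"
    )
    for row in rows:
        print(row.month, row.views)

The CQL itself reads much like SQL, so the learning curve should be modest.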