There's clearly value in monthly rollups and this should be manageable in MySQL. This is probably the fastest way to get something in the hands of our users.
Is this a worthwhile goal for the first iteration?
Once we see how people are using the data, we can start tackling the harder problems of how to store/serve/query more granular data.
-Toby
On Wed, Oct 2, 2013 at 9:22 AM, Erik Zachte <ezachte@wikimedia.org> wrote:
Ah the dreaded demo effect ;-)
Turns out all files were generated, but rsync failed recently.
So all is up to date now. Monthly aggregation for Sep is running.
Erik
*From:* Erik Zachte [mailto:ezachte@wikimedia.org] *Sent:* Wednesday, October 02, 2013 6:04 PM
*To:* 'A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.' *Subject:* RE: [Analytics] Back of the envelope data size for "Queryable public interface for pageview data" [was: Re: Queryable public interface for pageview data]
Here is a shameless plug for existing data files in a highly condensed format, which could serve as input for whatever database we choose.
Daily and monthly aggregated files are available.
http://dumps.wikimedia.org/other/pagecounts-ez/merged/
Daily files have all cruft included. Only monthly files are topped off (and that could be changed of course).
http://dumps.wikimedia.org/other/pagecounts-ez/merged/2013/2013-08/
Monthly files come in two variations:
1 article title and monthly total only
2 same plus hourly data for one month in one long string in a highly compressed manner, comma separated
Hour: from 0 to 23, written as 0 = A, 1 = B ... 22 = W, 23 = X
Day: from 1 to 31, written as 1 = A, 2 = B ... 25 = Y, 26 = Z, 27 = [, 28 = \, 29 = ], 30 = ^, 31 = _
so 33 views on day 2 hour 4 and 55 on same day hour 7 becomes BD33,BG55,
this obviously needs machine unpacking when used
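A minimal sketch of that unpacking in Python, assuming the letter mapping exactly as stated above (hours counted from 0 = A, days from 1 = A, both as consecutive ASCII characters, with day first and hour second). The function name decode_hourly is hypothetical and not part of any existing tool. Under this mapping BD33 decodes to day 2, hour 3, so the hours in the example above read as if they were counted from 1; it is worth checking against the real files before relying on either reading.

```python
def decode_hourly(packed: str) -> dict[tuple[int, int], int]:
    """Decode a string such as 'BD33,BG55,' into {(day, hour): views}.

    Assumes each item is <day letter><hour letter><count>, with
    day 1 = 'A' ... day 31 = '_' and hour 0 = 'A' ... hour 23 = 'X'.
    """
    counts: dict[tuple[int, int], int] = {}
    for item in packed.split(","):
        if not item:
            continue  # the trailing comma leaves an empty item
        if len(item) < 3:
            raise ValueError(f"item too short: {item!r}")
        day = ord(item[0]) - ord("A") + 1   # 'A' -> 1, 'B' -> 2, ...
        hour = ord(item[1]) - ord("A")      # 'A' -> 0, 'B' -> 1, ...
        if not (1 <= day <= 31 and 0 <= hour <= 23):
            raise ValueError(f"day/hour out of range in item: {item!r}")
        counts[(day, hour)] = int(item[2:])
    return counts


if __name__ == "__main__":
    print(decode_hourly("BD33,BG55,"))
```

Summing the decoded values for one article should reproduce its monthly total, which makes a cheap sanity check on the decoder.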
One goody of these files: they detect missing input files and correct monthly counts to make up for this.
PS
Ah the dreaded demo effect ;-) I will see why files for last 40 days have not yet been generated.
Erik
*From:* analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] *On Behalf Of* Dan Andreescu
*Sent:* Wednesday, October 02, 2013 5:16 PM *To:* A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. *Subject:* Re: [Analytics] Back of the envelope data size for "Queryable public interface for pageview data" [was: Re: Queryable public interface for pageview data]
On Wed, Oct 2, 2013 at 5:16 AM, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
Magnus Manske, 02/10/2013 10:12:
Depending on the absolute value of "all costs", I'd prefer #1, or a combination of #2.
For GLAM (which is what I am mostly involved in), monthly page views would suffice, and those should be easily done in MySQL.
Daily views would be nice-to-have, but do not need to be in MySQL. [...]
I'd second this. We have partners (but also, say, internal WikiProjects) working on a long tail of tens or hundreds of thousands of pages with their own project: cutting this long tail, including redlinks, would be a higher loss than a decrease in resolution.
Thank you both for the response, this is very useful to know. If I'm hearing people correctly so far:
- reduced resolution is OK, handle requests for higher resolution data further down the line.
- hacking the data to reduce size is OK if needed, but preferably the hacks should not be lossy.
- a database is not absolutely 100% necessary but is preferred.
If that's right, I have an additional question: would a non-relational database be acceptable? I'm not saying we're planning this, just wondering what people think. If, for example, the data were available in a public Cassandra cluster, would people be willing to understand how CQL [1] works?
[1] - http://cassandra.apache.org/doc/cql/CQL.html
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics