There's clearly value in monthly rollups and this should be manageable in
MySQL. This is probably the fastest way to get something in the hands of
our users.
Is this a worthwhile goal for the first iteration?
Once we see how people are using the data, we can start tackling the harder
problems of how to store/serve/query more granular data.
-Toby
On Wed, Oct 2, 2013 at 9:22 AM, Erik Zachte <ezachte(a)wikimedia.org> wrote:
Ah the dreaded demo effect ;-)

Turns out all files were generated, but rsync failed recently.
So all is up to date now. Monthly aggregation for Sep is running.

Erik

*From:* Erik Zachte [mailto:ezachte@wikimedia.org]
*Sent:* Wednesday, October 02, 2013 6:04 PM
*To:* 'A mailing list for the Analytics Team at WMF and everybody who has
an interest in Wikipedia and analytics.'
*Subject:* RE: [Analytics] Back of the envelope data size for "Queryable
public interface for pageview data" [was: Re: Queryable public interface
for pageview data]

Here is a shameless plug for existing data files in a highly condensed
format, which could serve as input for whatever database we choose.

Daily and monthly aggregated files are available.
http://dumps.wikimedia.org/other/pagecounts-ez/merged/

Daily files have all cruft included. Only monthly files are topped off
(and that could be changed of course).
http://dumps.wikimedia.org/other/pagecounts-ez/merged/2013/2013-08/

Monthly files come in two variations:
1 article title and monthly total only
2 same plus hourly data for one month in one long string in a highly
compressed manner, comma separated

Hour: from 0 to 23, written as 0 = A, 1 = B ... 22 = W, 23 = X
Day: from 1 to 31, written as 1 = A, 2 = B ... 25 = Y, 26 = Z,
27 = [, 28 = \, 29 = ], 30 = ^, 31 = _

so 33 views on day 2 hour 4 and 55 on same day hour 7 becomes BD33,BG55,
this obviously needs machine unpacking when used
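
A minimal Python sketch of the unpacking described above, assuming the
letter mapping exactly as stated (day 1 = A, hour 0 = A) and that each
comma-separated item is a day letter, an hour letter, then a decimal count.
The function name decode_hourly is hypothetical and not part of any
existing tool. Note that under this mapping BD33 reads as day 2, hour 3
(zero-based), so the example above may be counting hours from 1; that is
left unresolved here.

```python
from typing import Dict, Tuple


def decode_hourly(packed: str) -> Dict[Tuple[int, int], int]:
    """Decode a packed string such as 'BD33,BG55,' into {(day, hour): views}.

    Assumed mapping (taken from the description, not verified against the
    real files): day 1..31 -> 'A'..'_', hour 0..23 -> 'A'..'X'.
    """
    counts: Dict[Tuple[int, int], int] = {}
    for item in packed.split(","):
        if not item:
            continue  # skip the empty piece left by a trailing comma
        if len(item) < 3:
            raise ValueError(f"item too short: {item!r}")
        day = ord(item[0]) - ord("A") + 1  # 'A' -> 1 ... '_' -> 31
        hour = ord(item[1]) - ord("A")     # 'A' -> 0 ... 'X' -> 23
        if not (1 <= day <= 31 and 0 <= hour <= 23):
            raise ValueError(f"day/hour letter out of range: {item!r}")
        views = int(item[2:])              # raises ValueError if not a number
        counts[(day, hour)] = counts.get((day, hour), 0) + views
    return counts


if __name__ == "__main__":
    decoded = decode_hourly("BD33,BG55,")
    print(decoded)
    print(sum(decoded.values()))
```

Summing the decoded values, as in the last line, gives the total of the
hourly counts contained in the string.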

One goody of these files: they detect missing input files and correct
monthly counts to make up for this.

PS
Ah the dreaded demo effect ;-) I will see why files for the last 40 days
have not yet been generated.

Erik

*From:* analytics-bounces(a)lists.wikimedia.org
[mailto:analytics-bounces@lists.wikimedia.org] *On Behalf Of* Dan Andreescu
*Sent:* Wednesday, October 02, 2013 5:16 PM
*To:* A mailing list for the Analytics Team at WMF and everybody who has
an interest in Wikipedia and analytics.
*Subject:* Re: [Analytics] Back of the envelope data size for "Queryable
public interface for pageview data" [was: Re: Queryable public interface
for pageview data]

On Wed, Oct 2, 2013 at 5:16 AM, Federico Leva (Nemo) <nemowiki(a)gmail.com>
wrote:
Magnus Manske, 02/10/2013 10:12:
Depending on the absolute value of "all costs", I'd prefer #1, or a
combination of #2.
For GLAM (which is what I am mostly involved in), monthly page views
would suffice, and those should be easily done in MySQL.
Daily views would be nice-to-have, but do not need to be in MySQL. [...]

I'd second this. We have partners (but also, say, internal WikiProjects)
working on a long tail of tens or hundreds of thousands of pages with their
own project: cutting this long tail, including redlinks, would be a higher
loss than a decrease in resolution.

Thank you both for the response, this is very useful to know. If I'm
hearing people correctly so far:

* reduced resolution is OK, handle requests for higher resolution data
further down the line.
* hacking the data to reduce size is OK if needed, but preferably the
hacks should not be lossy.
* a database is not absolutely 100% necessary but is preferred.

If that's right, I have an additional question: would a non-relational
database be acceptable? I'm not saying we're planning this, just wondering
what people think. If, for example, the data were available in a public
Cassandra cluster, would people be willing to understand how CQL [1] works?

[1] - http://cassandra.apache.org/doc/cql/CQL.html
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics