I agree with Magnus; we decided to do a 'quick-and-dirty' approach that we can deliver in a single sprint (== 2 weeks). I think we defined the MVP as follows:

1) Import data at daily granularity -- yes, we are fully aware of requests for more fine-grained data
2) Import data only for 2013 -- yes, we are fully aware that people are likely to want to query the history
3) Import the data into a MySQL instance in Labs -- yes, this might not scale to many dimensions and/or have sufficient write performance
4) Import the data using a very simple schema as specified in https://mingle.corp.wikimedia.org/projects/analytics/cards/1195 (one fact table, and we can extend it with other dimensions easily)
5) Community members can request a read-only MySQL account to query the data

This is something I believe we can deliver in one sprint -- it just exposes the data as-is.

There are many more requests:

1) Data granularity
2) Cleaning the current data
3) Historic data
4) API
5) etc., etc.

but let's deal with those issues as they are raised by real users. By all standards we would be almost ashamed of releasing this, and I think that's the exact place we should aim for.
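The actual schema lives in the Mingle card referenced above, which is not reproduced in this thread, so the table and column names below are assumptions for illustration only. A minimal sketch of a "one fact table" layout, using sqlite3 as a stand-in for the MySQL instance on Labs:

```python
import sqlite3

# Hypothetical single-fact-table schema; the real one is specified in
# Mingle card 1195. sqlite3 stands in for MySQL here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE pageviews (
        project    TEXT    NOT NULL,   -- e.g. 'en.wikipedia'
        page_title TEXT    NOT NULL,
        view_date  TEXT    NOT NULL,   -- daily granularity, ISO date
        view_count INTEGER NOT NULL,
        PRIMARY KEY (project, page_title, view_date)
    )
""")

# Fabricated sample rows, for illustration only.
cur.executemany(
    "INSERT INTO pageviews VALUES (?, ?, ?, ?)",
    [
        ("en.wikipedia", "Main_Page", "2013-10-01", 100),
        ("en.wikipedia", "Main_Page", "2013-10-02", 120),
    ],
)

# The kind of read-only query a community account might run:
cur.execute(
    "SELECT SUM(view_count) FROM pageviews "
    "WHERE project = ? AND page_title = ?",
    ("en.wikipedia", "Main_Page"),
)
print(cur.fetchone()[0])  # 220
```

Extending this with other dimensions (as point 4 suggests) would just mean adding columns or dimension tables keyed from the fact table.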
D
_______________________________________________
On Wed, Oct 2, 2013 at 1:16 PM, Magnus Manske <magnusmanske@googlemail.com> wrote:
I know I'm not completely unbiased here, but how long would a monthly-only SQL database take to create, compared to the "careful planning" approach?

If it takes a few hours to write a per-month import script that will happily tick away in the background, I'd say go for it, and add more sophisticated things later. If it will take a programmer's week to do, I'd say wait for the survey.
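The per-month import script Magnus describes could be little more than an aggregation pass over the public pagecounts dump files. This is a sketch, not an actual importer: the input format below mirrors the dump lines ("project title count bytes"), and writing the totals into MySQL is left out.

```python
from collections import defaultdict

def monthly_totals(lines):
    """Aggregate pagecount-style lines ('project title count bytes')
    into per-(project, title) totals for one month.

    Sketch only: real dump files are gzipped, one file per hour/day,
    and the result would be inserted into MySQL rather than returned.
    """
    totals = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines
        project, title, count, _bytes = parts
        totals[(project, title)] += int(count)
    return dict(totals)

# Fabricated sample input covering one month:
sample = [
    "en Main_Page 100 5000",
    "en Main_Page 20 900",
    "de Hauptseite 7 300",
]
print(monthly_totals(sample)[("en", "Main_Page")])  # 120
```

A script like this could indeed "tick away in the background", processing one month of dump files per run.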
--
On Wed, Oct 2, 2013 at 6:10 PM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
I think before we settle on a specific data store, we should determine the top queries people are interested in running, whether they expect scripted access to this data or primarily a tool designed for human access, and whether applying a threshold and cutting the long tail of low-traffic articles is a good approach for most consumers of this data.

The GLAM case described by Magnus is pretty well-defined, but I'd like to point out that:

• a large number of Wikipedias point to stats.grok.se from the history page of every single article
• most researchers I've been talking to are interested in daily or hourly pv data per article
• tools with a large user base like https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_pages refresh pv data on a weekly basis

Should we list the requirements for different use cases on a wiki page, where a larger number of people than the participants in this thread can voice their needs?

Dario

On Oct 2, 2013, at 8:16 AM, Dan Andreescu <dandreescu@wikimedia.org> wrote:

On Wed, Oct 2, 2013 at 5:16 AM, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
Magnus Manske, 02/10/2013 10:12:
Daily views would be nice-to-have, but do not need to be in MySQL. [...]

Depending on the absolute value of "all costs", I'd prefer #1, or a combination of #2.

For GLAM (which is what I am mostly involved in), monthly page views would suffice, and those should be easily done in MySQL.
I'd second this. We have partners (but also, say, internal WikiProjects) working on a long tail of tens or hundreds of thousands of pages with their own project: cutting this long tail, including redlinks, would be a higher loss than a decrease in resolution.

Thank you both for the response, this is very useful to know. If I'm hearing people correctly so far:

* reduced resolution is OK; handle requests for higher-resolution data further down the line.
* hacking the data to reduce size is OK if needed, but preferably the hacks should not be lossy.
* a database is not absolutely 100% necessary, but is preferred.

If that's right, I have an additional question: would a non-relational database be acceptable? I'm not saying we're planning this, just wondering what people think. If, for example, the data were available in a public Cassandra cluster, would people be willing to learn how CQL [1] works?

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics