This would be fantastic! I'll even volunteer to write API/web interface code for it if you can have it up and running in two weeks ;-)
(and yes, that was a request for a readonly account ;-)
Cheers, Magnus
On Wed, Oct 2, 2013 at 6:26 PM, Diederik van Liere <dvanliere@wikimedia.org> wrote:
I agree with Magnus; we decided to do a 'quick-and-dirty' approach that we can deliver in a single sprint (== 2 weeks). I think we defined the MVP as follows:
1) Import data at daily granularity -- yes, we are fully aware of requests for more fine-grained data
2) Import data only for 2013 -- yes, we are fully aware that people are likely to want to query the history
3) Import the data into a MySQL instance in Labs -- yes, this might not scale to many dimensions and/or might not have sufficient write performance
4) Import the data using a very simple schema as specified in https://mingle.corp.wikimedia.org/projects/analytics/cards/1195 (one fact table, and we can extend it with other dimensions easily)
5) Community members can request a read-only MySQL account to query the data
This is something I believe we can deliver in one sprint -- it just exposes the data as-is.
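To make the "one fact table" idea concrete, here is a minimal sketch of what such a table and a typical query could look like. Everything in it is an assumption for illustration: the table name `pageviews_daily`, its columns, the helper `monthly_views`, and the sample rows are made up, and the actual schema is whatever the Mingle card specifies. SQLite stands in for the MySQL instance in Labs so the example runs without a server.

```python
import sqlite3

# SQLite in memory is only a stand-in for the MySQL instance in Labs.
conn = sqlite3.connect(":memory:")

# Hypothetical single fact table at daily granularity (names are assumptions,
# not the schema from the Mingle card).
conn.execute("""
    CREATE TABLE pageviews_daily (
        project    TEXT    NOT NULL,
        page_title TEXT    NOT NULL,
        view_date  TEXT    NOT NULL,  -- ISO date, e.g. 2013-10-02
        views      INTEGER NOT NULL,
        PRIMARY KEY (project, page_title, view_date)
    )
""")

# Made-up sample rows, for illustration only.
sample_rows = [
    ("en", "Amsterdam", "2013-09-30", 90),
    ("en", "Amsterdam", "2013-10-01", 120),
    ("en", "Amsterdam", "2013-10-02", 130),
    ("nl", "Amsterdam", "2013-10-01", 300),
]
conn.executemany("INSERT INTO pageviews_daily VALUES (?, ?, ?, ?)", sample_rows)


def monthly_views(conn, project, page_title, month):
    """Sum the daily rows of one page for a month given as 'YYYY-MM'."""
    row = conn.execute(
        "SELECT COALESCE(SUM(views), 0) FROM pageviews_daily "
        "WHERE project = ? AND page_title = ? AND view_date LIKE ?",
        (project, page_title, month + "-%"),
    ).fetchone()
    return row[0]


print(monthly_views(conn, "en", "Amsterdam", "2013-10"))
```

A read-only account would only need SELECT on a table like this; monthly numbers fall out of the daily rows with a plain aggregate, so storing daily data does not rule out the monthly use case.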
There are many more requests:
- Data granularity
- Cleaning the current data
- Historic data
- API
- etc.. etc..
but let's deal with those issues as they are raised by real users. By all standards we would be almost ashamed of releasing this, and I think that's the exact place we should aim for.
D
On Wed, Oct 2, 2013 at 1:16 PM, Magnus Manske <magnusmanske@googlemail.com> wrote:
I know I'm not completely unbiased here, but how long would a monthly-only SQL database take to create, compared to the "careful planning" approach?
If it takes a few hours to write a per-month import script that will happily tick away in the background, I'd say go for it, and add more sophisticated things later.
If it will take a programmer's week to do, I'd say wait for the survey.
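For a sense of scale, the core of a per-month import could be as small as the sketch below. It assumes the input has already been parsed into (project, title, date, views) tuples; that row shape, the function name `rollup_monthly`, and the sample data are all hypothetical, and parsing the real dump files and writing to the database are left out.

```python
from collections import Counter


def rollup_monthly(daily_rows):
    """Sum (project, title, 'YYYY-MM-DD', views) rows into per-month totals.

    Returns a Counter keyed by (project, title, 'YYYY-MM').
    """
    totals = Counter()
    for project, title, day, views in daily_rows:
        month = day[:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[(project, title, month)] += views
    return totals


# Made-up sample rows, for illustration only.
sample = [
    ("en", "Rembrandt", "2013-09-29", 10),
    ("en", "Rembrandt", "2013-09-30", 20),
    ("en", "Rembrandt", "2013-10-01", 5),
    ("de", "Rembrandt", "2013-09-30", 7),
]
monthly = rollup_monthly(sample)
for key, views in sorted(monthly.items()):
    print(key, views)
```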
On Wed, Oct 2, 2013 at 6:10 PM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
I think before we settle on a specific data store, we should determine three things: which queries people are most interested in running; whether they expect scripted access to this data or mainly want to design a tool for human access; and whether applying a threshold and cutting the long tail of low-traffic articles is a good approach for most consumers of this data.
The GLAM case described by Magnus is pretty well-defined, but I'd like to point out that:
- a large number of Wikipedias point to stats.grok.se from the history page of every single article
- most researchers I've been talking to are interested in daily or hourly pv data per article
- tools with a large user base like https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_pages refresh pv data on a weekly basis
Should we list the requirements for different use cases on a wiki page where a larger number of people than the participants in this thread can voice their needs?
Dario
On Oct 2, 2013, at 8:16 AM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
On Wed, Oct 2, 2013 at 5:16 AM, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
Magnus Manske, 02/10/2013 10:12:
Depending on the absolute value of "all costs", I'd prefer #1, or a combination of #2.
For GLAM (which is what I am mostly involved in), monthly page views would suffice, and those should be easily done in MySQL.
Daily views would be nice-to-have, but do not need to be in MySQL. [...]
I'd second this. We have partners (but also, say, internal WikiProjects) working on a long tail of tens or hundreds of thousands of pages with their own project: cutting this long tail, including redlinks, would be a higher loss than a decrease in resolution.
Thank you both for the response, this is very useful to know. If I'm hearing people correctly so far:
- reduced resolution is OK, handle requests for higher resolution data further down the line.
- hacking the data to reduce size is OK if needed, but preferably the hacks should not be lossy.
- a database is not absolutely 100% necessary but is preferred.
If that's right, I have an additional question: would a non-relational database be acceptable? I'm not saying we're planning this, just wondering what people think. If, for example, the data were available in a public Cassandra cluster, would people be willing to learn how CQL [1] works?
[1] - http://cassandra.apache.org/doc/cql/CQL.html
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics