I think before we settle on a specific data store, we should determine what are the top queries people are interested in running, whether they expect to have scripted access to this data or primarily design a tool for human access and whether applying a threshold and cutting the long tail of low-traffic articles is a good approach for most consumers of this data.

The GLAM case described by Magnus is pretty well-defined, but I'd like to point out that:

• a large number of Wikipedias point to stats.grok.se from the history page of every single article
• most researchers I've been talking to are interested in daily or hourly pv data per article
• tools with a large user base like https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_pages refresh pv data on a weekly basis

Should we list the requirements for different use cases on a wiki page where a larger number of people than the participants in this thread can voice their needs?

Dario

On Oct 2, 2013, at 8:16 AM, Dan Andreescu <dandreescu@wikimedia.org> wrote:

On Wed, Oct 2, 2013 at 5:16 AM, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
Magnus Manske, 02/10/2013 10:12:
Depending on the absolute value of "all costs", I'd prefer #1, or a
combination of #2.
For GLAM (which is what I am mostly involved in), monthly page views
would suffice, and those should be easily done in MySQL.
Daily views would be nice-to-have, but do not need to be in MySQL. [...]

I'd second this. We have partners (but also, say, internal WikiProjects) working on a long tail of tens or hundreds of thousands of pages with their own project: cutting this long tail, including redlinks, would be a higher loss than a decrease in resolution.

Thank you both for the response, this is very useful to know. If I'm hearing people correctly so far:

* reduced resolution is OK, handle requests for higher resolution data further down the line.
* hacking the data to reduce size is OK if needed, but preferably the hacks should not be lossy.
* a database is not absolutely 100% necessary but is preferred.

If that's right, I have an additional question: would a non-relational database be acceptable? I'm not saying we're planning this, just wondering what people think. If, for example, the data were available in a public Cassandra cluster, would people be willing to understand how CQL [1] works?

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics