Re: [Analytics] Back of the envelope data size for "Queryable public interface for pageview data" [was: Re: Queryable public interface for pageview data]

3 Oct 2013

I think before we settle on a specific data store, we should determine the top queries people want to run, whether they expect scripted access to this data or mainly a tool for human use, and whether applying a threshold and cutting the long tail of low-traffic articles is a good approach for most consumers of this data.
The GLAM case described by Magnus is pretty well-defined, but I'd like to point out that: 
• a large number of Wikipedias point to stats.grok.se from the history page of every single article
• most researchers I've been talking to are interested in daily or hourly pageview data per article
• tools with a large user base like https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_pages refresh pageview data on a weekly basis
Should we list the requirements for different use cases on a wiki page where a larger number of people than the participants in this thread can voice their needs?
Dario
On Oct 2, 2013, at 8:16 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
...
On Wed, Oct 2, 2013 at 5:16 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Magnus Manske, 02/10/2013 10:12:
Depending on the absolute value of "all costs", I'd prefer #1, or a
combination of #2&#3.
For GLAM (which is what I am mostly involved in), monthly page views
would suffice, and those should be easily done in MySQL.
Daily views would be nice-to-have, but do not need to be in MySQL. [...]
I'd second this. We have partners (but also, say, internal WikiProjects) working on a long tail of tens or hundreds of thousands of pages with their own project: cutting this long tail, including redlinks, would be a bigger loss than a decrease in resolution.
Thank you both for the response, this is very useful to know.  If I'm hearing people correctly so far:

• reduced resolution is OK; requests for higher-resolution data can be handled further down the line.
• hacking the data to reduce size is OK if needed, but preferably the hacks should not be lossy.
• a database is not absolutely 100% necessary but is preferred.
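
To make the trade-off in these points concrete, here is a minimal sketch, not anything we have built. It assumes hypothetical hourly records of (page title, hour, views); the page names, the counts, and the helper names `rollup` and `apply_threshold` are all made up for illustration. Rolling hourly counts up to days or months reduces resolution but keeps every page, while a threshold is lossy and drops the long tail.

```python
from collections import defaultdict

# Hypothetical hourly records: (page title, "YYYY-MM-DDTHH" hour, view count).
# These names and numbers are invented for illustration only.
hourly = [
    ("Mona_Lisa", "2013-09-30T23", 120),
    ("Mona_Lisa", "2013-10-01T00", 95),
    ("Mona_Lisa", "2013-10-01T01", 80),
    ("Obscure_GLAM_item", "2013-10-01T00", 1),
    ("Obscure_GLAM_item", "2013-10-02T14", 2),
]


def rollup(records, key_len):
    """Sum views per (page, time bucket).

    key_len=7 keeps "YYYY-MM" (monthly); key_len=10 keeps "YYYY-MM-DD" (daily).
    """
    totals = defaultdict(int)
    for page, hour, views in records:
        totals[(page, hour[:key_len])] += views
    return dict(totals)


def apply_threshold(totals, min_views):
    """Lossy alternative: drop every bucket with fewer than min_views views."""
    return {key: views for key, views in totals.items() if views >= min_views}


monthly = rollup(hourly, 7)                  # lower resolution, no pages lost
daily = rollup(hourly, 10)                   # finer resolution, more rows
thresholded = apply_threshold(monthly, 10)   # long-tail pages disappear

for key in sorted(monthly):
    print(key, monthly[key], "kept" if key in thresholded else "cut by threshold")
```

In this toy data the monthly rollup still has a row for the low-traffic page, whereas the thresholded version does not, which is the loss the GLAM use case would feel most.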

If that's right, I have an additional question: would a non-relational database be acceptable?  I'm not saying we're planning this, just wondering what people think.  If, for example, the data were available in a public Cassandra cluster, would people be willing to learn how CQL [1] works?
[1] - http://cassandra.apache.org/doc/cql/CQL.html
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
