I think we're on the same page – I'm very much in favor of a quick and dirty
release; I'm just saying that we cannot answer the SQL vs. NoSQL question, or decide
what dimensions to expose, without a better understanding of what data people need:
pushing the data out and letting consumers play with it is by far the best way to
identify requirements.
On Oct 2, 2013, at 10:26 AM, Diederik van Liere <dvanliere(a)wikimedia.org> wrote:
I agree with Magnus; we decided to do a
'quick-and-dirty' approach that we can deliver in a single sprint (== 2 weeks).
I think we defined the MVP as follows:
1) Import data at daily granularity -- yes, we are fully aware of requests for more
fine-grained data
2) Import data only for 2013 -- yes, we are fully aware that people are likely to want to
query the history
3) Import the data into a MySQL instance in Labs -- yes, this might not scale to many
dimensions and/or might not have sufficient write performance
4) Import the data using a very simple schema as specified in
https://mingle.corp.wikimedia.org/projects/analytics/cards/1195 (one fact table that we
can easily extend with other dimensions)
5) Community members can request a read-only MySQL account to query the data
This is something I believe we can deliver in one sprint -- it just exposes the data
as-is.
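To make point 4 concrete, here is a minimal sketch of what a one-fact-table schema could look like. The column names are my assumptions (the Mingle card itself isn't quoted in this thread), and sqlite3 stands in for the MySQL instance in Labs only so the sketch is self-contained and runnable:

```python
import sqlite3

# Hypothetical one-fact-table layout for daily per-article pageview counts.
# Column names are illustrative, not the actual Mingle card 1195 schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pageviews (
        project    TEXT NOT NULL,   -- e.g. 'en.wikipedia'
        page_title TEXT NOT NULL,
        view_date  TEXT NOT NULL,   -- daily granularity, per MVP point 1
        view_count INTEGER NOT NULL,
        PRIMARY KEY (project, page_title, view_date)
    )
""")
conn.execute(
    "INSERT INTO pageviews VALUES (?, ?, ?, ?)",
    ("en.wikipedia", "Main_Page", "2013-10-01", 1234567),
)
row = conn.execute(
    "SELECT view_count FROM pageviews WHERE page_title = 'Main_Page'"
).fetchone()
print(row[0])
```

Adding a new dimension later would then just mean another column (or a small dimension table joined on a key), which is what makes this schema easy to extend.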
There are many more requests:
1) Data granularity
2) Cleaning the current data
3) Historic data
4) API
5) etc.. etc..
but let's deal with those issues as they are raised by real users. By normal standards
we would be almost ashamed of releasing this, and I think that's exactly the place we
should aim for.
D
On Wed, Oct 2, 2013 at 1:16 PM, Magnus Manske <magnusmanske(a)googlemail.com> wrote:
I know I'm not completely unbiased here, but how long would a monthly-only SQL
database take to create, compared to the "careful planning" approach?
If it takes a few hours to write a per-month import script that will happily tick away in
the background, I'd say go for it, and add more sophisticated things later.
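To help gauge whether this is a few hours or a week of work: a per-line parser for the hourly pagecounts dumps (the raw files behind stats.grok.se) is only a few lines. The field layout assumed here -- project code, URL-encoded page title, view count, bytes served, separated by spaces -- should be checked against the actual dump before relying on it:

```python
from urllib.parse import unquote

def parse_pagecounts_line(line):
    """Parse one line of an hourly pagecounts dump.

    Assumed format (verify against the real files): four space-separated
    fields -- project code, URL-encoded page title, view count, bytes served.
    """
    project, title, views, nbytes = line.strip().split(" ")
    return project, unquote(title), int(views), int(nbytes)

print(parse_pagecounts_line("en Main_Page 1234 56789"))
```

A monthly import would then just sum the per-hour counts per title before inserting, which is the kind of script that can tick away in the background.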
If it will take a programmer's week to do, I'd say wait for the survey.
On Wed, Oct 2, 2013 at 6:10 PM, Dario Taraborelli <dtaraborelli(a)wikimedia.org>
wrote:
I think before we settle on a specific data store, we should determine the top
queries people are interested in running, whether they expect scripted access to
this data or primarily a tool designed for human access, and whether applying a threshold
and cutting off the long tail of low-traffic articles is acceptable to most consumers of
this data.
The GLAM case described by Magnus is pretty well-defined, but I'd like to point out
that:
• a large number of Wikipedias point to stats.grok.se from the history page of every
single article
• most researchers I've been talking to are interested in daily or hourly pageview (pv)
data per article
• tools with a large user base like
https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_pages refresh pv data on a weekly
basis
Should we list the requirements for different use cases on a wiki page where a larger
number of people than the participants in this thread can voice their needs?
Dario
On Oct 2, 2013, at 8:16 AM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
On Wed, Oct 2, 2013 at 5:16 AM, Federico Leva
(Nemo) <nemowiki(a)gmail.com> wrote:
Magnus Manske, 02/10/2013 10:12:
Depending on the absolute value of "all costs", I'd prefer #1, or a
combination of #2.
For GLAM (which is what I am mostly involved in), monthly page views
would suffice, and those should be easily done in MySQL.
Daily views would be nice-to-have, but do not need to be in MySQL. [...]
I'd second this. We have partners (but also, say, internal WikiProjects) working on a
long tail of tens or hundreds of thousands of pages for their own projects: cutting this
long tail, including redlinks, would be a greater loss than a decrease in resolution.
Thank you both for the response, this is very useful to know. If I'm hearing people
correctly so far:
* reduced resolution is OK, handle requests for higher resolution data further down the
line.
* hacking the data to reduce size is OK if needed, but preferably the hacks should not be
lossy.
* a database is not absolutely 100% necessary but is preferred.
If that's right, I have an additional question: would a non-relational database be
acceptable? I'm not saying we're planning this, just wondering what people think.
If, for example, the data were available in a public Cassandra cluster, would people
be willing to learn how CQL [1] works?
[1] -
http://cassandra.apache.org/doc/cql/CQL.html
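To make the CQL question concrete, here is a sketch of what a consumer-facing table and query might look like; the table and column names are purely illustrative, not a proposed schema:

```cql
-- Hypothetical table for daily per-article view counts
CREATE TABLE pageviews (
    project    text,
    page_title text,
    view_date  timestamp,
    view_count int,
    PRIMARY KEY ((project, page_title), view_date)
);

-- The kind of query a read-only consumer would run
SELECT view_date, view_count
FROM pageviews
WHERE project = 'en.wikipedia' AND page_title = 'Main_Page';
```

Note the main difference from SQL: queries must filter on the partition key, so ad-hoc exploration across arbitrary dimensions is more constrained than in MySQL.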
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics