Re: [Analytics] Back of the envelope data size for "Queryable public interface for pageview data" [was: Re: Queryable public interface for pageview data]

3 Oct 2013

I agree with Magnus; we decided to take a 'quick-and-dirty' approach that we
can deliver in a single sprint (== 2 weeks).
I think we defined the MVP as follows:
1) Import data at daily granularity -- yes we are fully aware of requests
for more fine-grained data
2) Import data only for 2013 --- yes we are fully aware that people are
likely to want to query the history
3) Import the data into a MySQL instance in Labs -- yes, this might not
scale to many dimensions and/or might lack sufficient write performance
4) Import the data using a very simple schema as specified in
https://mingle.corp.wikimedia.org/projects/analytics/cards/1195 (one fact
table, and we can extend it with other dimensions easily)
5) Community members can request a read-only MySQL account to query the data
This is something I believe we can deliver in one sprint -- it just exposes
the data as-is.
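
As a rough illustration of points 3 to 5 above, here is a minimal sketch of what a single fact table at daily granularity and a read-only style query might look like. The table and column names (`pageviews_daily`, `project`, `page_title`, `view_date`, `views`) are assumptions for illustration, not the schema from the Mingle card, and Python's built-in `sqlite3` stands in for the MySQL instance so the snippet runs anywhere.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL instance in Labs (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical single fact table: one row per project, page, and day.
conn.execute(
    """
    CREATE TABLE pageviews_daily (
        project    TEXT    NOT NULL,
        page_title TEXT    NOT NULL,
        view_date  TEXT    NOT NULL,  -- ISO date, e.g. '2013-10-01'
        views      INTEGER NOT NULL,
        PRIMARY KEY (project, page_title, view_date)
    )
    """
)

# Made-up sample rows, for illustration only.
rows = [
    ("en.wikipedia", "Main_Page", "2013-10-01", 100),
    ("en.wikipedia", "Main_Page", "2013-10-02", 120),
    ("en.wikipedia", "Cassandra", "2013-10-01", 7),
]
conn.executemany("INSERT INTO pageviews_daily VALUES (?, ?, ?, ?)", rows)


def total_views(conn, project, page_title, start, end):
    """Sum daily views for one page over an inclusive ISO date range."""
    cur = conn.execute(
        """
        SELECT COALESCE(SUM(views), 0)
        FROM pageviews_daily
        WHERE project = ? AND page_title = ?
          AND view_date BETWEEN ? AND ?
        """,
        (project, page_title, start, end),
    )
    return cur.fetchone()[0]


print(total_views(conn, "en.wikipedia", "Main_Page", "2013-10-01", "2013-10-31"))
```

Storing the date as an ISO string keeps range filters simple in this sketch; a real MySQL table would more likely use a DATE column, and further dimensions could be added as extra columns.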
There are many more requests:
1) Data granularity
2) Cleaning the current data
3) Historic data
4) API
5) etc.
but let's deal with those issues as they are raised by real users. By all
standards we would be almost ashamed of releasing this and I think that's
the exact place we should aim for.
D
On Wed, Oct 2, 2013 at 1:16 PM, Magnus Manske
magnusmanske@googlemail.com wrote:
...
I know I'm not completely unbiased here, but how long would a monthly-only
SQL database take to create, compared to the "careful planning" approach?
If it takes a few hours to write a per-month import script that will
happily tick away in the background, I'd say go for it, and add more
sophisticated things later.
If it will take a programmer's week to do, I'd say wait for the survey.
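
To give a feel for how small the per-month step could be, here is a minimal sketch of rolling daily counts up to monthly totals. The row layout `(project, page_title, iso_date, views)` and the sample values are assumptions for illustration; the thread does not specify the real input format.

```python
from collections import defaultdict


def monthly_totals(daily_rows):
    """Aggregate (project, page_title, iso_date, views) rows into monthly sums."""
    totals = defaultdict(int)
    for project, page_title, iso_date, views in daily_rows:
        month = iso_date[:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[(project, page_title, month)] += views
    return dict(totals)


# Made-up sample rows, for illustration only.
sample = [
    ("en.wikipedia", "Main_Page", "2013-09-29", 10),
    ("en.wikipedia", "Main_Page", "2013-09-30", 20),
    ("en.wikipedia", "Main_Page", "2013-10-01", 5),
]

for key, views in sorted(monthly_totals(sample).items()):
    print(key, views)
```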
On Wed, Oct 2, 2013 at 6:10 PM, Dario Taraborelli <
dtaraborelli@wikimedia.org> wrote:
...
I think before we settle on a specific data store, we should determine
what are the top queries people are interested in running, whether they
expect to have scripted access to this data or primarily design a tool for
human access and whether applying a threshold and cutting the long tail of
low-traffic articles is a good approach for most consumers of this data.
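
One way to weigh the threshold question is to measure what a given cutoff would drop. The sketch below is only an illustration: the function name `apply_threshold`, the cutoff value, and the sample counts are hypothetical and not taken from real pageview data.

```python
def apply_threshold(page_views, min_views):
    """Keep pages with at least min_views and report what share survives."""
    kept = {page: v for page, v in page_views.items() if v >= min_views}
    total = sum(page_views.values())
    return {
        "pages_kept": len(kept),
        "pages_dropped": len(page_views) - len(kept),
        "share_of_views_kept": sum(kept.values()) / total if total else 0.0,
    }


# Made-up monthly view counts, for illustration only.
sample = {
    "Main_Page": 9000,
    "Cassandra": 900,
    "Obscure_painting": 60,
    "Redlinked_title": 40,
}

print(apply_threshold(sample, min_views=100))
```

A cutoff can retain nearly all views while dropping a large share of pages, which is the trade-off for consumers who work mainly on low-traffic articles.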
The GLAM case described by Magnus is pretty well-defined, but I'd like to
point out that:
• a large number of Wikipedias point to stats.grok.se from the history
page of every single article
• most researchers I've been talking to are interested in daily or hourly
pv data per article
• tools with a large user base like
https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_pages refresh
pv data on a weekly basis
Should we list the requirements for different use cases on a wiki page
where a larger number of people than the participants in this thread can
voice their needs?
Dario
On Oct 2, 2013, at 8:16 AM, Dan Andreescu dandreescu@wikimedia.org
wrote:
On Wed, Oct 2, 2013 at 5:16 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
...
Magnus Manske, 02/10/2013 10:12:
...
Depending on the absolute value of "all costs", I'd prefer #1, or a
combination of #2&#3.
For GLAM (which is what I am mostly involved in), monthly page views
would suffice, and those should be easily done in MySQL.
Daily views would be nice-to-have, but do not need to be in MySQL. [...]
I'd second this. We have partners (but also, say, internal WikiProjects)
working on a long tail of tens or hundreds of thousands of pages with their own
project: cutting this long tail, including redlinks, would be a higher loss
than a decrease in resolution.
Thank you both for the response, this is very useful to know. If I'm
hearing people correctly so far:
• reduced resolution is OK; handle requests for higher-resolution data
further down the line.
• hacking the data to reduce size is OK if needed, but preferably the
hacks should not be lossy.
• a database is not absolutely 100% necessary but is preferred.

If that's right, I have an additional question: would a non-relational
database be acceptable? I'm not saying we're planning this, just wondering
what people think. If, for example, the data were available in a
public Cassandra cluster, would people be willing to learn how CQL
[1] works?
[1] - http://cassandra.apache.org/doc/cql/CQL.html
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics

