This would be fantastic! I'll even volunteer to write API/web interface
code for it if you can have it up and running in two weeks ;-)
(and yes, that was a request for a readonly account ;-)
Cheers,
Magnus
On Wed, Oct 2, 2013 at 6:26 PM, Diederik van Liere
<dvanliere(a)wikimedia.org> wrote:
I agree with Magnus; we decided to take a
'quick-and-dirty' approach that we
can deliver in a single sprint (== 2 weeks).
I think we defined the MVP as follows:
1) Import data at daily granularity -- yes we are fully aware of requests
for more fine-grained data
2) Import data only for 2013 -- yes we are fully aware that people are
likely to want to query the history
3) Import the data into a MySQL instance in Labs -- yes this might not
scale to many dimensions and/or might lack sufficient write performance
4) Import the data using a very simple schema as specified in
https://mingle.corp.wikimedia.org/projects/analytics/cards/1195 (one fact
table that we can easily extend with other dimensions; a sketch follows below)
5) Community members can request a readonly MySQL account to query the
data
This is something I believe we can deliver in one sprint -- it just
exposes the data as-is.
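For those who cannot see the Mingle card, a minimal sketch of what the
one-fact-table schema and a typical readonly query could look like (table
and column names here are illustrative assumptions, not the actual card
1195 schema):

  -- Hypothetical fact table: one row per (project, page, day).
  CREATE TABLE pageview_daily (
      project    VARCHAR(32)     NOT NULL,  -- e.g. 'en.wikipedia'
      page_title VARBINARY(255)  NOT NULL,
      view_date  DATE            NOT NULL,
      view_count BIGINT UNSIGNED NOT NULL,
      PRIMARY KEY (project, page_title, view_date)
  );

  -- Example readonly query: daily views for one article in 2013.
  SELECT view_date, view_count
  FROM pageview_daily
  WHERE project = 'en.wikipedia'
    AND page_title = 'Rosetta_Stone'
    AND view_date BETWEEN '2013-01-01' AND '2013-12-31'
  ORDER BY view_date;

The point of a single fact table is that further dimensions (say, mobile
vs. desktop) can later be added as columns or dimension tables without
breaking existing queries.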
There are many more requests:
1) Data granularity
2) Cleaning the current data
3) Historic data
4) API
5) etc., etc.
but let's deal with those issues as they are raised by real users. By all
standards we would be almost ashamed to release this, and I think that's
exactly the place we should aim for.
D
On Wed, Oct 2, 2013 at 1:16 PM, Magnus Manske
<magnusmanske(a)googlemail.com> wrote:
> I know I'm not completely unbiased here, but how long would a
> monthly-only SQL database take to create, compared to the "careful
> planning" approach?
>
> If it takes a few hours to write a per-month import script that will
> happily tick away in the background, I'd say go for it, and add more
> sophisticated things later.
>
> If it will take a programmer's week to do, I'd say wait for the survey.
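>
> To make the "few hours" estimate concrete, a rough sketch of what such
> a monthly import could look like in MySQL, assuming the aggregates
> arrive as a tab-separated dump (file path, table, and column names are
> illustrative guesses, not an agreed format):
>
>   -- Hypothetical target table for monthly aggregates.
>   CREATE TABLE pageview_monthly (
>       project    VARCHAR(32)     NOT NULL,  -- e.g. 'en.wikipedia'
>       page_title VARBINARY(255)  NOT NULL,
>       view_month CHAR(7)         NOT NULL,  -- e.g. '2013-09'
>       view_count BIGINT UNSIGNED NOT NULL,
>       PRIMARY KEY (project, page_title, view_month)
>   );
>
>   -- One statement per monthly dump; a cron job could happily tick
>   -- away on these in the background.
>   LOAD DATA INFILE '/data/pagecounts-2013-09.tsv'
>   INTO TABLE pageview_monthly
>   FIELDS TERMINATED BY '\t'
>   (project, page_title, view_month, view_count);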
>
>
> On Wed, Oct 2, 2013 at 6:10 PM, Dario Taraborelli <
> dtaraborelli(a)wikimedia.org> wrote:
>
>> I think before we settle on a specific data store, we should determine
>> which top queries people are interested in running, whether they expect
>> scripted access to this data or primarily a tool designed for human
>> access, and whether applying a threshold and cutting the long tail of
>> low-traffic articles is a good approach for most consumers of this data.
>>
>> The GLAM case described by Magnus is pretty well-defined, but I'd like
>> to point out that:
>> • a large number of Wikipedias point to stats.grok.se from the history
>> page of every single article
>> • most researchers I've been talking to are interested in daily or
>> hourly pv data per article
>> • tools with a large user base like
>> https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_pages refresh
>> pv data on a weekly basis
>>
>> Should we list the requirements for different use cases on a wiki page
>> where a larger number of people than the participants in this thread can
>> voice their needs?
>>
>> Dario
>>
>> On Oct 2, 2013, at 8:16 AM, Dan Andreescu <dandreescu(a)wikimedia.org>
>> wrote:
>>
>> On Wed, Oct 2, 2013 at 5:16 AM, Federico Leva (Nemo)
>> <nemowiki(a)gmail.com> wrote:
>>
>>> Magnus Manske, 02/10/2013 10:12:
>>>
>>>> Depending on the absolute value of "all costs", I'd prefer #1, or a
>>>> combination of #2.
>>>>
>>>> For GLAM (which is what I am mostly involved in), monthly page views
>>>> would suffice, and those should be easily done in MySQL.
>>>>
>>>> Daily views would be nice-to-have, but do not need to be in MySQL.
>>>> [...]
>>>>
>>>
>>> I'd second this. We have partners (but also, say, internal
>>> WikiProjects) working with their own projects on a long tail of tens
>>> or hundreds of thousands of pages: cutting this long tail, including
>>> redlinks, would be a greater loss than a decrease in resolution.
>>
>>
>>
>> Thank you both for the response, this is very useful to know. If I'm
>> hearing people correctly so far:
>>
>> * reduced resolution is OK, handle requests for higher resolution data
>> further down the line.
>> * hacking the data to reduce size is OK if needed, but preferably the
>> hacks should not be lossy.
>> * a database is not absolutely 100% necessary but is preferred.
>>
>> If that's right, I have an additional question: would a non-relational
>> database be acceptable? I'm not saying we're planning this, just
>> wondering what people think. If, for example, the data were available
>> in a public Cassandra cluster, would people be willing to learn how
>> CQL [1] works?
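>>
>> To give a feel for it, a minimal sketch of what the daily data could
>> look like in CQL (keyspace, table, and column names are illustrative
>> assumptions, not a plan):
>>
>>   CREATE TABLE pageviews.daily_views (
>>       project    text,
>>       page_title text,
>>       view_date  timestamp,
>>       view_count bigint,
>>       -- Partitioning by (project, page_title) makes a per-article
>>       -- date-range scan a single-partition read.
>>       PRIMARY KEY ((project, page_title), view_date)
>>   );
>>
>>   SELECT view_date, view_count
>>   FROM pageviews.daily_views
>>   WHERE project = 'en.wikipedia'
>>     AND page_title = 'Rosetta_Stone'
>>     AND view_date >= '2013-01-01'
>>     AND view_date <= '2013-12-31';
>>
>> Syntactically it is close to SQL; the main difference is that queries
>> have to follow the partition key layout.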
>>
>>
>> [1] - http://cassandra.apache.org/doc/cql/CQL.html
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics