I know I'm not completely unbiased here, but how
long would a monthly-only
SQL database take to create, compared to the "careful planning" approach?
If it takes a few hours to write a per-month import script that will
happily tick away in the background, I'd say go for it, and add more
sophisticated things later.
If it will take a programmer's week to do, I'd say wait for the survey.
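To make the "few hours" estimate concrete, a per-month import could be little more than a loop over the hourly dump lines. A minimal sketch, assuming the familiar pagecounts-style line format ("project title views bytes") -- the function name and exact field order here are illustrative, not an agreed design:

```python
from collections import defaultdict

def aggregate_monthly(lines, month):
    """Sum hourly pagecounts-style lines ("project title views bytes")
    into per-article totals for one month."""
    totals = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines rather than abort the import
        project, title, views, _byte_count = parts
        totals[(project, title, month)] += int(views)
    return totals

# A few sample hourly lines for an imagined month:
sample = [
    "en Main_Page 42 123456",
    "en Main_Page 8 23456",
    "de Hauptseite 17 45678",
]
monthly = aggregate_monthly(sample, "2013-10")
# monthly[("en", "Main_Page", "2013-10")] == 50
```

The resulting totals would then be bulk-inserted into the database once per month, which is what lets it "tick away in the background".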
On Wed, Oct 2, 2013 at 6:10 PM, Dario Taraborelli <
dtaraborelli(a)wikimedia.org> wrote:
I think before we settle on a specific data
store, we should determine
what are the top queries people are interested in running, whether they
expect to have scripted access to this data or primarily design a tool for
human access and whether applying a threshold and cutting the long tail of
low-traffic articles is a good approach for most consumers of this data.
The GLAM case described by Magnus is pretty well-defined, but I'd like to
point out that:
• a large number of Wikipedias point to stats.grok.se from the history
page of every single article
• most researchers I've been talking to are interested in daily or hourly
pv data per article
• tools with a large user base like
https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_pages refresh
pv data on a weekly basis
Should we list the requirements for different use cases on a wiki page
where a larger number of people than the participants in this thread can
voice their needs?
Dario
On Oct 2, 2013, at 8:16 AM, Dan Andreescu <dandreescu(a)wikimedia.org>

wrote:
On Wed, Oct 2, 2013 at 5:16 AM, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
Magnus Manske, 02/10/2013 10:12:
Depending on the absolute value of "all
costs", I'd prefer #1, or a
combination of #2.
For GLAM (which is what I am mostly involved in), monthly page views
would suffice, and those should be easily done in MySQL.
Daily views would be nice-to-have, but do not need to be in MySQL. [...]
I'd second this. We have partners (but also, say, internal WikiProjects)
working on a long tail of tens or hundreds of thousands of pages with their own
project: cutting this long tail, including redlinks, would be a higher loss
than a decrease in resolution.
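For what it's worth, the monthly-granularity case Magnus describes really is small enough to sketch end to end. Here sqlite3 stands in for MySQL so the example is self-contained, and the table and column names are invented for illustration, not a proposed schema:

```python
import sqlite3

# In-memory database standing in for the real MySQL instance.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE monthly_views (
        project TEXT NOT NULL,
        title   TEXT NOT NULL,
        month   TEXT NOT NULL,   -- e.g. '2013-10'
        views   INTEGER NOT NULL,
        PRIMARY KEY (project, title, month)
    )
""")
conn.executemany(
    "INSERT INTO monthly_views VALUES (?, ?, ?, ?)",
    [
        ("en", "Main_Page", "2013-09", 120),
        ("en", "Main_Page", "2013-10", 150),
        ("en", "Some_GLAM_Item", "2013-10", 3),
    ],
)
# A GLAM-style report: per-title views for one month, with the
# long tail of low-traffic pages kept rather than cut off.
rows = conn.execute(
    "SELECT title, views FROM monthly_views "
    "WHERE project = 'en' AND month = '2013-10' "
    "ORDER BY views DESC"
).fetchall()
# rows == [("Main_Page", 150), ("Some_GLAM_Item", 3)]
```

Keeping every row, including the one-view pages, is exactly what avoids the "cutting the long tail" loss discussed above.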
Thank you both for the response, this is very useful to know. If I'm
hearing people correctly so far:
* reduced resolution is OK, handle requests for higher resolution data
further down the line.
* hacking the data to reduce size is OK if needed, but preferably the
hacks should not be lossy.
* a database is not absolutely 100% necessary but is preferred.
If that's right, I have an additional question: would a non-relational
database be acceptable? I'm not saying we're planning this, just wondering
what people think. If, for example, the data were available in a
public Cassandra cluster, would people be willing to learn how CQL
[1] works?
[1] -
http://cassandra.apache.org/doc/cql/CQL.html
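For a flavor of what that would involve: CQL reads a lot like SQL. A sketch of what per-article daily pageviews might look like -- the table and column names are made up for illustration, not a proposed schema:

```sql
-- Partition by (project, title) so one article's history lives together,
-- clustered by day so date-range scans are cheap.
CREATE TABLE pageviews_daily (
    project text,
    title   text,
    day     timestamp,
    views   bigint,
    PRIMARY KEY ((project, title), day)
);

-- Typical query: daily counts for one article over a date range.
SELECT day, views
  FROM pageviews_daily
 WHERE project = 'en' AND title = 'Main_Page'
   AND day >= '2013-09-01' AND day < '2013-10-01';
```

The main adjustment for SQL users is that queries must follow the primary key structure (filter on the partition key, range-scan only on clustering columns) rather than on arbitrary columns.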
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics