Heya,
We spoke a little bit more about getting a queryable public interface for pageview data up and running and we decided the following:
1) Start importing webstatscollector pageview daily data for 2013 into mysql running on labs (not yet scheduled in a sprint)
2) Make simple datawarehouse schema for mysql db (based on the current webstatscollector datafiles)
Page Table
==========
page_id
page title

Fact Table
==========
fact_id
page_id (FK -> Page Table)
pageview date
pageview count
bytes served
3) Collect more datapoints to determine how high a priority mobile site article pageview counts are, so we can decide whether or not to add this to webstatscollector.
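To make the proposed schema a bit more concrete, here is a rough DDL sketch of the two tables. Column names and exact types are illustrative guesses rather than the agreed schema, and SQLite is used only so the snippet runs stand-alone; the actual target is a MySQL instance on Labs.

```python
import sqlite3

# Illustrative sketch only: names and types are guesses, not the final schema.
DDL = """
CREATE TABLE page (
    page_id    INTEGER PRIMARY KEY,    -- surrogate key for an article
    page_title VARCHAR(255) NOT NULL   -- title as it appears in the pagecounts files
);

CREATE TABLE fact (
    fact_id        INTEGER PRIMARY KEY,                        -- row id
    page_id        INTEGER NOT NULL REFERENCES page(page_id),
    pageview_date  DATE    NOT NULL,                           -- one row per page per day
    pageview_count INTEGER NOT NULL,                           -- daily view total
    bytes_served   BIGINT  NOT NULL                            -- daily bytes total
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)  # parse-check the sketch locally
```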
Best, Diederik
Hi,
today there were some guesses about how much data we are actually talking about for the “queryable public interface for pageview data”. As no one followed up on that, I did some quick “back of the envelope” computations.
The pagecounts files now come with ~7-9M data points/hour.
As we talked about computing "per day" data, some readings snuggle up and collapse, and for the two days that I computed I arrived at ~70M datapoints/day.
So each day we can expect ~70M new data points = ~70M new rows.
Now to the per row cost:
On Mon, Sep 30, 2013 at 01:57:58PM -0700, Diederik van Liere wrote:
- Make simple datawarehouse schema for mysql db (based on the current
webstatscollector datafiles)
Page Table
page_id page title
Let's ignore that table for now.
Fact Table
fact_id
Let's ignore "fact_id" as it's more than the bare pagecount files provide.
page_id (FK -> Page Table)
As we're seeing ~70M different rows/day, MEDIUMINT is too small. With an INT foreign key, we may also run out of numbers soon. But let's also ignore that for now, and settle on INT = 4 bytes.
pageview date
DATE is 3 bytes
pageview count
Zipping through some random pagecount files, I saw values up to ~17M/hour. That does not fit into MEDIUMINT, so we have to take at least INT = 4 bytes. That also suffices for the per day sums.
bytes served
Zipping through some random pagecount files, I saw values up to ~54G/hour. That's too big for INT. So let's settle for BIGINT = 8 bytes. That also suffices for the per day sums.
Summing up, we have 4+3+4+8 = 19 bytes per data point.
19 bytes * 70M * 365 days = 485 GB in 1 year.
And here the 19 bytes are really minimal: it will suffer from overrunning ids, it does not include a row id, and 70M/day is a very, very conservative estimate. And the total size is without the page table, without indices, and without all the other stuff that comes on top.
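As a quick, purely illustrative sanity check on those numbers (the 4+3+4+8 layout packed without padding):

```python
import struct

# page_id (4-byte INT), date (3 bytes, like MySQL's DATE), count (4-byte INT),
# bytes served (8-byte BIGINT); '<' means standard sizes, no padding.
ROW_FORMAT = "<I3sIQ"

row_bytes = struct.calcsize(ROW_FORMAT)     # 4 + 3 + 4 + 8 = 19
per_year = row_bytes * 70_000_000 * 365     # ~70M new rows/day for one year

print(row_bytes)            # 19
print(per_year / 10**9)     # ~485.5 (decimal GB, still without any indices or overhead)
```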
Have fun, Christian
P.S.: There was some discussion about whether or not we should drop "bytes served", as "no one uses it". I am glad we did not second-guess the requirements and took the field in, as that bought us a nice argument for free if we decide to drop/discuss it: "bytes served" is worth ~200GB in 1 year when modelled directly.
While such a decrease would help, it would not solve the problem, as ~285 GB data per year remain.
P.P.S.: Hope is not lost :-) There are of course thousands of ways to get the size down.
300GB per year should be trivial. It might increase slightly if we decide to use a relational database for everything; daily data could live in special flat files, and monthly could be in SQL.
In terms of hacks, we can use 2 bytes per date if we count days from Jan 1 2001 (day 1 A.W. = ante Wikipedia ;-) ), which would last for ~180 years. Down to 10 bytes already (counting from the ~11 bytes/row that remain once "bytes served" is dropped)!
And how about two tables, one with 4-byte view counts for the few "much-viewed" pages, and one with 2-byte counts (<65,536 views/day) for the majority? That probably saves another ~1.9 bytes/row, at the cost of more complex low-level queries.
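A small illustration of the 2-byte date idea (just a sketch; the epoch is the one suggested above, while the function names and the 1-based offset are illustrative choices):

```python
from datetime import date, timedelta

EPOCH = date(2001, 1, 1)  # "day 1 A.W." as suggested above

def encode_day(d: date) -> int:
    """Days since the epoch, 1-based; fits into an unsigned 2-byte column."""
    return (d - EPOCH).days + 1

def decode_day(n: int) -> date:
    return EPOCH + timedelta(days=n - 1)

print(encode_day(date(2013, 10, 2)))  # 4658
print(decode_day(65535))              # 2180-06-05, i.e. roughly 180 years of headroom
```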
There is a lot of cruft in the data files: requests for non-existing pages, e.g. due to typos.
I saved about 1/3 on the monthly aggregates by putting in a threshold of 5+ views per month.
In a database this could be done by a monthly cleanup job, which runs soon after the month is complete.
This job calculates monthly totals and deletes articles for which the total is below the threshold.
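A minimal sketch of such a cleanup job; table and column names are illustrative, the threshold is the 5-views-per-month example above, and SQLite is used only so the snippet is self-contained (the real job would run on a schedule against the MySQL database):

```python
import sqlite3

THRESHOLD = 5        # keep only pages with at least 5 views in the month
MONTH = "2013-08"    # run shortly after the month is complete

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (page_id INT, pageview_date DATE, pageview_count INT)")

# Delete every row of the month for pages whose monthly total is below the threshold.
conn.execute(
    """
    DELETE FROM fact
    WHERE strftime('%Y-%m', pageview_date) = :month
      AND page_id IN (
            SELECT page_id
            FROM fact
            WHERE strftime('%Y-%m', pageview_date) = :month
            GROUP BY page_id
            HAVING SUM(pageview_count) < :threshold)
    """,
    {"month": MONTH, "threshold": THRESHOLD},
)
```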
Erik
Hi Erik,
On Wed, Oct 02, 2013 at 01:38:02AM +0200, Erik Zachte wrote:
There is a lot of cruft in the data files.
Ha! That's a good point.
I saved about 1/3 on the monthly aggregates by putting in a threshold of 5+ views per month.
Has anybody tried putting a threshold on the pagecount data? What was the outcome?
Best regards, Christian
Hi,
On Wed, Oct 02, 2013 at 12:10:51AM +0100, Magnus Manske wrote:
300GB per year should be trivial. It might increase slightly if we decide to use a relational database for everything; [...]
I wholeheartedly agree. Relational databases come with many features and downsides that we do not need in our setting, and that are in fact even in the way. However, up to now the request has been to get the data into MySQL. And there were some voices that are a bit concerned about seeing even 300GB in a MySQL database.
But I am not sure what parts of the requirement are set in stone :-)
[ exploiting structure in data ]
Yes, I completely agree. The data comes with lots and lots of structure. We can easily get, for example, the date down to way below 1 byte. And the page counts, too, are screaming to have their structure exploited.
However:
[...] at the cost of more complex low-level queries.
that's the real problem. Let's find a sweet spot that still allows queries to be performed easily enough, while keeping data sizes manageable on the backend. Give us some ammunition to discuss the requirements :-)
* Do we need to have all of the data in MySQL in a trivial schema at all costs, or
* is it better to have the data at least in MySQL, but with a not totally trivial way to query it, or
* would it be even better to have a nice clean interface to the data that you maybe cannot query by SQL, but that allows you to formulate queries in a straightforward way?
What would you prefer?
Best regards, Christian
On Wed, Oct 2, 2013 at 8:51 AM, Christian Aistleitner <christian@quelltextlich.at> wrote:
- Do we need to have all of the data in MySQL in a trivial schema at all costs, or
- is it better to have the data at least in MySQL but with a not totally trivial way to query, or
- would it be even better to have a nice clean interface to the data that you maybe cannot query by SQL but allows to formulate queries in a straight forward way?
What would you prefer?
Depending on the absolute value of "all costs", I'd prefer #1, or a combination of #2.
For GLAM (which is what I am mostly involved in), monthly page views would suffice, and those should be easily done in MySQL.
Daily views would be nice-to-have, but do not need to be in MySQL. They could even be ticket-based: I submit a request for daily views for month X for a few thousand pages, and get a ticket ID back; then I can periodically check whether the ticket is finished, and pick up the cached results.
If the issue is not storage (and it shouldn't be!) but rather MySQL database size, how about a new database for each month, covering all projects? Even SQLite might be a good alternative, allowing for server-less SQL.
Cheers, Magnus
Magnus Manske, 02/10/2013 10:12:
Depending on the absolute value of "all costs", I'd prefer #1, or a combination of #2.
For GLAM (which is what I am mostly involved in), monthly page views would suffice, and those should be easily done in MySQL.
Daily views would be nice-to-have, but do not need to be in MySQL. [...]
I'd second this. We have partners (but also, say, internal WikiProjects) working on a long tail of tens or hundreds of thousands of pages with their own projects: cutting this long tail, including redlinks, would be a bigger loss than a decrease in resolution.
Nemo
Thank you both for the response, this is very useful to know. If I'm hearing people correctly so far:
* reduced resolution is OK, handle requests for higher resolution data further down the line.
* hacking the data to reduce size is OK if needed, but preferably the hacks should not be lossy.
* a database is not absolutely 100% necessary but is preferred.
If that's right, I have an additional question: would a non-relational database be acceptable? I'm not saying we're planning this, just wondering what people think. If, for example, the data were available in a public Cassandra cluster, would people be willing to learn how CQL [1] works?
[1] - http://cassandra.apache.org/doc/cql/CQL.html
Here is a shameless plug for existing data files in a highly condensed format, which could serve as input for whatever database we choose.
Daily and monthly aggregated files are available.
http://dumps.wikimedia.org/other/pagecounts-ez/merged/
Daily files have all cruft included. Only monthly files are topped off (and that could be changed of course).
http://dumps.wikimedia.org/other/pagecounts-ez/merged/2013/2013-08/
Monthly files come in two variations:
1) article title and monthly total only
2) same, plus hourly data for one month in one long string, in a highly compressed manner, comma separated
Hour: from 0 to 23, written as 0 = A, 1 = B ... 22 = W, 23 = X
Day: from 1 to 31, written as 1 = A, 2 = B ... 25 = Y, 26 = Z, 27 = [, 28 = \, 29 = ], 30 = ^, 31 = _
so 33 views on day 2, hour 4, and 55 on the same day, hour 7, becomes BD33,BG55,
this obviously needs machine unpacking when used
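A minimal sketch of such an unpacker, following the letter mappings described above (the function name is illustrative; note the hour letters are zero-based):

```python
def unpack_hourly(encoded: str):
    """Turn e.g. 'BD33,BG55,' into (day, hour, count) triples."""
    triples = []
    for token in encoded.split(","):
        if not token:
            continue                         # the trailing comma leaves an empty token
        day = ord(token[0]) - ord("A") + 1   # 'A' = day 1 ... '_' = day 31
        hour = ord(token[1]) - ord("A")      # 'A' = hour 0 ... 'X' = hour 23
        triples.append((day, hour, int(token[2:])))
    return triples

print(unpack_hourly("BD33,BG55,"))  # [(2, 3, 33), (2, 6, 55)]
```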
One goody of these files: they detect missing input files and correct monthly counts to make up for this.
PS
Ah the dreaded demo effect ;-) I will see why files for the last 40 days have not yet been generated.
Erik
Hm, I don't think we will have much trouble with the size of the input. We're currently thinking of processing the hourly data through Hadoop, and that shouldn't even blink at a few TB of data per day. What we'd like to come to consensus on is the most useful output format. So far, I'm hearing that monthly aggregates by page in a MySQL database is the bare minimum we should release on day 1. We can then iterate and add any useful dimensions to this data (like category information) or increase the resolution in parallel tables. If the data becomes too large for MySQL we can look at other databases to store the data in.
One goody of these files: they detect missing input files and correct monthly counts to make up for this.
This is something we definitely want to port to Hadoop / use in some way.
Hm, I don't think we will have much trouble with the size of the input.
Well my post was also about how to store hourly data in a concise manner (sparse array really), so we could serve hourly precision without too much overhead.
Erik
Well, I think your files do that pretty well, no need to duplicate that work. The main desire here seems to be for a queryable database with as much data as possible. I think the idea is to have a reliable datasource on top of which something like stats.grok.se can be built. Sure we can build this on top of flat files, but it sounds like people would rather deal with a database.
That said, I think the database would be isomorphic to your sparse array format, because it wouldn't store a cross product of pages and hours. It would just have rows where data exists. It would repeat the "page_id" column, sure, but maybe hierarchical databases could help with that.
Dan
Erik Zachte, 02/10/2013 18:03:
Here is a shameless plug for existing data files in a highly condensed format, which could serve as input for whatever database we choose.
Daily and monthly aggregated files are available.
Ah, I thought this had disappeared, nice. Can you also get the main page to be updated? Currently we get to http://dumps.wikimedia.org/other/pagecounts-ez/ which has only broken links and warnings in red.
Nemo
Good point, Nemo. Done.
http://dumps.wikimedia.org/other/pagecounts-ez/
Erik
Ah the dreaded demo effect ;-)
Turns out all files were generated, but rsync failed recently.
So all is up to date now. Monthly aggregation for Sep is running.
Erik
There's clearly value in monthly rollups and this should be manageable in MySQL. This is probably the fastest way to get something in the hands of our users.
Is this a worthwhile goal for the first iteration?
Once we see how people are using the data, we can start tackling the harder problems of how to store/serve/query more granular data.
-Toby
I think before we settle on a specific data store, we should determine what the top queries are that people are interested in running, whether they expect scripted access to this data or whether we should primarily design a tool for human access, and whether applying a threshold and cutting the long tail of low-traffic articles is a good approach for most consumers of this data.
The GLAM case described by Magnus is pretty well-defined, but I'd like to point out that:
• a large number of Wikipedias point to stats.grok.se from the history page of every single article
• most researchers I've been talking to are interested in daily or hourly pv data per article
• tools with a large user base like https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_pages refresh pv data on a weekly basis
Should we list the requirements for different use cases on a wiki page where a larger number of people than the participants in this thread can voice their needs?
Dario
I know I'm not completely unbiased here, but how long would a monthly-only SQL database take to create, compared to the "careful planning" approach?
If it takes a few hours to write a per-month import script that will happily tick away in the background, I'd say go for it, and add more sophisticated things later.
If it will take a programmer's week to do, I'd say wait for the survey.
I agree with Magnus; we decided to do a 'quick-and-dirty' approach that we can deliver in a single sprint (== 2 weeks). I think we defined the MVP as follows:
1) Import data at daily granularity -- yes, we are fully aware of requests for more fine-grained data
2) Import data only for 2013 -- yes, we are fully aware that people are likely to want to query the history
3) Import the data into a MySQL instance in Labs -- yes, this might not scale to many dimensions and/or have sufficient write performance
4) Import the data using a very simple schema as specified in https://mingle.corp.wikimedia.org/projects/analytics/cards/1195 (one fact table, and we can extend it with other dimensions easily)
5) Community members can request a read-only MySQL account to query the data
This is something I believe we can deliver in one sprint -- it just exposes the data as-is.
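For a flavour of what a read-only account could then run, here is a hypothetical example query against the simple fact/page schema sketched earlier in the thread (table and column names are guesses, not the schema on the Mingle card):

```python
# Hypothetical read-only query: daily 2013 view counts for a single article.
# Table and column names are illustrative guesses, not the agreed schema.
DAILY_VIEWS_2013 = """
SELECT f.pageview_date, f.pageview_count
FROM fact AS f
JOIN page AS p ON p.page_id = f.page_id
WHERE p.page_title = 'Main_Page'
  AND f.pageview_date BETWEEN '2013-01-01' AND '2013-12-31'
ORDER BY f.pageview_date;
"""
```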
There are many more requests:
1) Data granularity
2) Cleaning the current data
3) Historic data
4) API
5) etc., etc.
but let's deal with those issues as they are raised by real users. By all standards we would be almost ashamed of releasing this, and I think that's exactly the place we should aim for.
D
I think we're on the same page – I'm very much in favor of a quick and dirty release. I'm just saying that we cannot answer the SQL vs. NoSQL question, or decide which dimensions to expose, without a better understanding of what data people need: pushing data out and letting consumers play with it is by far the best approach to identifying requirements.
+1
Awesome, Diederik!
Those five points will solve my chapter's needs for at least the remainder of this year!
Best regards, Jan Ainali
Executive Director, Wikimedia Sverige http://se.wikimedia.org/wiki/Huvudsida 0729 - 67 29 48
This would be fantastic! I'll even volunteer to write API/web interface code for it if you can have it up and running in two weeks ;-)
(and yes, that was a request for a readonly account ;-)
Cheers, Magnus
On Wed, Oct 2, 2013 at 6:26 PM, Diederik van Liere dvanliere@wikimedia.orgwrote:
I agree with Magnus; we decided to do a 'quick-and-dirty' approach that we can deliver in a single sprint (== 2 weeks). I think we defined the MVP as follows:
- Import data at daily granularity -- yes we are fully aware of requests
for more fine-grained data 2) Import data only for 2013 --- yes we are fully aware that people are likely to want to query the history 3) Import the data into a MySQL instance in Labs -- yes this might not scale to many dimensions and/or has sufficient write performance 4) Import the data using a very simple schema as specified in https://mingle.corp.wikimedia.org/projects/analytics/cards/1195 (one fact table and whe can extend it with other dimensions easily) 5) Community members can request a readonly mysql account to query the data
This is something I believe we can deliver in one sprint -- it just exposes the data as-is.
There are many more requests:
- Data granularity
- Cleaning the current data
- Historic data
- API
- etc., etc.
but let's deal with those issues as they are raised by real users. By all standards we should be almost ashamed of releasing this, and I think that's exactly the place we should aim for.
D
On Wed, Oct 2, 2013 at 1:16 PM, Magnus Manske <magnusmanske@googlemail.com> wrote:
I know I'm not completely unbiased here, but how long would a monthly-only SQL database take to create, compared to the "careful planning" approach?
If it takes a few hours to write a per-month import script that will happily tick away in the background, I'd say go for it, and add more sophisticated things later.
If it will take a programmer's week to do, I'd say wait for the survey.
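For a sense of what the "few hours" option involves: a minimal per-month aggregation over the public pagecounts files could look like the sketch below. The file-name pattern and line format ("project title views bytes") follow the published dumps, but the directory path is a placeholder and the loading step into MySQL is left out.

    # Rough per-month aggregation sketch over the public pagecounts dumps.
    # The file-name pattern and line format follow the published dumps;
    # the directory path below is a placeholder.
    import glob
    import gzip
    from collections import defaultdict

    def aggregate_month(dump_dir, year, month):
        """Sum hourly pagecounts into monthly (views, bytes) per (project, title)."""
        totals = defaultdict(lambda: [0, 0])  # (project, title) -> [views, bytes]
        pattern = '%s/pagecounts-%04d%02d*.gz' % (dump_dir, year, month)
        for path in sorted(glob.glob(pattern)):
            with gzip.open(path, 'rt', errors='replace') as f:
                for line in f:
                    parts = line.split()
                    if len(parts) != 4:
                        continue  # skip malformed lines
                    project, title, views, byte_count = parts
                    entry = totals[(project, title)]
                    entry[0] += int(views)
                    entry[1] += int(byte_count)
        # Note: this keeps every (project, title) key for the month in memory;
        # good enough for a sketch, not for the full dataset on a small box.
        return totals

    if __name__ == '__main__':
        monthly = aggregate_month('/public/dumps/pagecounts-raw/2013/2013-09', 2013, 9)
        # These rows could then be bulk-loaded into a monthly MySQL table.
        for (project, title), (views, byte_count) in list(monthly.items())[:10]:
            print(project, title, views, byte_count)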
On Wed, Oct 2, 2013 at 6:10 PM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
I think before we settle on a specific data store, we should determine what the top queries are that people are interested in running, whether they expect scripted access to this data or primarily a tool designed for human access, and whether applying a threshold and cutting the long tail of low-traffic articles is a good approach for most consumers of this data. [...]
<bump>Soooo... still sprinting?</bump>