Hi all,
as you might know, I have a few GLAM-related tools on the toolserver. Some are updated once a month, some can be used live, but all are in high demand by GLAM institutions.
Now, the monthly updated stats have always been slow to run, but recently they almost ground to a halt. The on-demand tools have stalled completely.
All these tools get their data from stats.grok.se, which works well but is not exactly high-speed; my on-demand tools have apparently been shut out recently because too many people were using them, DDoSing the server :-(
I know you are working on page view numbers, and from what I gather it's already up and running internally. My requirements are simple: I have a list of pages on many Wikimedia projects; I need view counts for these pages for a specific month, per page.
Now, I know that there is no public API yet, but is there any way I can get to the data, at least for the monthly stats?
Cheers, Magnus
+1
I have received an increasing number of external requests for something more efficient than the stats.grok.se JSON interface and more user-friendly than Domas' hourly raw data. I am also one of the interested consumers of this data.
Diederik, any chance we could prioritize this request? I guess per-article and per-project daily/monthly pageviews would be the most useful aggregation level.
Hi Magnus,
Can you send me the list of pages for the Wikimedia projects that you are interested in? Once I know how many pages there are, I can come up with a solution. D
I'm just throwing a virtual thumbs up at Magnus for pointing this out. It has unfortunately become an increasing problem for our GLAM projects. I appreciate you bringing it up, Magnus - and the quick responses.
Best, Lori
By the way, this is tracked at https://bugzilla.wikimedia.org/show_bug.cgi?id=42259 (Feel free to add links to the dozens of previous discussions on the topic.)
Nemo
Hi all,
I am in regular contact with the GLAM/toolset Europeana project and so I will make sure that your needs are addressed. I will definitely try to help Magnus out in the short term as well!
best, Diederik
Hi Diederik,
in principle, all of the Wikimedia projects; currently, all listed at http://stats.grok.se/, which is the top 100 or so Wikipedias. Plus Commons, if possible.
As for the number of pages on those, that seems to fluctuate (probably just my scripts breaking on slow/missing data, occasionally); the largest count I can find is for May 2012, with ~350K pages. But this only needs to run once a month. I could even give you the list if you like, and you can extract the data for me ;-)
But that would certainly not be a long-term, scalable solution. An SQL interface on the toolserver would be ideal; a speedy HTTP-based API would be good as well (maybe even better, as it would not require the toolserver ;-), especially if it can take chunks of data (e.g. POST requests with a 1K article list), so that I don't have to fire thousands of tiny queries.
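(For illustration only, a rough sketch of what such a chunked request loop could look like from the client side; the endpoint URL, parameter names, and response format below are hypothetical placeholders, not an existing API.)

import requests

API_URL = "https://pageviews.example.org/monthly"  # hypothetical endpoint

def fetch_monthly_counts(project, titles, month, chunk_size=1000):
    """Fetch monthly view counts for many pages in chunks of ~1000 titles,
    instead of firing one tiny request per page."""
    counts = {}
    for i in range(0, len(titles), chunk_size):
        chunk = titles[i:i + chunk_size]
        resp = requests.post(API_URL, data={
            "project": project,          # e.g. "en.wikipedia"
            "month": month,              # e.g. "2012-05"
            "titles": "\n".join(chunk),  # one title per line
        })
        resp.raise_for_status()
        counts.update(resp.json())       # assumed {title: view_count} response
    return counts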
Cheers, Magnus
I have code to aggregate Domas' hourly files into a daily file, and later a monthly file, while still retaining full hourly resolution.
It has been an Xmas holiday pastime and is still a bit buggy, but I can up the priority to fix this.
Intro:
http://lists.wikimedia.org/pipermail/wikitech-l/2011-August/054590.html
Data:
http://dumps.wikimedia.org/other/pagecounts-ez/monthly/
(ge5 is a subset containing only pages with 5+ views per month, which makes a big difference in file size)
Would this be useful for you, Magnus?
Erik
Hi Erik,
in principle, yes, that would be useful. However:
* I would mostly need "last month" on a continued basis, at the moment stretching back to September 2012, I believe
* As a flat file it's not seekable, which means I would have to run through the entire thing for each of my ~50 page sets, or keep all 50 in memory; neither of which is appealing
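(For concreteness, a single pass over such a monthly flat file, collecting counts for all page sets at once, might look roughly like this sketch; the space-separated "project title count" line layout is an assumption and would need checking against the actual pagecounts-ez format.)

import gzip
from collections import defaultdict

def counts_for_page_sets(monthly_file, page_sets):
    """One pass over a gzipped monthly aggregate file, collecting counts for
    several page sets at once instead of re-reading the file per set.
    page_sets maps a set name to a set of (project, title) tuples."""
    wanted = {key for pages in page_sets.values() for key in pages}
    totals = defaultdict(dict)
    with gzip.open(monthly_file, "rt") as f:  # Python 3; on Python 2 use "rb" and decode
        for line in f:
            parts = line.split(" ")
            if len(parts) < 3:
                continue  # skip malformed lines
            key = (parts[0], parts[1])
            if key in wanted:
                for name, pages in page_sets.items():
                    if key in pages:
                        totals[name][key] = int(parts[2])
    return totals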
Maybe I could read such a file into a toolserver database? It would be a duplication of effort, and add load to the toolserver, but hey ;-)
Cheers, Magnus
Hi Magnus,
I discussed this with Diederik yesterday and we came up with the following proposal:
• Import Domas' hourly pageview data into the cluster on a daily basis
• Run a daily Pig script (driven by Oozie) to get total pageview counts per article
• Load the data into a MySQL table on one of the internal data analysis DBs
• Use web.py as the framework to expose the data via JSON (no visualization) on stat1001
This would allow us to publish the per-article pageview data that you and others need with a reasonable frequency (assuming that people interested in hourly data will still use the raw dumps instead of this simple API).
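(As a very rough illustration of the last step, a minimal web.py endpoint could look something like the sketch below; the table name, columns, and database credentials are placeholders rather than a description of the actual setup.)

import json
import web

urls = ('/pageviews/([^/]+)/(.+)', 'DailyViews')
# placeholder credentials and schema; not the real analytics DB
db = web.database(dbn='mysql', db='pageviews', user='reader', pw='secret')

class DailyViews:
    def GET(self, project, title):
        # e.g. GET /pageviews/en.wikipedia/Main_Page?day=2012-12-01
        params = web.input(day=None)
        rows = db.select('daily_pageviews',
                         where='project = $project AND title = $title AND day = $day',
                         vars={'project': project, 'title': title, 'day': params.day})
        web.header('Content-Type', 'application/json')
        return json.dumps([dict(r) for r in rows], default=str)

if __name__ == '__main__':
    web.application(urls, globals()).run()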
Would that work for you guys?
Dario
Hi Dario,
that would be fantastic! Throw in JSONP for the live tool, and I'm in stats heaven ;-)
(especially if it can serve more than one request/sec!!!)
Thanks for the quick reaction, Magnus
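(For reference, JSONP is just the JSON payload wrapped in a caller-supplied callback name; a sketch of how an endpoint could honour such a parameter, with purely illustrative names, follows.)

import json

def render(data, callback=None):
    """Return plain JSON, or JSONP when a callback name is supplied,
    e.g. ?callback=handleViews for cross-domain use from a browser tool."""
    body = json.dumps(data)
    if callback:
        # serve with Content-Type: application/javascript
        return "%s(%s);" % (callback, body)
    # serve with Content-Type: application/json
    return body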
Hey Magnus,
The Analytics Team will be present at the Amsterdam Hackathon, will you be there? One of the things I was considering was to set up a new instance of the code that is running on stats.grok.se, so you could hammer one of our servers and we wouldn't complain about it. D
Hi Diederik,
On Thu, May 9, 2013 at 11:55 PM, Diederik van Liere dvanliere@wikimedia.org wrote:
Hey Magnus,
The Analytics Team will be present at the Amsterdam Hackathon, will you be there?
Sadly, no, I won't.
One of the things I was considering was to set up a new instance of the code that is running on stats.grok.se, so you could hammer one of our servers and we wouldn't complain about it.
That would be fantastic! At the moment, my GLAM tool on the toolserver cannot update anymore, because they refuse to give me enough database connections for my tools [1]. GLAM people (and my tool users in general) are not amused.
Running the update tool against a local API, or better, a database, would make these things much easier and faster. I could even set it up to get category information from the Wiki(p|m)edia API, so it could run on Labs even before the DB mirrors are available.
Cheers, Magnus
[bumping thread]
There have been requests on cultural-partners and wikitech mailing lists about this again. Is there a timetable to get /any/ page view stats exposed on labs tools?
Cheers, Magnus
So it's been six months since my original request here. People keep telling me how badly my toolserver GLAM tools are degrading, and it's very frustrating to have to tell them that I can't do much about it.
Any update on this? Anything?
Hey Magnus,
Do you have experience with writing puppet manifests? If so, then I can pair you up with one of the analytics team members. We are currently severely understaffed, which keeps delaying this.
Maybe we can Skype next week so I can explain the larger context.
Best Diederik
Sent from my iPhone
On Sat, Jun 15, 2013 at 6:28 PM, Diederik van Liere dvanliere@wikimedia.org wrote:
Hey Magnus,
Hi Diederik,
Do you have experience with writing puppet manifests? If so, then I can pair you up with one of the analytics team members. We are currently severely understaffed, which keeps delaying this.
I know roughly what they are for, but have no experience in writing them. Given some pointers, I could figure it out, though I'm not sure why. I do have access to Tools Labs, so if there is view data on Labs somewhere, I'm sure it could be exposed (anonymized) similar to the wiki(m|p)edia databases; not sure why that would require a puppetized setup. Unless you want me to write the server myself, that is ;-)
Maybe we can Skype next week so I can explain the larger context.
Sure! Best would probably be my evening, which would be 11am PST or later.
Cheers, Magnus
I can help with Puppet stuff, but I'm also not sure how it's related. Diederik, what needs to be Puppetized? An rsync from one of the production hosts to labs?
Ori
A web.py / MySQL app needs puppetization, plus a script that rsyncs from dataset2 to stat1001, plus a single-file C++ program. And perhaps the Debianization of some Python libraries; not sure though. D
Sent from my iPhone
The reason for the puppetization is so we can run the code in production - that's ops' rule. Now, I'm not familiar with Magnus's app, but can't it run in labs using labsdb? Either way, I can help with this during off-hours; you've waited long enough, Magnus.
Dan
AFAIK here we're discussing how to set up a stats.grok.se clone in the WMF cluster. So the requirements are: 1) a machine with 8 TB of disk (for the DB) and 12 GB of RAM or so, and 2) whatever it takes to run the code at https://github.com/abelsson/stats.grok.se
Nemo
I think we can do with less hardware; initially we can keep only the data for the last 3 months, and I don't think it needs 12 GB for a simple web app. D
Sent from my iPhone
Well, that's what Henrik uses on his own. I hope the WMF can offer a machine that performs no worse than a three-year-old server offered by a volunteer on his own.
Nemo
Well Nemo, don't jump to conclusions too soon. D — Sent from Mailbox for iPhone
Apologies if my request caused confusion as to what I actually want. I'll try to explain.
Right now, I have several GLAM tools on the toolserver that deal with (detailed and aggregated) view counts. They get their data from stats.grok.se, some via PHP/shell, others via JavaScript/JSONP.
I have an account with tools labs, and have ported several tools there already [1]. I really want to port the GLAM tools as well, but I see little point in doing so as long as stats.grok.se remains a bottleneck.
So, a 1:1 "mirror" of stats.grok.se would work for me; however, I think this can be done more efficiently and with fewer resources. Thus, /ideally/, I would like a read-only MySQL database, accessible on tools labs, that has data on how often every page, in every project, was viewed on each day. Monthly view counts would be OK as well; hourly would be fine but too detailed for my purposes (though other people might want that eventually).
Therefore, my thought was that /if/ this data already exists as a database on the labs infrastructure (which I understood is the case), would it be possible to expose that part of it, similar to the way a part of the wiki(p|m)edia projects are exposed (e.g. user data is filtered out).
So I don't need some giant machine to run scripts on; tools labs will do just fine for my purposes. The database is what I'm after ;-)
Thanks, Magnus
[1] tools.wmflabs.org/magnustools
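(To make the idea concrete: querying such a read-only table from a Tools Labs script could look roughly like the sketch below; the host, database, and table names are entirely hypothetical.)

import os
import MySQLdb

# hypothetical host/database/table names, for illustration only
conn = MySQLdb.connect(host='pageviews.labsdb', db='pageviews_p',
                       read_default_file=os.path.expanduser('~/replica.my.cnf'))
cur = conn.cursor()
cur.execute("""
    SELECT day, views
    FROM daily_views
    WHERE project = %s AND title = %s
      AND day BETWEEN %s AND %s
""", ('en.wikipedia', 'Main_Page', '2012-12-01', '2012-12-31'))
for day, views in cur.fetchall():
    print(day, views)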
The database doesn't exist in Labs. They have a copy of the raw data, mostly so that Hydriz can upload it to archive.org.
Nemo
Is this a migration (largely as-is) or a mini-project to adapt the stats.grok.se code to Kraken (translate it to Pig rather than just build a VM to run the front end and back end)?
I would not have thought that labsDB clones would have this data...
Just curious about the process for defining requirements when moving something like this (existing functionality).
--Michael
Yes, yes, yes! This would be fantastic!
Hi,
even though I have not really been involved until now, I think I might be able to offer a hand here. I needed some "Top Pages" for a small game I developed a while ago for my master's thesis, and stats.grok.se was non-functional at that time. So I started to run some simple scripts to aggregate Domas' hourly files myself.
I have continued running the scripts every couple of days as a side project ever since, keeping all the raw files and fixing several bugs over time, so from what I can see it's pretty solid now.
The last month(s) is/are not a problem, as the scripts aggregate to days, months, years, and all-time. For example, http://en.wikipedia.org/wiki/Main_Page was accessed 10897800257 times aggregated over all files from 12.2007 till yesterday, 2592821012 times in 2012 till yesterday, 268588331 times in 11.2012, and 27076315 times in 12.2012 till yesterday. (For today I'd need to wait for another 20 minutes, as the downstream to our university seems to be capped.)
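Roughly, the aggregation works like the following sketch. File names, paths and the handling of malformed lines here are illustrative rather than my actual scripts; Domas' hourly files are assumed to contain lines of the form "project page_title count bytes":

import glob
import gzip
from collections import Counter

def aggregate_month(directory, year, month):
    """Sum per-page view counts over all hourly files of one month."""
    totals = Counter()
    pattern = f"{directory}/pagecounts-{year:04d}{month:02d}*.gz"
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split()
                # assumed line layout: project page_title count bytes
                if len(parts) != 4 or not parts[2].isdigit():
                    continue
                totals[(parts[0], parts[1])] += int(parts[2])
    return totals

# totals = aggregate_month("/data/dumps", 2012, 11)
# print(totals[("en", "Main_Page")])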
So I guess I pretty much already have the aggregates you're interested in, and I would be happy to get involved and finally give something back to Wikimedia. As I seem to be a lousy editor, why not like this ;).
Cheers, Jörn
On 04.12.2012, at 10:07, Magnus Manske magnusmanske@googlemail.com wrote:
Hi Erik,
in principle, yes, that would be useful. However:
- I would mostly need "last month" on a continued basis, at the moment stretching back to September 2012 I believe
- As a flat file it's not seekable, which means I would have to run through the entire thing for each of my ~50 page sets, or keep all 50 in memory, neither of which is appealing
Maybe I could read such a file into a toolserver database? It would be a duplication of effort, and add load to the toolserver, but hey ;-)
Cheers, Magnus
The compaction of hourly page request files into daily files, and then daily into monthly files, is operational.
Dec 2012:
- Hourly files: 65 GB
- Daily files: 18 GB
- Monthly file: 5 GB
Space is saved as follows:
1) each article title occurs only once instead of up to 744 times (31 days x 24 hours)
2) bz2 compression
3) a threshold of 5+ requests per month in the final monthly file
Still, all versions retain hourly resolution. Each file starts with comments on the file format (in a nutshell: sparse indexing, where day and hour are each encoded as a letter, followed by the count).
http://dumps.wikimedia.org/other/pagecounts-ez/merged/
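For anyone who wants to read the merged files programmatically, here is a hedged guess at decoding that sparse hourly string, assuming uppercase letters mark the day (A = day 1), lowercase letters the hour (a = hour 0), and each hour letter is followed by its decimal count. The comment header in each file is the authoritative format description:

import re

# A token is either an uppercase day letter, or a lowercase hour letter
# followed by its decimal count (assumed layout, see the note above).
TOKEN = re.compile(r"([A-Z])|([a-z])(\d+)")

def decode_hourly(encoded):
    """Return {(day, hour): count}, e.g. 'Ab5c3Bd10' -> {(1,1):5, (1,2):3, (2,3):10}."""
    counts = {}
    day = None
    for day_letter, hour_letter, num in TOKEN.findall(encoded):
        if day_letter:
            day = ord(day_letter) - ord("A") + 1   # A = day 1 (assumption)
        elif day is not None:
            counts[(day, ord(hour_letter) - ord("a"))] = int(num)  # a = hour 0 (assumption)
    return counts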
As a spin-off, the new data stream is also used for a new monthly page request report for all 800 wikis, e.g. English Wikipedia at http://tinyurl.com/bbnch45 and the full list at http://tinyurl.com/cq4rfla (alas, no friendly front-end yet).
This may or may not be suitable for Magnus, but it is useful in its own right, e.g. for easy external archiving for posterity (Internet Archive), like the Library of Congress tweet archive.
Erik Zachte
On 04.12.2012, at 10:07, Magnus Manske magnusmanske@googlemail.com wrote:
Hi Erik,
in principle, yes, that would be useful. However:
- I would mostly need "last month" on a continued basis, at the moment
stretching back to September 2012 I believe
- As a flat file it's not seekable, which means I would have to run through the entire thing for each of my ~50 page sets, or keep all 50 in memory, neither of which is appealing
Maybe I could read such a file into a toolserver database? It would be a
duplication of effort, and add load to the toolserver, but hey ;-)
Cheers, Magnus
_______________________________________________ Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics