Hey Magnus,
The Analytics Team will be present at the Amsterdam Hackathon; will you be there? One of the things I was considering was to set up a new instance of the code that is running on stats.grok.se, so you could hammer one of our servers and we wouldn't complain about it.
D
On Tue, Dec 4, 2012 at 5:51 PM, Magnus Manske <magnusmanske@googlemail.com> wrote:
Hi Dario,
that would be fantastic! Throw in JSONP for the live tool, and I'm in stats-heaven ;-)
(especially if it can serve more than 1 request/sec!!!)
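As an aside on what JSONP support would involve: a minimal sketch of wrapping the JSON payload in a caller-supplied callback, so a browser-based live tool can load it via a script tag. The parameter name and payload fields are illustrative assumptions, not an existing interface.

import json

def render(payload, callback=None):
    """Return plain JSON, or a JSONP body if a callback name was requested.

    The 'callback' parameter and the payload fields are assumptions for
    illustration; JSONP simply wraps the JSON in a function call so a
    cross-domain <script> tag can consume it.
    """
    body = json.dumps(payload)
    if callback:
        return "%s(%s);" % (callback, body)  # e.g. showStats({...});
    return body

# Plain JSON for server-side consumers, JSONP for the live browser tool:
print(render({"article": "Rembrandt", "views": 12345}))
print(render({"article": "Rembrandt", "views": 12345}, callback="showStats"))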
Thanks for the quick reaction, Magnus
On Tue, Dec 4, 2012 at 10:40 PM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
Hi Magnus,
I discussed this with Diederik yesterday and we came up with the following proposal:
• Import Domas' hourly pageview data into the cluster on a daily basis
• Run a daily Pig script to get total pv counts per article using Oozie
• Load data into a MySQL table on one of the internal data analysis DBs
• Use web.py as a framework to expose the data via JSON (no visualization) on stat1001
This would allow us to publish the per-article pv data that you and others need with a reasonable frequency (assuming that people interested in hourly data will still use the raw dumps instead of this simple API).
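To make the last step of that pipeline concrete, here is a minimal sketch of what such a web.py endpoint over the MySQL table could look like; the URL pattern, database credentials, and table/column names are illustrative assumptions, not the actual setup:

# Minimal web.py sketch: expose per-article daily pageview counts as JSON.
# Assumed table: pageviews(project, title, day, views) on an internal DB.
import json
import web

urls = ("/pageviews/(.+?)/(.+)", "PageViews")
db = web.database(dbn="mysql", db="analytics", user="stats", pw="...")

class PageViews:
    def GET(self, project, title):
        rows = db.select(
            "pageviews",
            where="project = $project AND title = $title",
            vars={"project": project, "title": title},
        )
        web.header("Content-Type", "application/json")
        return json.dumps([{"day": str(r.day), "views": r.views} for r in rows])

if __name__ == "__main__":
    web.application(urls, globals()).run()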
Would that work for you guys?
Dario
On Dec 4, 2012, at 1:07 AM, Magnus Manske <magnusmanske@googlemail.com> wrote:
Hi Erik,
in principle, yes, that would be useful. However:
- I would mostly need "last month" on a continued basis, at the moment stretching back to September 2012, I believe.
- As a flat file it's not seekable, which means I would have to run through the entire thing for each of my ~50 page sets, or keep all 50 in memory; neither of which is appealing.
Maybe I could read such a file into a toolserver database? It would be a duplication of effort, and add load to the toolserver, but hey ;-)
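As a rough illustration of that idea, a sketch of loading such a monthly aggregate file into a local database so per-page lookups no longer require scanning the whole file. The line layout (project, title, monthly total) is an assumption and may not match the actual pagecounts-ez format, and sqlite3 stands in here for a toolserver MySQL database:

# Rough sketch: import a monthly aggregate pageview file into a database.
# Assumed (possibly incorrect) line format: "<project> <title> <monthly_total> ..."
import gzip
import sqlite3

conn = sqlite3.connect("monthly_views.db")
conn.execute("""CREATE TABLE IF NOT EXISTS monthly_views
                (project TEXT, title TEXT, month TEXT, views INTEGER)""")

rows = []
with gzip.open("pagecounts-2012-09.gz", "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        parts = line.split()
        if len(parts) < 3 or not parts[2].isdigit():
            continue  # skip comments, headers, or malformed lines
        rows.append((parts[0], parts[1], "2012-09", int(parts[2])))
        if len(rows) >= 10000:  # insert in batches to keep memory use flat
            conn.executemany("INSERT INTO monthly_views VALUES (?, ?, ?, ?)", rows)
            rows = []
if rows:
    conn.executemany("INSERT INTO monthly_views VALUES (?, ?, ?, ?)", rows)
conn.commit()
conn.close()

With an index on (project, title), each of the ~50 page-set lookups then becomes a handful of cheap queries instead of a full file scan.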
Cheers, Magnus
On Mon, Dec 3, 2012 at 11:59 PM, Erik Zachte <ezachte@wikimedia.org> wrote:
I have code to aggregate Domas' hourly file into a daily file and later a monthly file, while still retaining full hourly resolution.
It has been an Xmas holiday pastime and is still a bit buggy, but I can up the priority to fix this.
Intro:
http://lists.wikimedia.org/pipermail/wikitech-l/2011-August/054590.html
Data:
http://dumps.wikimedia.org/other/pagecounts-ez/monthly/
(ge5 is a subset containing only pages with 5+ views per month, which makes a big difference in file size)
Would this be useful for you, Magnus?
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Magnus Manske
Sent: Monday, December 03, 2012 9:14 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] Access to view stats
Hi Diederik,
in principle, all of the Wikimedia projects; currently, all listed at http://stats.grok.se/, which is the top 100 or so Wikipedias. Plus Commons, if possible.
As for the number of pages on those, that seems to fluctuate (probably just my scripts breaking on slow/missing data, occasionally); the largest number I can find is May 2012, with ~350K pages. But this only needs to run once a month. I could even give you the list if you like, and you can extract the data for me ;-)
But that would certainly not be a long-term, scalable solution. An SQL interface on the toolserver would be ideal; a speedy HTTP-based API would be good as well (maybe even better, as it would not require the toolserver ;-), especially if it can take chunks of data (e.g. POST requests with a 1K article list), so that I don't have to fire thousands of tiny queries.
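To illustrate the kind of batched lookup meant here, a small client sketch that posts article titles in chunks of 1,000 per request; the endpoint URL, request body, and response shape are hypothetical:

# Hypothetical client for a batched pageview API: one POST per ~1K titles
# instead of thousands of single-article requests.
import json
import urllib.request

API_URL = "https://example.org/pageviews/batch"  # placeholder endpoint

def fetch_counts(project, month, titles, chunk_size=1000):
    """Yield (title, views) pairs, querying the API one chunk at a time."""
    for i in range(0, len(titles), chunk_size):
        payload = json.dumps({
            "project": project,               # e.g. "en.wikipedia"
            "month": month,                   # e.g. "2012-05"
            "titles": titles[i:i + chunk_size],
        }).encode("utf-8")
        req = urllib.request.Request(
            API_URL, data=payload,
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            # assumed response: {"Title": views, ...}
            for title, views in json.load(resp).items():
                yield title, views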
Cheers,
Magnus
On Mon, Dec 3, 2012 at 7:38 PM, Diederik van Liere <dvanliere@wikimedia.org> wrote:
Hi Magnus,
Can you send the list of pages for the Wikimedia projects that you are interested in? Once I know how many pages there are, I can come up with a solution.
D
On Mon, Dec 3, 2012 at 2:33 PM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
+1
I have received an increasing number of external requests for something more efficient than the stats.grok.se JSON interface and more user-friendly than Domas' hourly raw data. I am also one of the interested consumers of this data.
Diederik, any chance we could prioritize this request? I guess per-article and per-project daily / monthly pv would be the most useful aggregation level.
On Dec 3, 2012, at 11:11 AM, Magnus Manske <magnusmanske@googlemail.com> wrote:
Hi all,
as you might know, I have a few GLAM-related tools on the toolserver. Some are updated once a month, some can be used live, but all are in high demand by GLAM institutions.
Now, the monthly updated stats have always been slow to run, but almost ground to a halt recently. The on-demand tools have stalled completely.
All these tools get their data from stats.grok.se, which works well but is not really high-speed; my on-demand tools have apparently been shut out recently because too many people were using them, DDoSing the server :-(
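For context on the current setup, a sketch of the kind of per-page, per-month lookup these tools make against stats.grok.se; the /json/<lang>/<YYYYMM>/<title> URL pattern and the daily_views response field are stated from memory and should be treated as assumptions:

# Sketch of a single per-page monthly lookup against stats.grok.se.
# URL pattern and response field names are assumptions, not verified API docs.
import json
import urllib.parse
import urllib.request

def monthly_views(lang, yyyymm, title):
    url = "http://stats.grok.se/json/%s/%s/%s" % (
        lang, yyyymm, urllib.parse.quote(title))
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # assumed: daily_views maps "YYYY-MM-DD" -> count; sum for the month
    return sum(data.get("daily_views", {}).values())

One HTTP request per article per month is exactly what fails to scale to ~350K pages, hence the interest in a bulk interface.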
I know you are working on page view numbers, and from what I gather it's up and running internally already. My requirements are simple: I have a list of pages on many Wikimedia projects; I need view counts for these pages for a specific month, per page.
Now, I know that there is no public API yet, but is there any way I can get to the data, at least for the monthly stats?
Cheers,
Magnus
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics