Hi,
even though I haven't really been involved until now, I think I might be able to offer a hand here.
I needed some "Top Pages" for a small game I developed a while ago for my master's
thesis, and stats.grok.se was non-functional at that time.
So I started running some simple scripts to aggregate Domas' hourly files myself.
I have continued running the scripts every couple of days on the side ever since, keeping all the
raw files and fixing several bugs over time, so from what I can see it's pretty
solid now.
The last month(s) are not a problem, as the scripts aggregate to days, months, years,
and all-time.
So, for example, one page
was accessed
10,897,800,257 times aggregated over all files from 12.2007 until yesterday,
2,592,821,012 times in 2012 until yesterday,
268,588,331 times in 11.2012,
27,076,315 times in 12.2012 until yesterday.
(For today's numbers I'd need to wait another 20 minutes, as the downstream link to our university
seems to be capped.)
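The core aggregation step is simple enough to sketch. Below is a minimal illustration, not the actual scripts: it assumes the published pagecounts line format (`project page_title views bytes`, whitespace-separated, one gzipped file per hour) and simply sums view counts per (project, page) over a set of hourly files.

```python
import gzip
from collections import Counter

def aggregate(paths):
    """Sum per-(project, page) view counts across hourly pagecounts files.

    Each line is whitespace-separated: project page_title views bytes.
    Malformed lines (wrong field count, non-numeric views) are skipped.
    """
    totals = Counter()
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split()
                if len(parts) != 4:
                    continue  # skip malformed lines
                project, title, views, _bytes = parts
                try:
                    totals[(project, title)] += int(views)
                except ValueError:
                    continue
    return totals
```

Feeding it the 24 files of a day gives daily totals; daily totals roll up to months, years, and all-time the same way.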
So I guess I pretty much already have the aggregates you're interested in, and I would be
happy to get involved to finally give something back to Wikimedia. As I seem to be a lousy
editor, why not like this ;).
Cheers,
Jörn
On 04.12.2012, at 10:07, Magnus Manske <magnusmanske(a)googlemail.com> wrote:
Hi Erik,
in principle, yes, that would be useful. However:
* I would mostly need "last month" on a continued basis, at the moment
stretching back to September 2012 I believe
* As a flat file it's not seekable, which means I would have to run through the
entire thing for each of my ~50 page sets, or keep all 50 in memory, neither of which is
appealing
Maybe I could read such a file into a toolserver database? It would be a duplication of
effort, and add load to the toolserver, but hey ;-)
Cheers,
Magnus
On Mon, Dec 3, 2012 at 11:59 PM, Erik Zachte <ezachte(a)wikimedia.org> wrote:
I have code to aggregate Domas' hourly files into a daily file and later a monthly
file while still retaining full hourly resolution.
It has been an Xmas holiday pastime and is still a bit buggy, but I can up the priority
to fix this.
Intro:
http://lists.wikimedia.org/pipermail/wikitech-l/2011-August/054590.html
Data:
http://dumps.wikimedia.org/other/pagecounts-ez/monthly/
(ge5 is a subset of only those pages with 5+ views per month, which makes a big difference in file
size)
Would this be useful for you, Magnus?
Erik
From: analytics-bounces(a)lists.wikimedia.org
[mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Magnus Manske
Sent: Monday, December 03, 2012 9:14 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in
Wikipedia and analytics.
Subject: Re: [Analytics] Access to view stats
Hi Diderik,
in principle, all of the Wikimedia projects; currently, all listed at
http://stats.grok.se/ which is the top 100 or so Wikipedias. Plus Commons, if possible.
As for the number of pages on those, that seems to fluctuate (probably just my scripts
breaking on slow/missing data, occasionally); the largest count I can find is May 2012,
with ~350K pages. But this only needs to run once a month. I could even give you the list
if you like, and you can extract the data for me ;-)
But that would certainly not be a long-term, scalable solution. An SQL interface on the
toolserver would be ideal; a speedy HTTP-based API would be good as well (maybe even
better, as it would not require the toolserver ;-), especially if it could take chunks of
data (e.g. POST requests with a 1K article list), so that I don't have to fire
thousands of tiny queries.
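A batched client along those lines might look roughly like this. It is entirely hypothetical: the endpoint URL, request body, and response shape are made up for illustration, since no such API exists yet; only the 1K-articles-per-POST batching reflects the suggestion above.

```python
import json
from urllib import request

BATCH = 1000  # articles per POST request, per the suggestion above

def chunks(items, size=BATCH):
    """Split a list of article titles into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def fetch_counts(endpoint, project, month, titles):
    """POST each batch to a (hypothetical) stats endpoint and merge replies.

    The endpoint, the JSON body, and the assumed {title: views} response
    are all placeholders -- this sketches the client side only.
    """
    counts = {}
    for batch in chunks(titles):
        body = json.dumps({"project": project, "month": month,
                           "titles": batch}).encode("utf-8")
        req = request.Request(endpoint, data=body,
                              headers={"Content-Type": "application/json"})
        with request.urlopen(req) as resp:
            counts.update(json.load(resp))  # assumed {title: views} reply
    return counts
```

A list of, say, 350K pages would then need only ~350 requests instead of 350K single-page queries.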
Cheers,
Magnus
On Mon, Dec 3, 2012 at 7:38 PM, Diederik van Liere <dvanliere(a)wikimedia.org>
wrote:
Hi Magnus,
Can you send the list of pages for the Wikimedia projects that you are interested in? Once I
know how many pages there are, I can come up with a solution.
D
On Mon, Dec 3, 2012 at 2:33 PM, Dario Taraborelli <dtaraborelli(a)wikimedia.org>
wrote:
+1
I have received an increasing number of external requests for something more efficient
than the stats.grok.se JSON interface and more user-friendly than Domas' hourly raw
data. I am also one of the interested consumers of this data.
Diederik, any chance we could prioritize this request? I guess per-article and
per-project daily/monthly page views would be the most useful aggregation levels.
On Dec 3, 2012, at 11:11 AM, Magnus Manske <magnusmanske(a)googlemail.com> wrote:
Hi all,
as you might know, I have a few GLAM-related tools on the toolserver. Some are updated
once a month, some can be used live, but all are in high demand by GLAM institutions.
Now, the monthly updated stats have always been slow to run, but have almost ground to a
halt recently. The on-demand tools have stalled completely.
All these tools get their data from stats.grok.se, which works well but is not exactly
high-speed; my on-demand tools have apparently been shut out recently because too many
people were using them, effectively DDoSing the server :-(
I know you are working on page view numbers, and from what I gather it's
up and running internally already. My requirements are simple: I have a list of pages on
many Wikimedia projects; I need view counts for these pages for a specific month,
per page.
Now, I know that there is no public API yet, but is there any way I can get to the data,
at least for the monthly stats?
Cheers,
Magnus
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics