The compaction of hourly page request files into daily files, and daily into
monthly, is now operational.
Dec 2012:
Hourly files: 65 GB
Daily files: 18 GB
Monthly file: 5 GB
Space is saved as follows:
1) each article title occurs only once instead of up to 744 times
2) bz2 compression
3) a threshold of 5+ requests per month in the final monthly file
All versions still retain hourly resolution.
Each file starts with comments on the file format (in a nutshell: sparse
indexing: day and hour are each encoded as a letter, followed by a count).
http://dumps.wikimedia.org/other/pagecounts-ez/merged/
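Decoding such a sparse count string could look roughly as follows. This is
only an illustrative sketch: the exact letter assignments used here (A = day 1,
A = hour 0 through X = hour 23, day groups separated by commas) are assumptions
for illustration, not the documented pagecounts-ez format; the authoritative
description is in the header comments of each file.

```python
def parse_day_group(group):
    """Parse one comma-separated day group: a day letter (assumed A = day 1)
    followed by hour-letter/count pairs (assumed A = hour 0 ... X = hour 23)."""
    day = ord(group[0]) - ord('A') + 1
    hours = {}
    i = 1
    while i < len(group):
        hour = ord(group[i]) - ord('A')  # hour letter
        i += 1
        j = i
        while j < len(group) and group[j].isdigit():
            j += 1  # consume the decimal count following the hour letter
        hours[hour] = int(group[i:j])
        i = j
    return day, hours

def decode_counts(encoded):
    """Decode a full per-day/per-hour count string, e.g. "BA7C3,DX12"."""
    return dict(parse_day_group(g) for g in encoded.split(',') if g)
```

Under these assumed mappings, "BA7C3" would decode to day 2 with 7 requests in
hour 0 and 3 requests in hour 2.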
As a spin-off, the new data stream is also used for a new monthly page request
report for all 800 wikis,
e.g. English Wikipedia
http://tinyurl.com/bbnch45
The full list is at
http://tinyurl.com/cq4rfla (alas, no friendly front-end yet)
Maybe or maybe not suitable for Magnus, but useful in its own right anyway.
E.g. it allows easy external archiving for posterity (Internet Archive), like
the tweet archive of the Library of Congress.
Erik Zachte
On 04.12.2012, at 10:07, Magnus Manske <magnusmanske(a)googlemail.com> wrote:
Hi Erik,
in principle, yes, that would be useful. However:
* I would mostly need "last month" on a continued basis, at the moment
stretching back to September 2012, I believe
* As a flat file it is not seekable, which means I would have to run through
the entire thing for each of my ~50 page sets, or keep all 50 in memory;
neither of which is appealing
Maybe I could read such a file into a toolserver database? It would be a
duplication of effort, and add load to the toolserver, but hey ;-)
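The single-pass alternative described above (scanning the flat file once while
keeping all ~50 page sets in memory) could be sketched as follows. This is a
hedged sketch: the line layout "project title monthly_total encoded_hours" and
the function name `collect_counts` are assumptions for illustration, not the
actual merged-file format.

```python
def collect_counts(lines, page_sets):
    """One pass over a merged monthly file, collecting monthly totals
    for every page set at once.

    page_sets: dict mapping set name -> set of (project, title) pairs.
    lines: iterable of lines, assumed "project title monthly_total ..."."""
    results = {name: {} for name in page_sets}
    for line in lines:
        if not line or line.startswith('#'):
            continue  # the file starts with format comments
        parts = line.split()
        if len(parts) < 3 or not parts[2].isdigit():
            continue  # skip malformed lines
        project, title, total = parts[0], parts[1], parts[2]
        for name, wanted in page_sets.items():
            if (project, title) in wanted:
                results[name][(project, title)] = int(total)
    return results
```

Memory use is proportional to the page sets and their matches, not to the
file, so all sets can be answered in a single sequential read.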
Cheers,
Magnus
On Mon, Dec 3, 2012 at 11:59 PM, Erik Zachte <ezachte(a)wikimedia.org>
wrote:
I have code to aggregate Domas' hourly file into a daily file, and later a
monthly file, while still retaining full hourly resolution.
It has been a Christmas holiday pastime and is still a bit buggy, but I can
up the priority to fix this.
difference in file size)
Would this be useful for you, Magnus?
Erik
From: analytics-bounces(a)lists.wikimedia.org
[mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Magnus Manske
Sent: Monday, December 03, 2012 9:14 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an
interest in Wikipedia and analytics.
Subject: Re: [Analytics] Access to view stats
Hi Diderik,
in principle, all of the Wikimedia projects; currently, all listed at
http://stats.grok.se/ which covers the top 100 or so Wikipedias. Plus Commons,
if possible.
As for the number of pages on those, that seems to fluctuate (probably just my
scripts breaking on slow/missing data, occasionally); the largest amount I can
find is May 2012, with ~350K pages. But this only needs to run once a month.
I could even give you the list if you like, and you can extract the data
for me ;-)
But that would certainly not be a long-term, scalable solution. An SQL
interface on the toolserver would be ideal; a speedy HTTP-based API would be
good as well (maybe even better, as it would not require the toolserver ;-),
especially if it can take chunks of data (e.g. POST requests with a 1K
article list), so that I don't have to fire thousands of tiny queries.
Cheers,
Magnus
On Mon, Dec 3, 2012 at 7:38 PM, Diederik van Liere
<dvanliere(a)wikimedia.org>
wrote:
Hi Magnus,
Can you send the list of pages for the Wikimedia projects that you are
interested in?
Once I know how many pages there are, I can come up with a solution.
D
On Mon, Dec 3, 2012 at 2:33 PM, Dario Taraborelli
<dtaraborelli(a)wikimedia.org> wrote:
+1
I have received an increasing number of external requests for something more
efficient than the stats.grok.se JSON interface and more user-friendly than
Domas' hourly raw data. I am also one of the interested consumers of this
data.
Diederik, any chance we could prioritize this request? I guess per-article
and per-project daily / monthly pv would be the most useful aggregation
level.
On Dec 3, 2012, at 11:11 AM, Magnus Manske <magnusmanske(a)googlemail.com>
wrote:
Hi all,
as you might know, I have a few GLAM-related tools on the toolserver. Some
are updated once a month, some can be used live, but all are in high demand
by GLAM institutions.
Now, the monthly updated stats have always been slow to run, but recently
almost ground to a halt. The on-demand tools have stalled completely.
All these tools get their data from stats.grok.se, which works well but is
not really high-speed; my on-demand tools have apparently been shut out
recently because too many people were using them, DDoSing the server :-(
I know you are working on page view numbers, and from what I gather it's
up-and-running internally already. My requirements are simple: I have a list
of pages on many Wikimedia projects; I need view counts for these pages for
a specific month, per-page.
Now, I know that there is no public API yet, but is there any way I can
get to the
data, at least for the monthly stats?
Cheers,
Magnus
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics