As you might know, I have a few GLAM-related tools on the toolserver. Some
are updated once a month, some can be used live, but all are in high demand
by GLAM institutions.
Now, the monthly updated stats have always been slow to run, but they have
almost ground to a halt recently. The on-demand tools have stalled completely.
All these tools get their data from stats.grok.se, which works well but is
not really high-speed; my on-demand tools have apparently been shut out
recently because too many people were using them, DDOSing the server :-(
I know you are working on page view numbers, and from what I gather it's
up-and-running internally already. My requirements are simple: I have a
list of pages on many Wikimedia projects; I need view counts for these
pages for a specific month, per-page.
Now, I know that there is no public API yet, but is there any way I can get
to the data, at least for the monthly stats?
There's discussion at
https://bugzilla.wikimedia.org/show_bug.cgi?id=44448 about how skin
usage correlates with who's an active editor.
It would be great to know what percentage of active editors (5+ edits in
the main namespace) use each skin on English Wikipedia, perhaps for
the last three months.
Apologies for crossposting
The Analytics Team is planning to deploy "tab as field delimiter" to
replace the current space as field delimiter on the varnish/squid/nginx
servers. We would like to do this on February 1st. The reason for this
change is that we need a consistent number of fields in each webrequest
log line. Right now, some fields contain spaces, which requires a lot of
post-processing cleanup and slows down the generation of reports.
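To illustrate the field-count problem, here is a minimal sketch. The log
fields below are made up for illustration and are not the actual webrequest
log layout; `count_fields` is a hypothetical helper.

```python
# Hypothetical log fields; NOT the real webrequest log layout.
fields = [
    "cp1001.wikimedia.org",
    "2013-01-15T12:00:00",
    "GET",
    "http://en.wikipedia.org/wiki/Main_Page",
    "Mozilla/5.0 (X11; Linux x86_64)",  # this field contains spaces
]

space_line = " ".join(fields)
tab_line = "\t".join(fields)


def count_fields(line, delimiter):
    """Return the number of fields obtained by splitting on delimiter."""
    return len(line.split(delimiter))


# With spaces, the field count depends on the content of each field.
print(count_fields(space_line, " "))
# With tabs, the field count matches the number of fields written.
print(count_fields(tab_line, "\t"))
```

With a space delimiter the user-agent field is split into several pieces,
so every consumer has to guess where the real field boundaries are.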
What is affected and maintained by Analytics
* udp-filter already has support for the tab character
* webstatscollector: we compiled a new version of filter to add support for
the tab character
* wikistats: we will fix the scripts on an ongoing basis.
* udp2log: we have a patch ready for inserting sequence numbers separated
In particular, I would like to have feedback on three questions:
1) Are there important reasons not to use tab as field delimiter?
2) Are there important pieces of logging that expect a space instead of a
tab and that need to be fixed and that I did not mention in this email?
3) Is February 1st a good date to deploy this change? (Assuming that all
preps are finished)
There has been some recent discussion on wikidata-l about making
sure wikidata.org page views are part of the stats that are being
collected at stats.grok.se. I did a quick check to see if page views
on www.wikidata.org were showing up, and they don't appear to be.
curl --silent http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-201…
| zcat - | egrep ' Q\d+ '
de Q100 2 15502
de Q5 2 95286
en Q0 1 7815
en Q1 6 70841
en Q10 2 32918
en Q101 1 7709
en Q102 1 8128
en Q107 1 26148
en Q11 1 387
en Q19 2 14304
en Q2 7 44454
en Q3 1 8624
en Q35 1 22856
en Q4 3 40588
en Q4000 2 28866
en Q5 3 16698
en Q612 1 7152
en Q6600 2 29929
en Q7 2 16604
en Q9450 2 59452
fr Q4 2 24352
fr Q400 1 17339
hu Q10 3 105745
ja Q0 1 8685
ja Q10 36 781763
ja Q4 1 8873
ja Q400 1 25208
ko Q3 1 22350
nl Q65 2 24562
ru Q10 1 31355
ru Q4000 1 11140
sv Q10 1 365
zh Q1 1 10950
zh Q10 2 17738
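For anyone who wants to repeat this check, here is a minimal sketch in
Python. It assumes each line has four space-separated columns (project
code, page title, request count, bytes transferred), as in the output
above. `sum_item_views` is a hypothetical helper, and the last two sample
lines are invented for illustration.

```python
import re
from collections import defaultdict

# Titles shaped like Wikidata item IDs, e.g. "Q42".
ITEM_ID = re.compile(r"^Q[0-9]+$")


def sum_item_views(lines):
    """Sum request counts per (project, title) for titles shaped like Q<digits>."""
    totals = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not have exactly four columns
        project, title, count, _size = parts
        if ITEM_ID.match(title):
            totals[(project, title)] += int(count)
    return dict(totals)


sample = [
    "de Q100 2 15502",      # taken from the output above
    "en Q1 6 70841",        # taken from the output above
    "en Main_Page 5 1000",  # invented: not an item-style title
    "en Q1 1 500",          # invented: a second line for the same page
]
print(sum_item_views(sample))
```

Note that the first column in the matches above is a language code such as
"en" or "de", so these lines appear to be pages titled Q<number> on other
projects rather than www.wikidata.org itself.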
Does anyone have any idea how to get stats.grok.se updated?
I think I've seen from previous conversations here that an alternate
source for stats is being created. Is there any information available
on how to use it, if it's available yet?
Last week I resurrected Wikihadoop from the Summer of Research 2011, when
we wrote a Hadoop-based input parser for bzipped XML dumps of Wikipedia and
added the ability to create diffs between revisions. This work was mainly
done by Yusuke Matsubara, Aaron Halfaker and Fabian Kaelin.
This is working again and if you have input / suggestions on how to expose
this data, then please let me know!
Since yesterday afternoon we have been having intermittent packet loss
issues with Emery. Yesterday Ori and I wrote a simple patch to enable
sampling on two filters to reduce the workload. That seemed to work, but
this morning the packet loss came back. Paravoid pointed out that it was
probably because Emery had been running for 208 days, and older versions
of the kernel start doing weird things after that much uptime.
Rebooting Emery seems to have resolved the issue, and we are planning to
upgrade to Precise soon.
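For readers unfamiliar with sampling as a way to reduce load, the idea can
be sketched as keeping one line out of every n. This is only an
illustration of the general technique, not the patch that was deployed;
`sample_every_nth` is a hypothetical name.

```python
def sample_every_nth(lines, n):
    """Yield every n-th line (1-in-n sampling), starting with the first."""
    if n < 1:
        raise ValueError("n must be >= 1")
    for index, line in enumerate(lines):
        if index % n == 0:
            yield line


# Keep 1 line in 5 from ten made-up log lines.
kept = list(sample_every_nth((f"line {i}" for i in range(10)), 5))
print(kept)
```

Any counts computed from a sampled stream then need to be multiplied by n
to estimate the real totals.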
I wanted to follow up on and second Magnus' December request for a
more usable way to access page view stats. Mining these stats is
attracting an increasing amount of attention from researchers,
even as the current approaches for extracting them from stats.grok.se or
the dumps are slow and inhumane (respectively).
I'm also interested in looking at bursts of pageview activity on articles
and then examining the extent to which this pageview activity diffuses over
the local wiki-link network. I suspect this has strong implications for
understanding patterns of editing activity; namely, editing activity may be
non-trivially coupled with sudden attention to articles that are a few
degrees of separation away. I'd be happy to chat with folks inside or
outside of WMF about getting access to the relevant view stats and
beginning such an analysis.
This is a pretty interesting and accessible description of best practices and design decisions driven by practical problems they had to solve at Twitter in the areas of client-side event logging, funnel analysis, and user modeling.
E3: check out section "3.2 Client Events" in particular, which is quite relevant to EventLogging.