As you might know, I have a few GLAM-related tools on the toolserver. Some
are updated once a month, some can be used live, but all are in high demand
by GLAM institutions.
Now, the monthly updated stats have always been slow to run, but they have
almost ground to a halt recently. The on-demand tools have stalled completely.
All these tools get their data from stats.grok.se, which works well but is not
really high-speed; my on-demand tools have apparently been shut out
recently because too many people were using them, DDOSing the server :-(
I know you are working on page view numbers, and from what I gather it's
up-and-running internally already. My requirements are simple: I have a
list of pages on many Wikimedia projects; I need view counts for these
pages for a specific month, per-page.
Now, I know that there is no public API yet, but is there any way I can get
to the data, at least for the monthly stats?
There's discussion at
https://bugzilla.wikimedia.org/show_bug.cgi?id=44448 about how skin
usage correlates with who's an active editor.
It would be great to know what percentage of active editors (5+ edits in
the main namespace) use each skin on English Wikipedia. Perhaps for
the last three months.
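As a rough sketch of the calculation being asked for, assuming we already have one record per editor with their main-namespace edit count for the period and their skin preference (the record layout and the name skin_share are made up for illustration):

```python
from collections import Counter

def skin_share(editors, min_edits=5):
    """Return {skin: percentage} among editors with at least `min_edits`
    main-namespace edits. Returns {} if nobody qualifies."""
    active = [e for e in editors if e["main_ns_edits"] >= min_edits]
    if not active:
        return {}
    counts = Counter(e["skin"] for e in active)
    return {skin: 100.0 * n / len(active) for skin, n in counts.items()}

# Hypothetical records; real data would come from the user and revision tables.
editors = [
    {"user": "A", "main_ns_edits": 12, "skin": "vector"},
    {"user": "B", "main_ns_edits": 5, "skin": "monobook"},
    {"user": "C", "main_ns_edits": 2, "skin": "vector"},   # below threshold
    {"user": "D", "main_ns_edits": 40, "skin": "vector"},
    {"user": "E", "main_ns_edits": 7, "skin": "vector"},
]
shares = skin_share(editors)
print(shares)
```

Running this once per month over the last three months would give the trend as well as the snapshot.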
Fundraising is proposing an experiment to model user behavior on our
properties. I've written an RfC on exactly what I'm proposing here. I
would love any comments/concerns/methodology changes/and additional
considerations you might have.
Fundraising Technology Team
Apologies for crossposting
The Analytics Team is planning to deploy "tab as field delimiter" to
replace the current space as field delimiter on the varnish/squid/nginx
servers. We would like to do this on February 1st. The reason for this
change is that we need to have a consistent number of fields in each
webrequest log line. Right now, some fields contain spaces, which requires
a lot of post-processing cleanup and slows down the generation of reports.
What is affected and maintained by Analytics
* udp-filter already has support for the tab character
* webstatscollector: we compiled a new version of filter to add support for
the tab character
* wikistats: we will fix the scripts on an ongoing basis.
* udp2log: we have a patch ready for inserting sequence numbers separated
In particular, I would like to have feedback on three questions:
1) Are there important reasons not to use tab as field delimiter?
2) Are there important pieces of logging that expect a space instead of a
tab and that need to be fixed and that I did not mention in this email?
3) Is February 1st a good date to deploy this change? (Assuming that all
preps are finished)
I'm trying to find some page view stats for Wikipedia articles broken down
by subject matter (like 'history', or 'science'). So I would like to be
able to find out what fraction of total Wikipedia page views are for
articles belonging to a particular category like 'History' or 'The Arts'.
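To show the calculation I have in mind, here is a minimal sketch. It assumes two inputs that I do not actually have yet: a mapping from page title to view count, and a mapping from page title to its top-level categories. The names and numbers are made up.

```python
def category_fractions(page_views, page_categories):
    """Return {category: fraction of total views}.

    A page in several categories counts toward each of them, so the
    fractions can add up to more than 1.
    """
    total = sum(page_views.values())
    if total == 0:
        return {}
    per_category = {}
    for page, views in page_views.items():
        for category in page_categories.get(page, ()):
            per_category[category] = per_category.get(category, 0) + views
    return {c: v / total for c, v in per_category.items()}

# Hypothetical data for illustration only.
page_views = {"French_Revolution": 300, "Photosynthesis": 500, "Mona_Lisa": 200}
page_categories = {
    "French_Revolution": ["History"],
    "Photosynthesis": ["Science"],
    "Mona_Lisa": ["The Arts", "History"],
}
fractions = category_fractions(page_views, page_categories)
print(fractions)
```

The hard part in practice is probably the second mapping, since Wikipedia's category graph is deep and pages rarely sit directly in a top-level category.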
If it helps, an example of the kind of information I was hoping for can be
found in this video of Jimmy Wales
https://www.youtube.com/watch?v=IhumTKbmdFs (chart is shown at time 12:40).
I would be really grateful for any suggestions on getting this kind of data.
Of interest to some, Jun Rao gave a talk at ApacheCon about Kafka
replication (scheduled to land in 0.8 in March). I've pulled out
some bits perhaps of interest.
Updated stats about LinkedIn's experience with Kafka:
- Writes: >10B messages/day (>2TB compressed data)
- Reads: >50B messages/day (>1PB compressed data)
- Typical failover time after a broker failure: <10ms
Slides 14, 18-20 talk about its replication model for eventual consistency,
interesting as it intentionally makes tradeoffs to take advantage of
intra-datacenter latency being an order of magnitude(ish) better than that
between DCs connected by the open internet. In exchange for some extra
chatter, they tolerate 2f failures among 2f+1 replicas. Clever, and clearly
it works for them. (See slides 21-22 for unhelpful diagrams, 27-31 for
interesting performance numbers, excepting slide 28's totally inexplicable
durability column using highly scientific measures like "some data loss" vs
"a few data loss". What.)
Pretty neat stuff, and it's great to see a built-in solution for cross-DC
http://kafka.apache.org/ -- now out of incubator!
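For the "2f failures among 2f+1 replicas" point in the replication notes above, a back-of-the-envelope comparison may help. This is a simplified model, not Kafka code: it assumes a majority-quorum scheme needs more than half of the replicas alive, while an in-sync-replica scheme keeps going as long as one in-sync replica survives. The function names are mine.

```python
def majority_quorum_tolerance(replicas):
    """Failures tolerated when a strict majority must stay alive."""
    return (replicas - 1) // 2

def in_sync_set_tolerance(replicas):
    """Failures tolerated when one surviving in-sync replica is enough."""
    return replicas - 1

# Compare both schemes for 2f+1 replicas with f = 1, 2, 3.
rows = []
for f in (1, 2, 3):
    n = 2 * f + 1
    rows.append((n, majority_quorum_tolerance(n), in_sync_set_tolerance(n)))

for n, quorum, isr in rows:
    print(f"{n} replicas: majority quorum tolerates {quorum}, in-sync set tolerates {isr}")
```

The extra tolerance is what the additional intra-datacenter chatter buys.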
We're organizing a half-day workshop for engineers, analysts, PMs and other parties interested in learning how to use EventLogging.
EventLogging is a MediaWiki extension developed by the E3 team that allows the collection of data on how users interact with our site. It's been largely adopted in Product/Mobile/Feature engineering to run A/B tests and to evaluate experimental features, but it can be used more generally to identify usability problems and to collect data to inform feature design.
Whether you are already planning to use EventLogging for an existing project or you are just curious to learn how it works, the session will cover a typical workflow:
1) turning an idea into a data model
2) instrumenting MediaWiki to log events
3) accessing and QA'ing log data
4) performing simple log data analysis
The workshop will be hosted at the Wikimedia Foundation (Collab space, 6th floor) on March 7 between 1:30pm and 5pm. The whole E3 engineer line-up will be in the office to provide hands-on demos and tutorials. If you are interested in attending, please sign up on the workshop page. The session will be recorded but we're not currently planning to stream it.
We are glad to see that more people are finding their way to this
mailing list and that's really cool! Often, a thread will involve a request
for a new dataset, a bugfix or a new feature. I am trying to keep track of
all these requests as best as I can, but what would really help is if you
could file a request in Bugzilla under Product Analytics and then use either
the General Component (if you don't know exactly where it should go) or use
the appropriate component. If a component is missing then please let me
know and I will get it added.
Thanks for your cooperation and keep those requests coming!