Tried Wikimetric today and it looks like a good start to me. Some feedback:
* Login via a Google/Twitter account; this should be something WMF-managed like the Labs/Gerrit LDAP
* Should use https by default
* Oh wait, invalid certificate; filed a bug at
* Only English? It should be multilingual like all our software. The
people at translatewiki will be happy to translate for you
* Uploading CSV user lists is not very convenient. Are you planning to come
up with an easier/better system?
* Project "en" is a bit weird. You're probably using <project>wiki_p for
the database. Can you add a link to available projects? Or how to
construct it? Say for example I want the German Wikivoyage.
* Descriptions seem to be missing for some fields at
* You could probably grab the namespaces on the fly from the MediaWiki API
* Can you add an option to give output per time period (month would be
* Can you add bytes uploaded as a metric?
* Can you split out the result per namespace?
* http://metrics.wmflabs.org/support contains a link to the empty page
http://www.mediawiki.org/wiki/Wikimetrics/FAQ . Can you make that link
https by default?
* Where is the code? Can we submit new metrics? See for example
http://toolserver.org/~reports/?wiki=nl.wikipedia.org for a similar service
* Are you planning to offer some visual output besides csv/json? See for
* I see you have sql queries. What tables are available? All
(non-private) tables like on the Toolserver and Toollabs?
* Do you have some metrics on the usage of wikimetrics? :-)
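On the namespaces point above: the MediaWiki API exposes them via action=query&meta=siteinfo&siprop=namespaces, so Wikimetrics would not need to hardcode them per wiki. A minimal parsing sketch (the sample payload below is abbreviated from what that call really returns):

```python
import json

# The live request would be something like:
#   https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&format=json
# Here we just parse an abbreviated example of the JSON that call returns.
SAMPLE_RESPONSE = json.dumps({
    "query": {
        "namespaces": {
            "0": {"id": 0, "*": ""},      # main/article namespace has an empty name
            "1": {"id": 1, "*": "Talk"},
            "2": {"id": 2, "*": "User"},
            "6": {"id": 6, "*": "File"},
        }
    }
})

def parse_namespaces(response_text):
    """Map namespace id -> localized namespace name from a siteinfo response."""
    data = json.loads(response_text)
    return {ns["id"]: ns["*"] for ns in data["query"]["namespaces"].values()}

namespaces = parse_namespaces(SAMPLE_RESPONSE)
print(namespaces)
```

Because the names come back localized per wiki, the same call would also help with the multilingual point above.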
I would like to delete some of our Gerrit repos that we don't use at all
and that just clutter up our projects. I propose to delete the following repos:
* analytics/DeviceMapLogCapture (a short-lived experiment for device
recognition by Patrick Reilly; we are using OpenDDR now)
* analytics/debs/kafka-0.7.2 (this is an old test version of kafka, the new
repo lives under operations)
* analytics/dclass - an old dclass repo with Debian stuff; the new one lives
* analytics/graphkit - i think this is a precursor to Limn, the description
says "IGNORE THIS REPO"
* analytics/global-dev/sqproc - a small repo from Evan; his work is all
available under github.com/embr
* analytics/global-dev/reportcard - message "IGNORE THIS REPO"
* analytics/reportcard/old-pipeline - a test in python to replace
wikistats, never got far.
* analytics/user-metrics-2 - not used AFAIK
* analytics/packages/thrift - this repository has been abandoned in favor of
* analytics/E3Analysis - the correct repo for UMAPI is
Please chime in if you disagree with any of the proposals. Once we have
consensus I will ask Chad to delete them.
Today a new recruit has joined the Analytics Team: CommanderData!
Not surprisingly, its IRC handle is CommanderData and its duties are:
1) helping to minglify the Analytics Team to unprecedented heights.
2) answering random "why" questions.
From now on, entering a # followed by the Mingle card number (like #1112)
will trigger CommanderData to reply with a link to the Mingle card.
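The trigger described above essentially boils down to one regex plus a URL template. A sketch of the idea (the Mingle URL pattern here is my guess, not the bot's actual configuration):

```python
import re

# Hypothetical Mingle card URL template -- the real project URL may differ.
MINGLE_CARD_URL = "https://mingle.corp.wikimedia.org/projects/analytics/cards/{num}"

# Matches "#" followed by one or more digits, e.g. "#1112".
CARD_PATTERN = re.compile(r"#(\d+)")

def card_links(message):
    """Return a Mingle card link for every #<number> token in an IRC message."""
    return [MINGLE_CARD_URL.format(num=n) for n in CARD_PATTERN.findall(message)]

print(card_links("can someone look at #1112 and #1113?"))
```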
I just spent some time playing with Hive and JSON today, and I think I finally have a grasp on all of the items and questions that are left to make this actually happen. I'm writing them down here to summarize for you and for my own brain :)
-- compression support (snappy?)
-- puppet module
-- local puppetization (with our JSON logging format nailed down).
-- Packaged and installed on mobile hosts via puppet.
- Kafka 0.8 Brokers
-- 0.8 package in apt.wikimedia.org (Alex K is going to do this for me soon).
-- Repave analytics1021 and analytics1022, install Kafka brokers via puppet.
-- Figure out how to deploy and run this:
Shadow Jar? Puppetized cronjob? Oozie?
-- If needed, implement geocoding and anonymization as part of the
Camus ETL phase. This could also be done as an after-the-fact Pig or MR
job scheduled by Oozie.
-- Do Hadoop compression settings automatically work when writing
to HDFS from Camus?
-- How do we properly deploy and use hive-serdes-1.0-SNAPSHOT.jar?
-- Determine proper webrequest Hive schema based on final
varnishkafka JSON log format. Put this in Kraken repo somewhere?
-- Write an Oozie job for creating Hive partitions after Camus imports.
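For that last item, the job mainly has to turn each hourly Camus import directory into an ALTER TABLE ... ADD PARTITION statement. A sketch of that mapping (the table name, partition keys and HDFS layout below are assumptions, to be replaced by whatever the final varnishkafka/Camus layout is):

```python
from datetime import datetime

def add_partition_statement(table, base_path, ts):
    """Build the HiveQL to register one hourly Camus import as a Hive partition.

    The table name, the year/month/day/hour partition keys and the HDFS
    directory layout are assumptions -- adjust to the real Camus output layout.
    """
    location = "{base}/{t:%Y/%m/%d/%H}".format(base=base_path, t=ts)
    return (
        "ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
        "(year={t.year}, month={t.month}, day={t.day}, hour={t.hour}) "
        "LOCATION '{loc}';"
    ).format(table=table, t=ts, loc=location)

stmt = add_partition_statement(
    "webrequest", "/wmf/data/raw/webrequest", datetime(2013, 8, 1, 14))
print(stmt)
```

Using ADD IF NOT EXISTS makes the job idempotent, so the Oozie coordinator can safely re-run an hour after a failed import.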
I have been talking with a lot of you in the past months and at Wikimania
about Limn and how to move forward. One of the recurring themes has been
that Limn is currently written in Coco, and that this significantly hinders
adoption as there are very few Coco developers (Coco is a fork of CoffeeScript).
I have sent this email to mobile-tech, e2 and e3 mailinglists as well
because there are many developers outside of the Analytics team who use
Limn and I would really like to hear their opinion as well.
So the question I want to pose is:
This question is getting more urgent because of two reasons:
1) The Analytics team is going to grow in the coming months and we expect
to start developing features for Limn again and if we want to drop Coco as
dependency then this is probably the best time to talk about it.
2) It seems that the community around Coco is stagnant, maybe even in
decline. Looking at https://github.com/satyr/coco you can see that
there have been very few commits in the last 4 months. This could either mean
that the language is feature-complete and bug-free or, more likely, that the
decline has started. For the long-term prospects of Limn, this is not good.
I would like to run a straw poll; please respond to this thread by
I'm hoping to provide a data stream and archival data for edit conflict
events on *.wikipedias. The short-term goal is to help support further
research into heuristic reconstruction of the article revision graph, see
this paper presented by Jianmin Wu (author CC'ed here):
The only marker I have found so far is, unfortunately, a message emitted
using wfDebug. Do we have an archive of production debug logs, and what is
the process I would follow for proposing a historical experiment or an
ongoing filter using this data?
For anyone who's curious, I think the main string I'm looking for is
"Keeping edit conflict, failed merge.", but it would be worthwhile to
analyze logging from every code path within conflict resolution.
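To make concrete what kind of filter I mean: even before a proper event stream exists, an ongoing filter could just scan the debug log for that marker string. A minimal sketch (the log line format below is made up for illustration; only the marker string itself comes from the code):

```python
# The one marker I have found so far in the conflict-resolution code path.
MARKER = "Keeping edit conflict, failed merge."

def conflict_lines(log_lines):
    """Return only the log lines that record a failed three-way merge."""
    return [line for line in log_lines if MARKER in line]

# Illustrative sample; real debug-log lines will look different.
sample_log = [
    "2013-08-07 12:00:01 enwiki: Article::doEdit: edit ok",
    "2013-08-07 12:00:02 enwiki: Keeping edit conflict, failed merge.",
]
print(conflict_lines(sample_log))
```

A fuller version would match one marker per code path within conflict resolution, which is why an inventory of those log messages would be the first step.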