I work for a consulting firm called Strategy&. We have been engaged by Facebook, on behalf of Internet.org, to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content, defined as 100K+ Wikipedia articles in that language. We have a few questions related to this analysis prior to publishing it:
* We are currently using the article count by language from the Wikimedia Foundation's public page http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article counts, and does it include stubs?
* Is it possible to get historical data for article counts? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to the ISO 639-3 language codes in Ethnologue (the source we are using for primary-language data). The two-letter code for a Wikipedia language in the "List of Wikipedias" sometimes, but not always, matches the ISO 639-1 code. Is there an easy way to do the mapping?
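Once an exception list for the non-matching codes exists, the join itself is mechanical. A minimal sketch in shell, where every file name and row below is an invented example (the real tables would come from the List of Wikipedias and the ISO 639 code tables):

```shell
# Map Wikipedia language codes to ISO 639-3: join on the ISO 639-1 code
# where it matches, and keep a hand-maintained override list for codes
# that don't. All rows below are illustrative, not real data.

# Wikipedia code and article count
cat > wiki.txt <<'EOF'
de 1800000
en 4900000
simple 115000
EOF

# ISO 639-1 to ISO 639-3
cat > iso.txt <<'EOF'
de deu
en eng
EOF

# Manual overrides for Wikipedia codes that are not ISO 639-1
cat > overrides.txt <<'EOF'
simple eng
EOF

# join(1) needs sorted input; the toy files above are already sorted.
join iso.txt wiki.txt        # codes that are plain ISO 639-1
join overrides.txt wiki.txt  # codes needing a manual override
# -> de deu 1800000
#    en eng 4900000
#    simple eng 115000
```

The override file is the part that needs human curation; the rest is a plain key join.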
Formerly Booz & Company
Rawia Abdel Samad
Direct: +9611985655 | Mobile: +97455153807
This discussion is about needed updates to the definition and Analytics
implementation for the mobile apps' page view metrics. There is also an
associated Phab task. Please add the proper Analytics project there.
Background / Changes
As you probably remember, the Android app splits a page view into two
requests: one for the lead section and metadata, plus another one for the
remaining sections. The mobile apps are going to change the way they load
pages in two ways:
1. We'll add a link preview when someone clicks on a link from a page.
2. We're planning on switching over to using RESTBase for loading pages
and also for the link preview (initially just the Android beta, later more).
This will have implications for the pageview definition and for how we
count page views.
The big question is:
Should we count link previews as a page view since it's an indication of
user engagement? Or should there be a separate metric for link previews?
Counting page views
IIRC we currently count requests to api.php with the query parameters
action=mobileview&sections=0 as a page view. When we publish link previews
for all Android app users, we would then either want to also count the
calls to action=query&prop=extracts as a page view or add them to another
metric.
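As a concrete sketch of that counting logic over a toy sampled request log (the file name, line format, and all entries are invented for illustration):

```shell
# Tally page views and candidate link previews from a made-up sampled log.
cat > sampled.log <<'EOF'
/w/api.php?action=mobileview&sections=0&page=Foo
/w/api.php?action=mobileview&sections=0&page=Bar
/w/api.php?action=query&prop=extracts&titles=Baz
/w/api.php?action=parse&page=Qux
EOF

# Current definition: mobileview with sections=0 counts as a page view.
pageviews=$(grep -c 'action=mobileview&sections=0' sampled.log)
# Candidate: prop=extracts calls become link previews (or extra page views).
previews=$(grep -c 'action=query&prop=extracts' sampled.log)
echo "pageviews=$pageviews previews=$previews"
# -> pageviews=2 previews=1
```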
Once the apps use RESTBase, the HTTPS requests will be very different:
- Page view: Instead of action=mobileview&sections=0, the app would call
the RESTBase endpoint for the lead request instead of the PHP API mentioned
above. Then it would call .
- Link preview: Instead of action=query&prop=extracts, it would call the
lead request, too, since there is a lot of overlap. At least that's our
current plan. The advantage of that is that the client doesn't need to
execute the lead request a second time if the user clicks on the link
preview (either through caching or app logic).
So, in the RESTBase case we either want to count the
mobile-html-sections-lead requests or the
depending on what our definition for page views actually is. We could also
add a query parameter or extra HTTP header to one of the
mobile-html-sections-lead requests if we need to distinguish between
previews and page views.
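A sketch of how such a distinguishing marker could be tallied, using an invented purpose= query parameter and made-up log lines (the real marker name, endpoint paths, and log format would all still need to be decided):

```shell
# Split requests to the same lead endpoint into page views vs. previews
# by a (hypothetical) purpose= query parameter recorded in the logs.
cat > restbase.log <<'EOF'
/api/rest_v1/page/mobile-html-sections-lead/Foo?purpose=pageview
/api/rest_v1/page/mobile-html-sections-lead/Bar?purpose=preview
/api/rest_v1/page/mobile-html-sections-lead/Baz?purpose=pageview
EOF

awk -F'purpose=' '{count[$2]++} END {for (k in count) print k, count[k]}' \
    restbase.log | sort
# -> pageview 2
#    preview 1
```

An extra HTTP header would work the same way, provided it is copied into whatever log format the pageview pipeline consumes.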
Both the current PHP API based metrics and the RESTBase based metrics would
need to be compatible and collected in parallel, since we cannot control
when users update their apps.
A number of us are discussing the year to date editor population stats.
When can we anticipate seeing the August stats? It would be helpful to have
them be published at least a week before the publication of the monthly
Recent Research report for September.
I've been asked in a private email why WMF forked ua-parser
(a library used to extract information from User-Agent headers).
There is no need to discuss this in private, hence I am replying to
the mailing list.
TL;DR: It was not a real fork. We just worked around issues with
upstream's release management.
What follows is a bit detailed. But given the context I decided to
err on the side of being over-verbose.
Back in October 2014, WMF pushed towards analyzing User-Agent headers
in the logs to, for example, allow more accurate estimates of how many
requests WMF sees from Android vs. iPhone devices, which browsers are
used in which versions, etc.
Extracting information from User-Agent headers is a bit tricky, as
there are quite a few corner cases. So it was decided to use a
third-party library for it. ua-parser was chosen for this purpose.
ua-parser comes with a Java build, so it naturally matched the log
processing's Java ecosystem. However, (at least) back then ua-parser
did not offer compelling prebuilt jars, and the versioning and
release cycle of ua-parser's Java part was broken.
The latest release was about a year old, and no proper release was in
sight. So all upstream gave us was a jar versioned as
Deploying such a jar to the cluster is a bad idea, as its name gives
no clue about which commit it is based on. In this concrete setting,
there would be about 250 commits in ua-parser that would produce the
same version number. That would make debugging hard and nix
Since WMF cannot do a proper release for ua-parser, the typical
workaround for WMF in such cases is to create a “wmf” branch in
Gerrit and do “wmf” releases at known commits. And that's what the
ua-parser “fork” in Gerrit does.
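That workaround can be sketched in a local throwaway repository standing in for the ua-parser clone (directory name, file contents, and commit messages are all placeholders):

```shell
# Simulate the "wmf branch" release workaround in a throwaway repo.
git init -q upstream-standin
git -C upstream-standin config user.email you@example.org
git -C upstream-standin config user.name "You"

# Stand-in for the upstream history.
echo 'upstream code' > upstream-standin/code.txt
git -C upstream-standin add code.txt
git -C upstream-standin commit -q -m "upstream work"

# The "fork": a wmf branch whose only extra commit enables a release...
git -C upstream-standin checkout -q -b wmf
echo 'release tweak' > upstream-standin/release.txt
git -C upstream-standin add release.txt
git -C upstream-standin commit -q -m "Allow wmf release"

# ...tagged so the built artifact pins the exact commit it came from.
git -C upstream-standin tag v1.3.0-wmf1
git -C upstream-standin describe --tags
# -> v1.3.0-wmf1
```

The point is that the tag (and hence the artifact name) identifies one known commit, unlike a snapshot jar that hundreds of commits could have produced.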
Comparing upstream with the “fork” in Gerrit, the only difference is:
That commit allows for a wmf release, is tagged 1.3.0-wmf1, and results
in an artifact name of
which (due to the 1.3.0-wmf1 tag) is good for releasing.
As one of the questions in the private email was whether WMF could
switch back to upstream ... I hope you see that WMF never switched
away from upstream and WMF never “forked” upstream. WMF only rolled
their own release.
If upstream now provides proper releases, sure, just use them :-)
* How can I find out who actually created a repository?
Look at the first commit to the meta/config branch. Like here:
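A sketch of that lookup on a local throwaway repository; fetching refs/meta/config is the usual Gerrit step, while all names and messages below are made up:

```shell
# In Gerrit: git fetch origin refs/meta/config && git checkout FETCH_HEAD
# Then the oldest commit on that ref shows who set the repository up.
# Demonstrated here on a throwaway repo:
git init -q demo
git -C demo config user.email creator@example.org
git -C demo config user.name "Creator"
git -C demo commit -q --allow-empty -m "Initial project configuration"
git -C demo config user.email later@example.org
git -C demo config user.name "Later"
git -C demo commit -q --allow-empty -m "Later tweak"

# --reverse lists oldest first, so the first line is the creating commit.
git -C demo log --reverse --format='%an %s' | head -n 1
# -> Creator Initial project configuration
```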
* How can I see the difference between branches?
Use `git cherry` (Yes, really. Just “cherry”, no trailing “-pick”)
An example session is at .
* How could one have found out about the wmf1 thing?
For example from the IRC logs of the day from the commit :
[20:23:08] <ottomata> we can just make wmf1 be our release of the current master?
[20:23:13] <qchris> k
 Back then at
now the relevant repos for WMF seem to be at
 It made it into archiva:
into the refinery-hive jars:
and also to the cluster:
christian@spencer // jobs: 0 // time: 21:40:28 // exit code: 0
git clone https://github.com/tobie/ua-parser
Cloning into 'ua-parser'...
remote: Counting objects: 4507, done.
remote: Total 4507 (delta 0), reused 0 (delta 0), pack-reused 4507
Receiving objects: 100% (4507/4507), 4.31 MiB | 923 KiB/s, done.
Resolving deltas: 100% (2301/2301), done.
christian@spencer // jobs: 0 // time: 21:41:10 // exit code: 0
christian@spencer // jobs: 0 // time: 21:41:14 // exit code: 0
git remote add gerrit https://gerrit.wikimedia.org/r/analytics/ua-parser
christian@spencer // jobs: 0 // time: 21:41:33 // exit code: 0
git fetch gerrit
remote: Finding sources: 100% (4/4)
remote: Total 4 (delta 3), reused 4 (delta 3)
Unpacking objects: 100% (4/4), done.
* [new branch] master -> gerrit/master
* [new branch] wmf -> gerrit/wmf
* [new tag] v1.3.0-wmf1 -> v1.3.0-wmf1
christian@spencer // jobs: 0 // time: 21:41:38 // exit code: 0
git cherry origin/master gerrit/master
christian@spencer // jobs: 0 // time: 21:42:10 // exit code: 0
git cherry origin/master gerrit/wmf
christian@spencer // jobs: 0 // time: 21:42:17 // exit code: 0
git cherry origin/master v1.3.0-wmf1
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Kefermarkterstraße 6a/3, 4293 Gutau, Austria
Email: christian(a)quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Heyo, Discovery team!
This is just a quick writeup of the Scalable Event Systems meeting
that Erik, Dan, Stas and I went to (although just from my perspective).
For people not in the initial thread, this is a proposal to replace
the internal architecture of EventLogging and similar services with
Apache Kafka brokers
(http://www.confluent.io/blog/stream-data-platform-1/). What that
means in practice is that the current 1-2k events/second limit on
EventLogging will disappear, and we can stop worrying about sampling
and accidentally bringing down the system. We can be a lot less
cautious about our schemas and our sampling.
It also offers up a lot of opportunities around streaming data and
making it available in a layered fashion. While I don't think we want
to explore that right now, it's nice to have as an option for when we
better understand our search data and how we can safely share it.
I'd like to thank the Analytics team, particularly Andrew, for putting
this together; it was a super-helpful discussion to be in and this
sort of product is precisely what I, at least, have been hoping for
out of the AnEng brain trust. Full speed ahead!