Thanks a lot for the appreciation.
As Sajjad mentioned, we have already obtained an edit-per-location
dataset from Evan (Rosen) that has the following column structure:
*start* and *end* denote the beginning and end dates of the period over
which edits are counted, and *ts* is the timestamp.
The *fraction* column, however, gives a national ratio of edit activity:
that is, it gives 'total edits from that city for that language
Wikipedia project' divided by 'total edits from that country for that
language Wikipedia project'. Hence, it cannot be used to understand
global edit contributions to a Wikipedia project (for a time period).
It seems that the original data (from which this dataset was extracted)
should also yield the global fractions -- total edits from a city divided
by total edits worldwide, for a project, for a time period.
Would you know if the global fractions can also be derived from the XML
dumps? Or, even better, is the relevant raw data available in CSV form?
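To make the distinction concrete, here is a minimal sketch of deriving the global fraction (city edits divided by worldwide edits) for one project and time period. The row layout and names are illustrative assumptions, not the actual structure of Evan's dataset:

```python
# Sketch only: computing global fractions from per-city edit counts.
# The (city, country, edits) tuple layout is a hypothetical stand-in
# for the real dataset's columns.
def global_fractions(rows):
    """rows: (city, country, edits) tuples for one language project and
    one time period. Returns {city: city_edits / worldwide_edits}."""
    rows = list(rows)
    worldwide = sum(edits for _, _, edits in rows)
    return {city: edits / worldwide for city, _, edits in rows}

rows = [("Mumbai", "IN", 300), ("Delhi", "IN", 200), ("London", "GB", 500)]
print(global_fractions(rows))  # {'Mumbai': 0.3, 'Delhi': 0.2, 'London': 0.5}
```

The national fraction in the current dataset would instead divide by the per-country sum (500 for India here), which is why the two cannot be converted into each other without the raw counts.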
On Wednesday 15 May 2013 12:32 AM, analytics-request(a)lists.wikimedia.org wrote:
> Date: Tue, 14 May 2013 19:40:00 +0200
> From: "Erik Zachte" <ezachte(a)wikimedia.org>
> To: "'A mailing list for the Analytics Team at WMF and everybody who
> has an interest in Wikipedia and analytics.'"
> Subject: Re: [Analytics] Visualizing Indic Wikipedia projects.
> Awesome work! I like the flexibility of the charts, easy to switch metrics
> and presentation mode.
> 1. WMF has never captured ip->geo data on city level, but afaik this is
> going to change with Kraken.
> 2. Total edits per article per year can be derived from the xml dumps. I may
> have some csv data that come in handy.
> For edit wars you need to track reverts on a per-article basis, right? That
> can also be derived from dumps.
> For long history you need full archive dumps and need to calculate a checksum
> per revision text (stub dumps have checksums, but only for the last year or two).
> Erik Zachte
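Erik's checksum approach to revert detection could be sketched roughly like this. The list-of-strings history is a simplification; a real script would stream revision texts out of the XML dump rather than hold them in memory:

```python
import hashlib

def find_reverts(revision_texts):
    """revision_texts: full text of each revision, in chronological order.
    A revision whose checksum matches an earlier revision is an identity
    revert back to that earlier state."""
    seen = {}      # checksum -> index of first revision with that text
    reverts = []   # (reverting revision index, index reverted back to)
    for i, text in enumerate(revision_texts):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            reverts.append((i, seen[digest]))
        else:
            seen[digest] = i
    return reverts

history = ["stub", "vandalised", "stub", "expanded"]
print(find_reverts(history))  # [(2, 0)]
```

This only catches exact (identity) reverts, which is what a per-revision checksum can detect; partial reverts would need text diffing.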
On Wed, Nov 13, 2013 at 5:09 PM, Jon Robson <jrobson(a)wikimedia.org> wrote:
> Thanks so much Juliusz for exploring this and great work fixing the
> schema (apologies for me not predicting that might be an issue) and
> sorry for all the pain this must have caused you.
> We can't be the only teams using Limn in the Foundation. It might be
> worth pulling everyone together. Am I right in thinking that Limn is a
> child of the analytics team? Maybe we should at least spend some time with
> them getting our use case resolved. I guess this is why we have an
> analytics department? I can raise this issue in the next Scrum of
> Scrums if it is not resolved by then.
> On Wed, Nov 13, 2013 at 3:54 PM, Juliusz Gonera <jgonera(a)wikimedia.org>
> > For the past few days (or more), graphs at
> > http://mobile-reportcard.wmflabs.org/ have not been updating. The dashboard
> > consists of two parts: Limn, which displays the data, and backend scripts
> > that generate the graph data based on Event Logging data. The issue was
> > caused by two independent problems in the second component:
> > 1. A change to the MobileWebEditing schema was incorrectly addressed in the
> > scripts' config and caused the script to throw an exception.
> > 2. Backend scripts are stupid and not optimized at all.
> > The first thing is fixed. To work around the second, I had to disable
> > updates of the "Editors registered on mobile who made 5+ edits on enwiki
> > (mobile+desktop)" graph for now (the query was timing out and causing an
> > exception too), and removed the performance graph, since we'll be using
> > ganglia (and soon graphite) for that. Graphs should get updated soon.
> > So why are those backend scripts stupid? Because they run every hour and
> > recalculate _all_ the values for every single graph. For example, even
> > though total unique editors for June 2013 will never change, they are
> > recalculated every hour. This was a quick and easy solution at first,
> > but as Event Logging tables keep growing, we add more graphs, and those
> > graphs show more and more data, it no longer performs well.
> > I discussed this briefly with Ori and I think we agree on the general
> > direction. We should definitely schedule some time for working on this.
> > We could start with a spike investigating whether there is a framework for
> > aggregating the sums that we could use and asking what other teams in the
> > foundation use for generating their graph data. The results of this spike
> > and possible following work could be useful not only for the mobile team.
> >  https://gerrit.wikimedia.org/r/#/c/95298/
> > 
> > --
> > Juliusz
Software Engineer, Mobile
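The recomputation problem Juliusz describes (closed periods recalculated every hour) is usually solved by caching finished periods and only re-running the expensive query for the still-open one. A minimal sketch, with hypothetical names and a JSON file standing in for whatever store the real scripts would use:

```python
import datetime
import json
import pathlib

CACHE = pathlib.Path("monthly_totals.json")

def months_up_to(today, start=(2013, 1)):
    """Yield 'YYYY-MM' keys from `start` through the current month."""
    y, m = start
    while (y, m) <= (today.year, today.month):
        yield f"{y:04d}-{m:02d}"
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

def monthly_totals(compute_month, today):
    """compute_month(year, month) stands in for the expensive query.
    Closed months are read from the cache and never recomputed; only the
    current (still changing) month is re-run on each invocation."""
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    current = f"{today.year:04d}-{today.month:02d}"
    for key in months_up_to(today):
        if key not in cache or key == current:
            y, m = map(int, key.split("-"))
            cache[key] = compute_month(y, m)
    CACHE.write_text(json.dumps(cache))
    return cache
```

Run hourly, this does one expensive query per invocation instead of one per month of history, which is the incremental-aggregation direction the thread is pointing at.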
We've been having spikes in our 5xx error logs since yesterday. There
are definitely multiple distinct causes for those, incl. esams network
issues, random people trying to DoS us, MediaWiki bugs that got
backported yesterday etc.
One of the most peculiar causes of errors, though, is requests of this form:
GET \\nki/Random_article HTTP/1.1
That's GET space backslash newline ki/Random_article ("Random_article"
being an example). This makes Varnish think the URL is "\" and
"ki/Random_article HTTP/1.1" some random malformed header and so it
responds with a 503 (and not a 400 -- that's a bug of its own).
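For reference, request lines of this shape can be matched mechanically. This is a sketch against the raw request text; it assumes nothing about any particular log format:

```python
import re

# Matches request lines of the observed shape:
#   GET \<newline>ki/Random_article HTTP/1.1
# i.e. method, space, a lone backslash, a literal newline, then the rest
# of the path. "ki/Random_article" is just the example from the logs.
MALFORMED = re.compile(r"^GET \\\n\S+ HTTP/1\.[01]$")

def is_malformed(request_line):
    return MALFORMED.match(request_line) is not None

print(is_malformed("GET \\\nki/Random_article HTTP/1.1"))  # True
print(is_malformed("GET /wiki/Random_article HTTP/1.1"))   # False
```

Note the embedded newline is exactly why Varnish stops parsing the URL at the backslash and treats the remainder as a (malformed) header.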
The first occurrence of such a request in our logs is
2013-11-25T12:03:45. Before that we had 0 (zero) such requests in our
logs, for all of November that I checked. Since then and until now
we've had 83,010 such requests (about 1/3 of our total 5xx).
I've verified those strange requests coming directly to our frontends --
they are not passing through our SSL terminators or special proxies like
Opera Mini. You can see e.g. a sample filtered pcap at
fenari.wikimedia.org:~faidon/malformed-GET-20131126.pcap (this has
private data, do not share). The packets' TCP checksums are obviously valid.
Those requests are always for en.wikipedia.org articles, no other
languages or projects. They come from all user agents & operating
systems (so, probably not malware). They have all kinds of Referers,
including internal links. About 3/4 are coming from Google, but this
isn't irregular. Some of them have proper Cookies, including session
tokens and such (so, probably not just spoofed UAs).
There are 83,010 such requests, coming from 21,193 unique IPs in 121 different
countries. The distribution by country is the most interesting part;
the top 5 by unique IPs reads:
i.e. 85% comes from India (but not from a particular ISP), in a >24h
The distribution of hits per datacenter is:
78938 eqiad (incl. 72516 for India)
I've been on this for some time and I'm currently out of ideas.
At this point, the only theory I have is that some popular CPE device
or, alternatively, some state surveillance device (e.g. BlueCoat), has gone
haywire and is corrupting HTTP requests (paranoia about state
surveillance was one of the reasons I kept digging). Some parts don't
fit either theory (traffic is distributed across both DCs & multiple
countries for state surveillance; requests are too targeted to enwiki).
Other thoughts? Am I missing something completely obvious?
There are some changes in the Analytics team I'd like to let you know about.
First of all, I'd like to congratulate Dario Taraborelli on his promotion
to Senior Research Scientist, Research and Data team lead. Dario has made
significant contributions to the Foundation and Community's use of data and
analysis over the past several years and has been instrumental in
establishing the charter of the new Research and Data team in Analytics.
I'm really excited about Dario's increased scope and visibility in his new
role both in and outside of the Foundation.
In addition, Aaron Halfaker has been given the new title
of Research Scientist. This new role recognizes his expertise and
scientific contributions as a researcher and now as a full-time member of
the Research and Data team at the Foundation.
Please join me in congratulating Dario and Aaron!
I also want to announce that Diederik van Liere will be leaving the
Analytics team and is exploring other roles within the Foundation.
Currently, he is helping Rob Lanphier with the planning of the Architecture
summit. I will be handling the Product Management duties pending the hiring
of a new Product Owner.
Diederik played a significant role in introducing both Hadoop and Kafka to
the Foundation's technology stack, and in introducing Scrum to the
Analytics team. I would like to thank Diederik for his 2 years of product
leadership in Analytics and his contributions to the Foundation and the
It's not clear if this is a bug or true organic growth, but it seems to
be occurring across multiple Wikipedias (see the rest of the thread).
-------- Original Message --------
Subject: [Wikimedia-l] Increase in page views for the last 3 months
Date: Fri, 22 Nov 2013 23:41:09 +0200
From: Strainu <strainu10(a)gmail.com>
Reply-To: Wikimedia Mailing List <wikimedia-l(a)lists.wikimedia.org>
To: Wikimedia Mailing List <wikimedia-l(a)lists.wikimedia.org>
Looking at the summary reports per language, I've noticed a linear,
significant increase in pageviews for many European-language Wikipedias
(ro, bg, hu, fr) over the last 3 months. This is not happening for
Asian languages or Russian, and is not obvious from the report card.
Has anything changed in the reporting or the visit patterns for these
Wikipedias? It looks pretty weird to have a 100% increase for Romanian
in just 3 months.
On Thu, Nov 21, 2013 at 10:43 AM, EventLogging Alerts
> Chris needs to replace a failed disk on vanadium, which should only
> take 5-10 minutes. I will follow up to indicate the exact start time
> and duration of the outage.
Maintenance complete; outage was 8 minutes: Thu Nov 21 18:48 UTC to 18:56 UTC.
The Research & Data team is currently experimenting with a tool called Trello for tracking progress and simplifying monthly reporting.
We don’t have a good solution for tracking progress on research/data support requests originating from the community or from non-WMF researchers. Using the same board for these requests is not going to work:
- the board is currently set up as read-only for non-WMF users
- it mostly reflects work prioritized by the team as part of our quarterly planning, and it’s not designed as a generic inbox for data requests
- repurposing the board as a generic backlog would set the wrong expectation that the team has the bandwidth or a mandate to support these requests as they come in
What if we set up a public (read/write accessible) board where anyone (including volunteers) can create, pick up, execute and complete requests? The purpose of this would be purely to categorize, track and (self-)assign or reassign tasks: the actual requirements and the output of a request would be hosted on Meta (for example in the Research Index or the Labs2 portal) and/or in a public data repository.
How do people feel about this? We also have a bugzilla component for generic analytics requests that people have been using for a while  but I don’t think it has been particularly successful because BZ is mostly focused on development and bug reports or feature requests for analytics infrastructure.
The bottom line is that I don’t want to create more work for WMF researchers – we are a small team of 2.5 FTE staffers supporting the whole organization, if we exclude WMF analysts who are not part of Analytics – but rather to test whether a lightweight tool like Trello can be used to distribute tasks and track progress on a body of research and data requests.