Thanks a lot for the appreciation.
As Sajjad mentioned, we have already obtained an edit-per-location
dataset from Evan (Rosen) that has the following column structure:
*start* and *end* denote the beginning and ending date for counting the
number of edits, and *ts* is the timestamp.
The *fraction* column, however, gives a national ratio of edit activity:
that is, it gives 'total edits from that city for that language Wikipedia
project' divided by 'total edits from that country for that language
Wikipedia project'. Hence, it cannot be used to understand global edit
contributions to a Wikipedia project (for a time period).
It seems that the original data (from which this dataset was extracted)
should also yield the global fractions -- total edits from a city divided
by total edits from the whole world, for a project, for a time period.
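To make the two ratios concrete, here is a minimal sketch in Python/pandas;
the cities, project and numbers are made up for illustration and the column
names are not necessarily those of Evan's dataset:

import pandas as pd

# Hypothetical per-city edit counts (illustrative values only).
edits = pd.DataFrame({
    "city":    ["Mumbai", "Pune", "Chennai", "New York"],
    "country": ["India", "India", "India", "United States"],
    "project": ["hi.wikipedia"] * 4,
    "edits":   [120, 30, 50, 10],
})

# National fraction (what *fraction* holds): city edits / country edits,
# per project.
edits["national_fraction"] = edits["edits"] / edits.groupby(
    ["country", "project"])["edits"].transform("sum")

# Global fraction (what we are after): city edits / worldwide edits,
# per project.
edits["global_fraction"] = edits["edits"] / edits.groupby(
    "project")["edits"].transform("sum")

print(edits)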
Would you know if the global fractions can also be derived from the XML
dumps? Or, even better, is the relevant raw data available in CSV form?
On Wednesday 15 May 2013 12:32 AM, analytics-request(a)lists.wikimedia.org wrote:
> Date: Tue, 14 May 2013 19:40:00 +0200
> From: "Erik Zachte" <ezachte(a)wikimedia.org>
> To: "'A mailing list for the Analytics Team at WMF and everybody who
> has an interest in Wikipedia and analytics.'"
> Subject: Re: [Analytics] Visualizing Indic Wikipedia projects.
> Awesome work! I like the flexibility of the charts, easy to switch metrics
> and presentation mode.
> 1. WMF has never captured ip->geo data at the city level, but AFAIK this is
> going to change with Kraken.
> 2. Total edits per article per year can be derived from the XML dumps. I may
> have some CSV data that could come in handy.
> For edit wars you need to track reverts on a per-article basis, right? That
> can also be derived from the dumps.
> For long history you need the full archive dumps and need to calculate a
> checksum per revision text. (Stub dumps have checksums, but only for the
> last year or two.)
> Erik Zachte
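As a rough illustration of the revert detection Erik describes: hash every
revision's text in a full-history dump and treat a revision whose hash matches
an earlier revision of the same page as a revert back to that state. The dump
path below is a placeholder and the XML handling is simplified, so take it as
a sketch only:

import hashlib
import xml.etree.ElementTree as ET

def iter_reverts(dump_path):
    """Yield (page_title, revision_id, reverted_to_id) for identity reverts."""
    seen = {}               # text sha1 -> first revision id with that text
    page_title = None
    for _event, elem in ET.iterparse(dump_path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]        # strip the export namespace
        if tag == "title":
            page_title, seen = elem.text, {}     # new page: reset seen hashes
        elif tag == "revision":
            rev_id = elem.findtext("{*}id") or ""
            text = elem.findtext("{*}text") or ""
            digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
            if digest in seen:
                yield page_title, rev_id, seen[digest]
            else:
                seen[digest] = rev_id
            elem.clear()                         # keep memory bounded

for page, rev, earlier in iter_reverts("pages-meta-history.xml"):
    print(f"{page}: revision {rev} restores the text of revision {earlier}")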
We spoke a little bit more about getting a queryable public interface for
pageview data up and running and we decided the following:
1) Start importing daily webstatscollector pageview data for 2013 into MySQL
running on Labs (not yet scheduled in a sprint; a rough sketch of this step
follows the list)
2) Make a simple data warehouse schema for the MySQL db (based on the current
schema), e.g. page_id (FK -> Page table)
3) Collect more data points to determine how high a priority mobile site
article pageview counts are, to decide whether we should add them to
webstatscollector or not.
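For step 1, assuming the standard webstatscollector hourly files
(pagecounts-YYYYMMDD-HH0000.gz, one "project page_title count bytes" line per
entry), daily totals could be built roughly like this before loading them into
the Labs MySQL instance; file names and the output layout are illustrative
only, not a decided design:

import glob
import gzip
from collections import Counter

def daily_counts(day):                  # day like "20130101"
    totals = Counter()
    for path in sorted(glob.glob(f"pagecounts-{day}-*.gz")):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) != 4:
                    continue            # skip malformed lines
                project, title, count, _size = parts
                totals[(project, title)] += int(count)
    return totals

# Emit tab-separated rows (date, project, page_title, views), ready for a
# bulk load into whatever schema step 2 settles on.
for (project, title), views in sorted(daily_counts("20130101").items()):
    print(f"20130101\t{project}\t{title}\t{views}")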
Henrik updated the top view charts, and a few days ago foundationwiki was
added to webstatscollector. http://stats.grok.se/www.f/top shows
Most viewed articles in 201304
Rank Article Page views
1 Trang chủ 912
2 Portada galega 324
3 Home 182
4 Local chapters 172
These numbers seem highly unlikely; is this a known problem?
WMF researchers have agreed to participate in an office hour. This
will be in the same format as the meeting we had in April 2013 with
researcher introductions followed by open Q&A and discussion.
The currently scheduled participants are:
* Henrique Andrade, Brazil Data and Experiments Consultant (Grantmaking Catalyst programs)
* Aaron Halfaker, Research Analyst (Analytics)
* Jonathan Morgan, Learning Strategist (Grantmaking Learning and Evaluation)
* Aaron Shaw, Assistant Professor, School of Communication, Northwestern University
* Dario Taraborelli, Senior Research Analyst, Strategy (Analytics)
The meeting will be on IRC in #wikimedia-office on Monday, September 23 at
1800 UTC / 1100 PST. Please spread the word and join if you are interested.
Summary: we have some new stats regarding gadget usage across WMF sites,
but I'd like more analysis of gadget & bot usage.
Oliver Keyes has some code and results up to analyze "data around gadgets
being used on various Wikimedia projects":
"GadgetUsage.r is the generation script. It is dependent on (a) access
to the analytics slaves and (b) the list of databases
"gadget_data.tsv is the raw data, consisting of an aggregate number of
users for each preference on each wiki, with preference, wiki and wiki
type (source, wiki, versity, etc) defined.
"gadgets_by_wikis.tsv is a rework of the data to look at what gadgets
are used on multiple wikis, and how many wikis that is. It also includes
an aggregate of the number of users across those wikis using the gadget.
"wikis_by_gadgets.tsv is a rework that looks at the number of distinct
gadgets on each individual wiki. Unsurprisingly there's a power law."
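Not having dug into the files yet, here is a small sketch of how one might
start poking at them; the column names ("wiki", "gadgets") are guesses at
wikis_by_gadgets.tsv's layout, not its actual headers:

import csv
from collections import Counter

with open("wikis_by_gadgets.tsv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

# Distribution of "distinct gadgets per wiki": if it is roughly power-law
# shaped, most wikis should sit in the lowest buckets.
histogram = Counter(int(row["gadgets"]) for row in rows)
for n_gadgets, n_wikis in sorted(histogram.items()):
    print(f"{n_wikis:4d} wikis expose {n_gadgets} distinct gadgets")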
This helps a lot with addressing one of the analytics "dreams" from
https://www.mediawiki.org/wiki/Analytics/Dreams - "What proportion of
logged-in editors have activated any gadgets at all? What are the most
popular gadgets?" However, Oliver's data "is based on preference data -
it may or may not include data for those gadgets set as defaults." So
if someone could improve this to ensure that we appropriately count
gadget usage for gadgets that default to on, that would be very helpful.
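One hedged idea for that gap: each wiki lists its default-on gadgets in
MediaWiki:Gadgets-definition, so those could be identified separately from the
preference counts. A rough sketch (the parsing below is approximate; the
definition-page syntax has more options than are handled here):

import re
import requests

def default_gadgets(wiki_host):
    url = f"https://{wiki_host}/w/index.php"
    resp = requests.get(url, params={
        "title": "MediaWiki:Gadgets-definition", "action": "raw"})
    resp.raise_for_status()
    defaults = []
    for line in resp.text.splitlines():
        # Lines look roughly like: "* Name [ResourceLoader|default|...]|file.js"
        m = re.match(r"\*\s*([^\[\|]+?)\s*\[([^\]]*)\]", line)
        if m and "default" in m.group(2).split("|"):
            defaults.append(m.group(1))
    return defaults

print(default_gadgets("en.wikipedia.org"))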
My team would also like to know:
* who maintains the most popular gadgets? (so we can invite them to
hackathons, help get them training, get those gadgets localised and
ported to other wikis, and so on)
* when were the gadgets last updated? (so we can identify stale ones
that enthusiastic volunteers could take over maintaining)
* similar stats regarding bot usage -- what bots are making the most
edits, or edits that in aggregate change the most bytes? who owns those
bots? what wikis are they active on? (so we can help maintainers better,
ensure they hear about API breaking changes, etc., and develop a bot
inventory/directory to make it easier for other wikis' users to start
using useful bots)
If there's anyone interested in taking this on, either inside or outside
WMF's Analytics team, that would be great. Otherwise I anticipate that
Engineering Community Team will take it on sometime in the
October-December 2013 period.
Engineering Community Manager
We still have Usermetrics / UMAPI running on stat1001. AFAICT, nobody is
actually using this instance. Dario et al. are using the version installed
on stat1. I would like to move ahead with uninstalling it so we can
start making preparations for deploying Wikimetrics on stat1001 under the
Any objections? Please voice them!
Our current definition of "active editor":
An 'active editor' is a registered (and signed in) person (not known
as a bot) who makes 5 or more edits in any month in mainspace on a
given wiki
is centered around months. That's good.
However, as we are seeing requests to produce daily graphs: how should we
interpret the above definition in terms of active editors for a given day?
Especially: how to do it in a way that blends nicely with the current
monthly definition?
P.S.: I've seen code in our repos that just looks for edits of the
last 30 days. That sounds nice. But if I am doing 3 edits on
2013-07-01, and another 3 on 2013-07-31, I would not be considered an
active editor by this daily approach for any day. However, I'd be an
active editor for July using the monthly definition :-/
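To make the mismatch concrete, a tiny sketch comparing the monthly definition
with a trailing 30-day window (threshold of 5 edits in both cases) for exactly
this pattern of edits:

from datetime import date, timedelta
from collections import Counter

edit_days = [date(2013, 7, 1)] * 3 + [date(2013, 7, 31)] * 3   # 3 + 3 edits

# Monthly definition: 5 or more edits within a calendar month.
per_month = Counter((d.year, d.month) for d in edit_days)
print("active months:", [m for m, n in per_month.items() if n >= 5])

# Daily interpretation: 5 or more edits in the 30 days ending on each day.
def active_on(day, window=30, threshold=5):
    start = day - timedelta(days=window - 1)
    return sum(start <= d <= day for d in edit_days) >= threshold

days = [date(2013, 7, 1) + timedelta(days=i) for i in range(62)]  # Jul + Aug
print("active on any day:", any(active_on(d) for d in days))
# -> active for July as a month, but never active on any single day,
#    because no 30-day window contains both 2013-07-01 and 2013-07-31.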
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Gruendbergstrasze 65a Email: christian(a)quelltextlich.at
4040 Linz, Austria Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63