Thanks a lot for the appreciation.
As Sajjad mentioned, we have already obtained an edit-per-location
dataset from Evan (Rosen) that has the following column structure:
*start* and *end* denote the beginning and ending date for counting the
number of edits, and *ts* is time stamp.
The *fraction*, however, gives a national ratio of edit activity; that
is, it gives 'total edits from that city for that language
Wikipedia project' divided by 'total edits from that country for that
language Wikipedia project'. Hence, it cannot be used to understand
global edit contributions to a Wikipedia project (for a time period).
It seems that the original data (from which this dataset was extracted)
should also have the global fractions -- total edits from a city divided
by total edits from the whole world, for a project, for a time period.
Would you know if the global fractions can also be derived from the XML
dumps? Or, even better, is the relevant raw data available in CSV form
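To make the distinction concrete, here is a minimal Python sketch of both ratios. It assumes that absolute edit counts per city are available for one time period; the tuple layout (project, country, city, edits) and the sample numbers are made up for illustration and are not the actual columns or values of Evan's dataset.

```python
from collections import defaultdict


def edit_fractions(rows):
    """Compute national and global edit fractions per city.

    rows: iterable of (project, country, city, edits) tuples for ONE time
    period, with one row per city (hypothetical layout, see lead-in).
    Returns {(project, country, city): (national_fraction, global_fraction)}.
    """
    rows = list(rows)
    country_totals = defaultdict(int)   # edits per (project, country)
    project_totals = defaultdict(int)   # edits per project, worldwide
    for project, country, _city, edits in rows:
        country_totals[(project, country)] += edits
        project_totals[project] += edits

    fractions = {}
    for project, country, city, edits in rows:
        national_total = country_totals[(project, country)]
        global_total = project_totals[project]
        national = edits / national_total if national_total else 0.0
        worldwide = edits / global_total if global_total else 0.0
        fractions[(project, country, city)] = (national, worldwide)
    return fractions


# Invented sample numbers, for illustration only.
sample_rows = [
    ("hi", "IN", "Delhi", 30),
    ("hi", "IN", "Mumbai", 10),
    ("hi", "US", "New York", 10),
]
sample_fractions = edit_fractions(sample_rows)
```

In this invented sample, Delhi accounts for 30 of 40 edits from its country but 30 of 50 edits worldwide, which is why the national fraction alone cannot stand in for the global one.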
On Wednesday 15 May 2013 12:32 AM, analytics-request(a)lists.wikimedia.org
> Date: Tue, 14 May 2013 19:40:00 +0200
> From: "Erik Zachte" <ezachte(a)wikimedia.org>
> To: "'A mailing list for the Analytics Team at WMF and everybody who
> has an interest in Wikipedia and analytics.'"
> Subject: Re: [Analytics] Visualizing Indic Wikipedia projects.
> Message-ID: <016f01ce50ca$0fe736b0$2fb5a410$(a)wikimedia.org>
> Content-Type: text/plain; charset="iso-8859-1"
> Awesome work! I like the flexibility of the charts, easy to switch metrics
> and presentation mode.
> 1. WMF has never captured ip->geo data on city level, but afaik this is
> going to change with Kraken.
> 2. Total edits per article per year can be derived from the xml dumps. I may
> have some csv data that come in handy.
> For edit wars you need to track reverts on a per-article basis, right? That
> can also be derived from dumps.
> For long history you need full archive dumps and need to calc checksum per
> revision text. (stub dumps have checksum but only for last year or two)
> Erik Zachte
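As an illustration of the checksum approach Erik describes, here is a minimal Python sketch that flags a revision as a revert when its text checksum matches an earlier, non-adjacent revision of the same article. It assumes the revision texts of one article are already available in chronological order; the function name and the use of SHA-1 over UTF-8 text are choices made for this sketch, not a description of how the dumps are processed at WMF.

```python
import hashlib


def find_identity_reverts(revision_texts):
    """Return (reverting_index, restored_index) pairs for one article.

    revision_texts: revision texts in chronological order.
    A revision counts as a revert when its checksum equals that of an
    earlier revision and at least one other revision lies in between.
    """
    last_seen = {}   # checksum -> index of most recent revision with it
    reverts = []
    for index, text in enumerate(revision_texts):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        previous = last_seen.get(digest)
        if previous is not None and previous < index - 1:
            reverts.append((index, previous))
        last_seen[digest] = index
    return reverts


# Revision 2 restores the text of revision 0; revision 4 merely repeats
# revision 3, so it is not counted as a revert.
example_reverts = find_identity_reverts(["A", "B", "A", "C", "C"])
```

This only detects exact restorations of an earlier text; partial reverts would need a different technique.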
Henrik updated the top view charts and a few days ago foundationwiki was
added to webstatscollector. http://stats.grok.se/www.f/top shows
Most viewed articles in 201304
Rank Article Page views
1 Trang chủ 912
2 Portada galega 324
3 Home 182
4 Local chapters 172
This seems highly unlikely. Is the problem known?
Summary: we have some new stats regarding gadget usage across WMF sites,
but I'd like more analysis of gadget & bot usage.
Oliver Keyes has some code and results up at
analyze "data around gadgets being used on various wikimedia projects":
"GadgetUsage.r is the generation script. It is dependent on (a) access
to the analytics slaves and (b) the list of databases
"gadget_data.tsv is the raw data, consisting of an aggregate number of
users for each preference on each wiki, with preference, wiki and wiki
type (source, wiki, versity, etc) defined.
"gadgets_by_wikis.tsv is a rework of the data to look at what gadgets
are used on multiple wikis, and how many wikis that is. It also includes
an aggregate of the number of users across those wikis using the gadget.
"wikis_by_gadgets.tsv is a rework that looks at the number of distinct
gadgets on each individual wiki. Unsurprisingly there's a power law."
This helps a lot with addressing one of the analytics "dreams" from
https://www.mediawiki.org/wiki/Analytics/Dreams - "What proportion of
logged-in editors have activated any gadgets at all? What are the most
popular gadgets?" However, Oliver's data "is based on preference data -
it may or may not include data for those gadgets set as defaults." So
if someone could improve this to ensure that we appropriately count
gadget usage for gadgets that default to on, that would be very helpful.
My team would also like to know:
* who maintains the most popular gadgets? (so we can invite them to
hackathons, help get them training, get those gadgets localised and
ported to other wikis, and so on)
* when were the gadgets last updated? (so we can identify stale ones
that enthusiastic volunteers could take over maintaining)
* similar stats regarding bot usage -- what bots are making the most
edits, or edits that in aggregate change the most bytes? who owns those
bots? what wikis are they active on? (so we can help maintainers better,
ensure they hear about API breaking changes, etc., and develop a bot
inventory/directory to make it easier for other wikis' users to start
using useful bots)
If there's anyone interested in taking this on, either inside or outside
WMF's Analytics team, that would be great. Otherwise I anticipate that
Engineering Community Team will take it on sometime in the
October-December 2013 period.
Engineering Community Manager
we are currently serving a few hundred graphs, dashboards, ... at
, but running the various scripts that generate them is a bit shaky
and their maintenance is eating up a considerable amount of time.
So in order to better use resources, and limit maintenance work, we're
curious about which parts, URLs, dashboards, graphs, datasources of
the site are actually in use by people in one way or the other.
If you rely on parts, URLs, dashboards, graphs, datasources of
please let us know by August 30.
P.S.: We may think about removing unused parts, or even stop trying to
update them. So if you are using some parts, please do let us know :-)
P.P.S.: We already reached out to the users that we know of. So do not
feel pressed to reply again, if you have already replied to the
private email about this issue.
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Gruendbergstrasze 65a Email: christian(a)quelltextlich.at
4040 Linz, Austria Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
I have been talking with a lot of you in the past months and at Wikimania
about Limn and how to move forward. One of the recurring themes has been
that currently Limn is written in Coco and that significantly hinders
adoption as there are very few Coco developers (Coco is a fork of
I have sent this email to mobile-tech, e2 and e3 mailinglists as well
because there are many developers outside of the Analytics team who use
Limn and I would really like to hear their opinion as well.
So the question I want to pose is:
This question is getting more urgent for two reasons:
1) The Analytics team is going to grow in the coming months and we expect
to start developing features for Limn again, so if we want to drop Coco as
a dependency then this is probably the best time to talk about it.
2) It seems that the community around Coco is stagnant, maybe even in
decline. When visiting https://github.com/satyr/coco you can see that
there are very few commits in the last 4 months. This could either mean
that the language is feature complete and bug free or more likely that the
decline has started. For the long-term prospects of Limn, this is not good
I would like to run a strawpoll and please respond to this thread by
I'm hoping to provide a data stream and archival data for edit conflict
events on *.wikipedias. The short-term goal is to help support further
research into heuristic reconstruction of the article revision graph, see
this paper presented by Jianmin Wu (author CC'ed here):
The only marker I have found so far is, unfortunately, a message emitted
using wfDebug. Do we have an archive of production debug logs, and what is
the process I would follow for proposing a historical experiment or an
ongoing filter using this data?
For anyone who's curious, I think the main string I'm looking for is
"Keeping edit conflict, failed merge.", but it would be worthwhile to
analyze logging from every code path within conflict resolution.
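As a rough illustration of the kind of filter I have in mind, here is a minimal Python sketch that picks out log lines containing that string. The sample lines are invented; I don't know the actual format of the production debug logs, so only the marker string itself is taken from the code.

```python
MARKER = "Keeping edit conflict, failed merge."


def conflict_lines(lines, marker=MARKER):
    """Yield (line_number, line) for every log line containing the marker."""
    for number, line in enumerate(lines, start=1):
        if marker in line:
            yield number, line.rstrip("\n")


# Invented sample lines; the real wfDebug output format may differ.
sample_log = [
    "hypothetical debug line about something else\n",
    "hypothetical prefix: Keeping edit conflict, failed merge.\n",
]
matches = list(conflict_lines(sample_log))
```

Because the function accepts any iterable of lines, the same filter could be applied to an open archive file or to a live stream.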
A belated Sprint email. Apologies!
Slidedeck is available at
## Defects & Features completed (Ready for Showcase/Shipping/Done) during
Sprint ending 2013-08-21 ##
385 | Migration of stat1 (pmtpa) to stat1002 (eqiad) | Infrastructure Task | Diederik van Liere - Analytics | Logging Infrastructure
429 | View detailed list of jobs / requests in queue | Feature | Dario Taraborelli - Product (E3) | Wikimetrics
768 | High Availability NameNode | Infrastructure Task | Operations | Kraken
817 | Use new dclass-api in Kraken | Defect | Tomasz Finc - Product (Mobile Web &
820 | Reportcard June 2013 | Feature | Erik Moeller - Executive Office | Limn
822 | Aggregation of Metric results | Feature | Dario Taraborelli - Product (E3)
824 | Support page | Feature | Frank Schulenburg - Grantmaking | Wikimetrics
827 | Re-enable global south active editors dashboard | Defect | Frank Schulenburg - Grantmaking | Limn Dashboards
1022 | Non-serializable JSON output | Defect | Frank Schulenburg - Grantmaking
933 | Setup Kafka and Camus in Labs | Infrastructure Task | Diederik van Liere - Analytics | Logging Infrastructure
1025 | [Community] Repair gerrit's Json output for gsql | Defect | Platform
1069 | [Bug 52749] 500 error when canceling Google OAuth authentication | van Liere - Analytics | Wikimetrics
1072 | Investigate drop in number of web requests for mobile-100 stream | - Wikipedia Zero | Logging Infrastructure
1077 | Outdated datasources on gp.wmflabs.org | Defect | Jessie Wild - Learning & Evaluation | Limn Dashboards
## Current Sprint (ending 2013-09-04) ##
731 | Reinstall Hadoop Nodes | Kraken | Diederik van Liere - Analytics | 8
704 | Measure pages created by an editor | Wikimetrics | Jaime Anstee -
733 | Reinstall Zookeeper Nodes | Kraken | Diederik van Liere - Analytics | 3
735 | Reinstall Ciscos 1005-1009 | Kraken | Diederik van Liere - Analytics
760 | Debianize Librdkafka | Logging Infrastructure | Operations | 3
823 | Increase unit-test coverage | Wikimetrics | Diederik van Liere - Analytics | 8
1023 | Take mobile jobs off of Hadoop | Kraken | Tomasz Finc - Product (Mobile Web & Apps) | 5
1079 | Monitoring geowiki dashboards | Limn Dashboards | Jessie Wild - Learning &
1081 | Repave Hadoop Cluster | Kraken | Diederik van Liere - Analytics
1089 | Add metadata to CSV output | Wikimetrics | Jaime Anstee - Grantmaking | 1
1092 | Fix global south editor fractions | Limn Dashboards | Jessie Wild - Learning & Evaluation | 1
1093 | Setup kafka failover modes | Diederik van Liere - Analytics
1094 | Create geomap of https failures | Limn Dashboards | Dario Taraborelli - Product (E3) | 1
1111 | Reportcard July 2013 | Limn Dashboards | Erik Moeller - Executive Office | 1
(Number in parentheses) = estimate of complexity
N/E = not estimated;
F = Feature
D = Defect
I = Infrastructure Task
S = Spike
Any mingle card can be accessed using the base url
https://mingle.corp.wikimedia.org/projects/analytics/cards/XYZ where XYZ is
the Mingle card id.
If you have any questions, comments or feedback: please let us know!
Apologies for cross-posting; ideally you should receive this on the
Analytics Mailinglist so we can have one focal point for conversation. If
you are not on the Analytics list then please subscribe at
Hi, I'm scanning the gerrit repo to extract the projects that we want to
scan for our tech community metrics.
There is a lot of stuff under analytics/
Should we include all of it or are there repositories that we can ignore?
What we are discarding in general:
* Upstream projects that we simply repackage or fork with a few patches.
* Data (as opposed to code) that would just bloat our metrics.
* Sandboxes and personal experiments.
PS: this will also be useful to update
https://www.ohloh.net/p/wmf-analytics (are you really managing 1M lines
Technical Contributor Coordinator @ Wikimedia Foundation
A lot of us who write Python code use the multiprocessing module because
it's an easy way to distribute the workload among many CPUs. But when you
do, please do not allocate all cores to your jobs, because it basically
makes a box unavailable to other folks (particularly when your jobs are
long-running). You can use the multiprocessing.cpu_count() function to
determine the number of available cores and subtract 1 or 2 to make sure
that there is some slack available for other processes.
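A minimal sketch of what that could look like is below. The helper name polite_worker_count, the default reserve of 2 cores and the toy square task are only illustrative; adjust the reserve to whatever is reasonable for the box you are on.

```python
import multiprocessing


def polite_worker_count(reserve=2):
    """Number of worker processes that leaves `reserve` cores free.

    Always returns at least 1, so the job still runs on small machines.
    """
    return max(1, multiprocessing.cpu_count() - reserve)


def square(n):
    """Toy stand-in for a real, long-running task."""
    return n * n


if __name__ == "__main__":
    # Size the pool below the core count instead of using the default,
    # which would start one worker per core.
    with multiprocessing.Pool(processes=polite_worker_count()) as pool:
        print(pool.map(square, range(10)))
```

Note that multiprocessing.Pool() without a processes argument uses every core reported by the system, so the explicit argument is what keeps some slack for other users.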