Analytics August 2013

analytics@lists.wikimedia.org

30 participants
33 discussions

Re: [Analytics] Visualizing Indic Wikipedia projects.
by sumandro 13 Mar '14

13 Mar '14

Erik,

Thanks a lot for the appreciation.

As Sajjad mentioned, we have already obtained an edit-per-location dataset from Evan (Rosen) that has the following column structure:

*language,country,city,start,end,fraction,ts*

*start* and *end* denote the beginning and ending dates for counting the number of edits, and *ts* is the timestamp.

The *fraction*, however, gives a national ratio of edit activity; that is, it gives the ratio of 'total edits from that city for that language Wikipedia project' divided by 'total edits from that country for that language Wikipedia project'. Hence, it cannot be used to understand global edit contributions to a Wikipedia project (for a time period).

It seems that the original data (from which this dataset is extracted) should also have the global fractions -- total edits from a city divided by total edits from the whole world, for a project, for a time period.

Would you know if the global fractions can also be derived from the XML dumps? Or, even better, is the relevant raw data available in CSV form somewhere else?

Bests,
sumandro
-------------
sumandro
ajantriks.net

On Wednesday 15 May 2013 12:32 AM, analytics-request(a)lists.wikimedia.org wrote:
> Send Analytics mailing list submissions to
> analytics(a)lists.wikimedia.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.wikimedia.org/mailman/listinfo/analytics
> or, via email, send a message with subject or body 'help' to
> analytics-request(a)lists.wikimedia.org
>
> You can reach the person managing the list at
> analytics-owner(a)lists.wikimedia.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Analytics digest..."
>
> ----------------------------------------------------------------------
>
> Date: Tue, 14 May 2013 19:40:00 +0200
> From: "Erik Zachte" <ezachte(a)wikimedia.org>
> To: "'A mailing list for the Analytics Team at WMF and everybody who
> has an interest in Wikipedia and analytics.'"
> <analytics(a)lists.wikimedia.org>
> Subject: Re: [Analytics] Visualizing Indic Wikipedia projects.
> Message-ID: <016f01ce50ca$0fe736b0$2fb5a410$(a)wikimedia.org>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Awesome work! I like the flexibility of the charts, easy to switch metrics
> and presentation mode.
>
> 1. WMF has never captured ip->geo data on city level, but afaik this is
> going to change with Kraken.
>
> 2. Total edits per article per year can be derived from the xml dumps. I may
> have some csv data that come in handy.
>
> For edit wars you need track reverts on an per article basis, right? That
> can also be derived from dumps.
>
> For long history you need full archive dumps and need to calc checksum per
> revision text. (stub dumps have checksum but only for last year or two)
>
> Erik Zachte
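The difference between the national fraction and the global fraction discussed in this message can be sketched as follows. This is only an illustration under assumptions: it presumes raw per-city edit counts are available for a single time period (the dataset described above carries only the national fraction), and the function name `edit_fractions` and the sample numbers are made up.

```python
from collections import defaultdict


def edit_fractions(rows):
    """Compute national and global edit fractions per city.

    rows: iterable of (language, country, city, edits) tuples covering one
    time period. This input layout is an assumption for illustration; it is
    not the column structure of the dataset described in the message.
    """
    rows = list(rows)
    country_totals = defaultdict(int)  # edits per (language, country)
    project_totals = defaultdict(int)  # edits per language project, worldwide
    for language, country, _city, edits in rows:
        country_totals[(language, country)] += edits
        project_totals[language] += edits

    result = []
    for language, country, city, edits in rows:
        national_total = country_totals[(language, country)]
        global_total = project_totals[language]
        # Guard against empty totals instead of dividing by zero.
        national = edits / national_total if national_total else 0.0
        worldwide = edits / global_total if global_total else 0.0
        result.append((language, country, city, national, worldwide))
    return result


if __name__ == "__main__":
    # Made-up counts, purely to show the two ratios side by side.
    sample = [
        ("bn", "IN", "Kolkata", 60),
        ("bn", "IN", "Delhi", 20),
        ("bn", "BD", "Dhaka", 120),
    ]
    for row in edit_fractions(sample):
        print(row)
```

With counts like these, a city can dominate its country's edits while contributing a much smaller share of the project's worldwide edits, which is why the national fraction alone cannot answer the global question.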

8 10

the use of the templates: comparison between different wikipedias
by Yury Katkov 11 Mar '14

11 Mar '14

Hi everyone!

Has anyone tried to observe how different Wikipedias use templates: how often, what the average depth of template calls is, etc.?

-----
Yury Katkov, WikiVote
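One rough way to approach the question is to count template calls and their nesting in raw wikitext. The sketch below is a simplification under stated assumptions: it only looks at `{{` and `}}` pairs in the page source, so it measures syntactic nesting rather than the depth reached when templates expand other templates, and it does not treat triple-brace parameters or parser functions specially. The function name `template_stats` is hypothetical.

```python
def template_stats(wikitext):
    """Return (number of '{{' openings, maximum nesting depth) for raw wikitext.

    Simplified scan: every '{{' is counted as a template call and every '}}'
    closes the innermost open call.
    """
    depth = 0
    max_depth = 0
    calls = 0
    i = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            calls += 1
            max_depth = max(max_depth, depth)
            i += 2
        elif pair == "}}":
            depth = max(depth - 1, 0)  # tolerate unbalanced closers
            i += 2
        else:
            i += 1
    return calls, max_depth


if __name__ == "__main__":
    sample = "{{Infobox|name={{lang|en|X}}}} text {{cite}}"
    print(template_stats(sample))
```

Averaging these two numbers over a sample of pages from each wiki would give a first comparison; a proper answer would need the parser's view of template expansion.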

5 7

foundationwiki pageviews underreporting
by Federico Leva (Nemo) 31 Oct '13

31 Oct '13

Henrik updated the top view charts, and a few days ago foundationwiki was added to webstatscollector. http://stats.grok.se/www.f/top shows

Most viewed articles in 201304
Rank Article Page views
1 Trang chủ 912
2 Portada galega 324
3 Home 182
4 Local chapters 172
etc.

This seems highly unlikely; is the problem known?

Nemo

3 3

Statistics on gadget & bot usage on all wikis
by Sumana Harihareswara 30 Sep '13

30 Sep '13

Summary: we have some new stats regarding gadget usage across WMF sites, but I'd like more analysis of gadget & bot usage.

Oliver Keyes has some code and results up at https://github.com/Ironholds/MetaAnalysis/tree/master/GadgetUsage to analyze "data around gadgets being used on various wikimedia projects":

"GadgetUsage.r is the generation script. It is dependent on (a) access to the analytics slaves and (b) the list of databases

"gadget_data.tsv is the raw data, consisting of an aggregate number of users for each preference on each wiki, with preference, wiki and wiki type (source, wiki, versity, etc) defined.

"gadgets_by_wikis.tsv is a rework of the data to look at what gadgets are used on multiple wikis, and how many wikis that is. It also includes an aggregate of the number of users across those wikis using the gadget.

"wikis_by_gadgets.tsv is a rework that looks at the number of distinct gadgets on each individual wiki. Unsuprisingly there's a power law."

This helps a lot with addressing one of the analytics "dreams" from https://www.mediawiki.org/wiki/Analytics/Dreams - "What proportion of logged-in editors have activated any gadgets at all? What are the most popular gadgets?"

However, Oliver's data "is based on preference data - it may or may not include data for those gadgets set as defaults." So if someone could improve this to ensure that we appropriately count gadget usage for gadgets that default to on, that would be very helpful.

My team would also like to know:

* who maintains the most popular gadgets? (so we can invite them to hackathons, help get them training, get those gadgets localised and ported to other wikis, and so on)
* when were the gadgets last updated? (so we can identify stale ones that enthusiastic volunteers could take over maintaining)
* similar stats regarding bot usage -- what bots are making the most edits, or edits that in aggregate change the most bytes? who owns those bots? what wikis are they active on? (so we can help maintainers better, ensure they hear about API breaking changes, etc., and develop a bot inventory/directory to make it easier for other wikis' users to start using useful bots)

If there's anyone interested in taking this on, either inside or outside WMF's Analytics team, that would be great. Otherwise I anticipate that Engineering Community Team will take it on sometime in the October-December 2013 period.

--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation
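The "gadgets by wikis" rework described in the quoted README (how many wikis use each gadget, plus the total number of users) could look roughly like the sketch below. It is not Oliver's R script: the column names `preference`, `wiki` and `users`, the function name, and the sample rows are all assumptions made for illustration, and the real gadget_data.tsv may be laid out differently.

```python
import csv
import io
from collections import defaultdict

# Made-up sample standing in for gadget_data.tsv; the header names are assumed.
SAMPLE_TSV = (
    "preference\twiki\tusers\n"
    "gadget-alpha\tenwiki\t500\n"
    "gadget-alpha\tdewiki\t200\n"
    "gadget-beta\tenwiki\t300\n"
)


def gadgets_by_wikis(tsv_file):
    """Return [(gadget, number_of_wikis, total_users)], most widespread first."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    wikis = defaultdict(set)
    users = defaultdict(int)
    for row in reader:
        wikis[row["preference"]].add(row["wiki"])
        users[row["preference"]] += int(row["users"])
    summary = [(gadget, len(wikis[gadget]), users[gadget]) for gadget in wikis]
    # Sort by wiki count, then user count, then name for a stable order.
    summary.sort(key=lambda item: (-item[1], -item[2], item[0]))
    return summary


if __name__ == "__main__":
    for gadget, wiki_count, user_count in gadgets_by_wikis(io.StringIO(SAMPLE_TSV)):
        print(gadget, wiki_count, user_count)
```

As the message notes, counts built from preference data may miss gadgets that are enabled by default, so a summary like this would undercount those unless the defaults are handled separately.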

6 7

Dashboards/graphs/dashsources/... at gp.wmflabs.org
by Christian Aistleitner 20 Sep '13

20 Sep '13

Hi, we are currently serving a few hundred graphs, dashboards, ... at http://gp.wmflabs.org/ , but running the various scripts that generate them is a bit shaky and their maintenance is eating up a considerable amount of time. So in order to better use resources, and limit maintenance work, we're curious about which parts, URLs, dashboards, graphs, datasources of the site are actually in use by people in one way or the other. If you rely on parts, URLs, dashboards, graphs, datasources of http://gp.wmflabs.org/ please let us know by August 30. Best regards, Christian P.S.: We may think about removing unused parts or stop even trying to update them. So if you are using some parts, please do let us know :-) P.P.S.: We already reached out to the users that we know of. So do not feel pressed to reply again, if you have already replied to the private email about this issue. -- ---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ---- Companies' registry: 360296y in Linz Christian Aistleitner Gruendbergstrasze 65a Email: christian(a)quelltextlich.at 4040 Linz, Austria Phone: +43 732 / 26 95 63 Fax: +43 732 / 26 95 63 Homepage: http://quelltextlich.at/ ---------------------------------------------------------------

2 6

Limn: move away from Coco?
by Diederik van Liere 02 Sep '13

02 Sep '13

Heya,

I have been talking with a lot of you in the past months and at Wikimania about Limn and how to move forward. One of the recurring themes has been that Limn is currently written in Coco and that this significantly hinders adoption, as there are very few Coco developers (Coco is a fork of Coffeescript). I have sent this email to the mobile-tech, e2 and e3 mailing lists as well because there are many developers outside of the Analytics team who use Limn and I would really like to hear their opinion as well.

So the question I want to pose is: "Should we recompile Limn to either Coffeescript or Javascript or keep using Coco?"

This question is getting more urgent for two reasons:

1) The Analytics team is going to grow in the coming months and we expect to start developing features for Limn again, and if we want to drop Coco as a dependency then this is probably the best time to talk about it.

2) It seems that the community around Coco is stagnant, maybe even in decline. When visiting https://github.com/satyr/coco you can see that there are very few commits in the last 4 months. This could either mean that the language is feature complete and bug free or, more likely, that the decline has started. For the long-term prospects of Limn, this is not good news.

I would like to run a straw poll: please respond to this thread by answering with either Javascript, Coffeescript or Coco and optionally a short explanation.

Thanks!
D

7 8

Retrieving edit conflict logs
by Adam Wight 01 Sep '13

01 Sep '13

Dear comrades, I'm hoping to provide a data stream and archival data for edit conflict events on *.wikipedias. The short-term goal is to help support further research into heuristic reconstruction of the article revision graph, see this paper presented by Jianmin Wu (author CC'ed here): http://opensym.org/wsos2013/proceedings/p0204-wu.pdf The only marker I have found so far is, unfortunately, a message emitted using wfDebug. Do we have an archive of production debug logs, and what is the process I would follow for proposing a historical experiment or an ongoing filter using this data? For anyone who's curious, I think the main string I'm looking for is "Keeping edit conflict, failed merge.", but it would be worthwhile to analyze logging from every code path within conflict resolution.
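If an archive of debug logs does exist, a first pass over it could be as simple as the sketch below. It assumes plain-text log files with one message per line, which is a guess about the log format rather than something confirmed in this thread; the function name and the sample lines are made up, and only the marker string comes from the message above.

```python
MARKER = "Keeping edit conflict, failed merge."


def conflict_lines(lines):
    """Yield (line_number, text) for every log line containing the marker."""
    for number, line in enumerate(lines, start=1):
        if MARKER in line:
            yield number, line.rstrip("\n")


if __name__ == "__main__":
    # Made-up log lines; a real run would iterate over an open log file instead.
    sample_log = [
        "request started\n",
        "EditPage: Keeping edit conflict, failed merge.\n",
        "request finished\n",
    ]
    for number, text in conflict_lines(sample_log):
        print(number, text)
```

Because the function accepts any iterable of lines, the same filter could be pointed at an open file or at a live stream, which matches the two uses mentioned above (a historical experiment or an ongoing filter).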

3 2

Analytics Showcase Sprint ending August 21st, 2013
by Diederik van Liere 30 Aug '13

30 Aug '13

Hi!

A belated Sprint email. Apologies! Slidedeck is available at https://docs.google.com/a/wikimedia.org/presentation/d/1aGfAPAKMWc9wbVKfKm6…

## Defects & Features completed (Ready for Showcase/Shipping/Done) during Sprint ending 2013-08-21 ##

385 | Migration of stat1 (pmtpa) to stat1002 (eqiad) | Infrastructure Task | Diederik van Liere - Analytics | Logging Infrastructure
429 | View detailed list of jobs / requests in queue | Feature | Dario Taraborelli - Product (E3) | Wikimetrics
768 | High Availability NameNode | Infrastructure Task | Operations | Kraken
817 | Use new dclass-api in Kraken | Defect | Tomasz Finc - Product (Mobile Web & Apps) | Kraken
820 | Reportcard June 2013 | Feature | Erik Moeller - Executive Office | Limn Dashboards
822 | Aggregation of Metric results | Feature | Dario Taraborelli - Product (E3) | Wikimetrics
824 | Support page | Feature | Frank Schulenburg - Grantmaking | Wikimetrics
827 | Re-enable global south active editors dashboard | Defect | Frank Schulenburg - Grantmaking | Limn Dashboards
1022 | Non-serializable JSON output | Defect | Frank Schulenburg - Grantmaking | Wikimetrics
933 | Setup Kafka and Camus in Labs | Infrastructure Task | Diederik van Liere - Analytics | Logging Infrastructure
1025 | [Community] Repair gerrit's Json output for gsql | Defect | Platform Engineering | Misc
1069 | [Bug 52749] 500 error when canceling Google OAuth authentication | Defect | Diederik van Liere - Analytics | Wikimetrics
1072 | Investigate drop in number of web requests for mobile-100 stream | Defect | Amit - Wikipedia Zero | Logging Infrastructure
1077 | Outdated datasources on gp.wmflabs.org | Defect | Jessie Wild - Learning & Evaluation | Limn Dashboards

## Current Sprint (ending 2013-09-04) ##

731 | Reinstall Hadoop Nodes | Kraken | Diederik van Liere - Analytics | 8
704 | Measure pages created by an editor | Wikimetrics | Jaime Anstee - Grantmaking | 5
733 | Reinstall Zookeeper Nodes | Kraken | Diederik van Liere - Analytics | 3
735 | Reinstall Ciscos 1005-1009 | Kraken | Diederik van Liere - Analytics |
760 | Debianize Librdkafka | Logging Infrastructure | Operations | 3
823 | Increase unit-test coverage | Wikimetrics | Diederik van Liere - Analytics | 8
1023 | Take mobile jobs off of Hadoop | Kraken | Tomasz Finc - Product (Mobile Web & Apps) | 5
1079 | Monitoring geowiki dashboards | Limn Dashboards | Jessie Wild - Learning & Evaluation | 8
1081 | Repave Hadoop Cluster | Kraken | Diederik van Liere - Analytics |
1089 | Add metadata to CSV output | Wikimetrics | Jaime Anstee - Grantmaking | 1
1092 | Fix global south editor fractions | Limn Dashboards | Jessie Wild - Learning & Evaluation | 1
1093 | Setup kafka failover modes | | Diederik van Liere - Analytics |
1094 | Create geomap of https failures | Limn Dashboards | Dario Taraborelli - Product (E3) | 1
1111 | Reportcard July 2013 | Limn Dashboards | Erik Moeller - Executive Office | 1

(Number in parentheses) = estimate of complexity; N/E = not estimated
F = Feature; D = Defect; I = Infrastructure Task; S = Spike

Any mingle card can be accessed using the base url https://mingle.corp.wikimedia.org/projects/analytics/cards/XYZ where XYZ is the Mingle card id.

If you have any questions, comments or feedback: please let us know!

Apologies for cross-posting; ideally you should receive this on the Analytics Mailinglist so we can have one focal point for conversation. If you are not on the Analytics list then please subscribe at https://lists.wikimedia.org/mailman/listinfo/analytics

Best,
Diederik

1 0

Analyzing analytics/ repos
by Quim Gil 29 Aug '13

29 Aug '13

Hi,

I'm scanning the gerrit repo to extract the projects that we want to scan for our tech community metrics. There is a lot of stuff under analytics/

https://gerrit.wikimedia.org/r/#/admin/projects/?filter=analytics

Should we include all of it or are there repositories that we can ignore?

What we are discarding in general:

* Upstream projects that we simply repackage or fork with a few patches.
* Data (as opposed to code) that would just bloat our metrics.
* Sandboxes and personal experiments.

PS: this will also be useful to update https://www.ohloh.net/p/wmf-analytics (are you really managing 1M lines of code?)

--
Quim Gil
Technical Contributor Coordinator @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil

2 3

Use of Python multiprocessing module
by Diederik van Liere 29 Aug '13

29 Aug '13

Heya,

A lot of us who write Python code use the multiprocessing module because it's an easy way to distribute the workload among many CPUs. But when you do, please do not allocate all cores to your jobs, because that basically makes a box unavailable to other folks (particularly when your jobs are long-running). You can use the multiprocessing.cpu_count() function to determine the number of cores on the machine and subtract 1 or 2 to make sure that there is some slack available for other processes.

thx!
D
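A minimal sketch of that advice, assuming a simple CPU-bound job: the helper name `polite_worker_count`, the default reserve of 2 cores, and the toy `square` task are illustrative choices, not something prescribed in the message.

```python
import multiprocessing


def polite_worker_count(reserve=2):
    """Number of worker processes that leaves `reserve` cores for other users.

    Always returns at least 1, so the job still runs on small machines.
    """
    total = multiprocessing.cpu_count()  # cores on the machine, not idle cores
    return max(1, total - reserve)


def square(n):
    """Stand-in for a real CPU-bound task."""
    return n * n


if __name__ == "__main__":
    workers = polite_worker_count()
    with multiprocessing.Pool(processes=workers) as pool:
        results = pool.map(square, range(10))
    print(workers, results)
```

Note that cpu_count() reports how many cores the box has, regardless of how busy they are, so on a shared machine a larger reserve may be the more considerate choice for long-running jobs.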

3 3
