Analytics January 2015

analytics@lists.wikimedia.org

48 participants
47 discussions

Fwd: Calculating interlinks between Wikipedias
by Neta Livneh 19 Jan '15

Hi,

Amir Aharoni and I thought that this might be interesting for people here.

We wanted to answer the following question: for each language, how many of the articles in the main namespace that appear in one Wikipedia (e.g., FR) also appear in another (e.g., EN)? We calculated this as the percentage of articles that exist in both languages out of the total number of articles in one of the languages (taken from [1]). That is, we calculated intersection(EN, FR)/Count(FR). We did this for all of the languages (287^2) [2].

Results:

1. The co-exist matrix of counts can be found in this Google Spreadsheet <https://docs.google.com/spreadsheets/d/1wj3fPkU8v2-KcEjTNtFLabMTXWyhgvRhgyw…>

   - It was generated on 01/09/2015 using the langlinks table of every wiki. The underlying query is based on this code (%s is the wiki code):

       SELECT '%s' as source, ll_lang as target, COUNT(*) as count
       FROM %s_p.langlinks
       LEFT JOIN %s_p.page ON page_id = ll_from
       WHERE page_namespace = 0
       GROUP BY ll_lang;

   - The links are not symmetrical, but on average there is less than one percent difference between the links from lang A to B compared to lang B to A.
   - However, it isn't perfect. Wikis with fewer than 3500 links (that means the wiki has fewer than 100 articles) have on average over 20% more out links (that is, taken from that language's langlinks table) than in links (other wikis pointing at that language).
   - As the number of langlinks gets bigger (and, in most cases, the size of the wiki), the difference and variance between the in and out links get smaller.
   - Some out links pointed to mistakes (zh-cn, zh-tw, nn). This is fixed.
   - The raw data can be sent on request.

2. A heat map of the co-exist wikis with more than 50,000 articles. It is ordered by size. As I mentioned, the above triangles are not symmetrical because the counts (which are themselves not equal but are close enough) are divided by the number of articles in each wiki. The heat map ranges from red (high level of congruence) to yellow (low level).

[image: Inline image 1]

Points to notice:

1. Most languages have strong connections with English.
2. There is a group of interconnected wikis that are based on Swedish (Dutch, Waray-Waray, Cebuano, Vietnamese, Indonesian, Minangkabau).
3. Piedmontese is highly interconnected with Latin languages, as is Latin itself. On the other hand, Chechen is mostly connected to Russian.
4. Arabic has 8% more in links than out. There isn't one wiki that caused this difference, so it's not a bot.
5. Telugu doesn't have many interlinks, not even to English, Hindi or Bengali.
6. There are other visible strong connections (such as Serbian and Serbo-Croatian) but they are not as surprising.

Thoughts?

Cheers,
Neta

[1] meta.wikimedia.org/wiki/List_of_Wikipedias updated on 01/12/2015.
[2] You might be wondering why we calculated both EN->FR and FR->EN, as there is a 1 to 1 connection between the interlanguage links in Wikidata. We used the data from the langlinks table for every Wikipedia and not from the wiki interlanguage link table. We did so for two reasons: 1) it was computationally easier; 2) we wanted to see if there are any irregularities in the data.
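The ratio described in this post can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' actual script: the function names are hypothetical, the sample numbers are invented, and dividing each (source, target) count by the article count of the source wiki is one reading of intersection(EN, FR)/Count(FR).

```python
def coexistence_ratios(link_counts, article_counts):
    """Share of each source wiki's articles that also exist in the target wiki.

    link_counts: {(source, target): langlinks count}, shaped like the rows
    returned by the query in the post.
    article_counts: {wiki: total number of main-namespace articles}.
    """
    ratios = {}
    for (source, target), shared in link_counts.items():
        total = article_counts.get(source, 0)
        if total > 0:  # skip wikis with an unknown or zero article count
            ratios[(source, target)] = shared / total
    return ratios


def link_asymmetry(link_counts, wiki, other):
    """Relative difference between out links (wiki -> other) and in links."""
    out_links = link_counts[(wiki, other)]
    in_links = link_counts[(other, wiki)]
    return (out_links - in_links) / in_links


# Invented sample numbers, for illustration only.
sample_links = {("fr", "en"): 800, ("en", "fr"): 810}
sample_articles = {"fr": 1000, "en": 4000}

ratios = coexistence_ratios(sample_links, sample_articles)
asymmetry = link_asymmetry(sample_links, "fr", "en")
```

With these made-up inputs, the two directions give different ratios even though the raw counts are close, which matches the explanation of why the heat map is not symmetrical.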

Making EventLogging output to a log file instead of the DB
by Gilles Dubuc 19 Jan '15

This depends on [1] so we're not going to need that immediately, but in order to help Erik Zachte with his RfC [2] to track unique media views in Media Viewer, I'm going to need to use something almost exactly like EventLogging. The main difference being that it should skip writing to the database and write to a log file instead.

That's because we'll be recording around 20-25M image views per day, which would needlessly overload EventLogging for little purpose since the data will be used for offline stats generation and doesn't need to be made available in a relational database. Of course if storage space and EventLogging capacity were no object, we could just use EL and keep the ever-growing table forever, but I have the impression that we want to be reasonable here and only write to a log, since that's what Erik needs.

So here's the question: for a specific schema, can EventLogging work the way it does but only record hits to a log file (maybe it already does that before hitting the DB?) and not write to the DB? If not, how difficult would it be to make EL capable of doing that?

[1] https://phabricator.wikimedia.org/T44815
[2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_coun…
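The behaviour asked about here, log every event but skip the database for selected schemas, could look roughly like the sketch below. It is not EventLogging's code or configuration: the schema names, the LOG_ONLY_SCHEMAS set, the route_event function, and the list standing in for the database are all hypothetical.

```python
import io
import json

# Hypothetical set of schemas that should be written to the log file only.
LOG_ONLY_SCHEMAS = {"MediaViewerImageView"}


def route_event(event, log_file, db_rows):
    """Append the event to the log; store it in the DB unless it is log-only."""
    log_file.write(json.dumps(event, sort_keys=True) + "\n")
    if event.get("schema") not in LOG_ONLY_SCHEMAS:
        db_rows.append(event)  # stand-in for a real database insert


# Demo with an in-memory log file and a list in place of the database.
log_file = io.StringIO()
db_rows = []
route_event({"schema": "MediaViewerImageView", "file": "Example.jpg"}, log_file, db_rows)
route_event({"schema": "SomeOtherSchema", "action": "click"}, log_file, db_rows)
log_lines = log_file.getvalue().splitlines()
```

In this sketch both events end up in the log, while only the second one reaches the stand-in database, so high-volume schemas would add no load on the relational store.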

Hadoop Upgrade and Downtime
by Andrew Otto 16 Jan '15

Hi all,

I’m in the middle of a (slow) upgrade process for the Hadoop cluster. Currently, we are running CDH 5.0.2, and would like to upgrade to CDH 5.3. There are several steps to this process, the first of which is upgrading our OS to Ubuntu Trusty. Along the way, I’m replacing our current NameNodes with different hardware. I am ready to do this now.

I don’t see much opportunity to schedule this over the next couple of weeks, due to All-Hands travel, so I’d like to schedule this for tomorrow morning (Friday January 16th). I expect this to be relatively simple downtime, that will only take a few minutes. Just in case, I’d like to reserve 2 hours of time. So, unless there are serious objections, plan for Hadoop to be offline from

2015-01-16 15:45 - 17:45 UTC

Also, please don’t start jobs before this time slot that you think will take a long time. If there are running jobs, I either can’t shut down the cluster, or I will have to kill the jobs. If I see running jobs, I’ll try to reach out to you before I kill anything.

If anyone is interested in a rough migration plan, it is here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration…

Thanks all!
-Ao

Udp2log TSVs missing 1 hour of data
by Marcel Ruiz Forns 16 Jan '15

Hi,

On Jan 13th 2015 between 22:20 and 23:18 UTC (~1 hour) stat1002 ceased receiving TSV data from udp2log for the following data streams:

- Mobile requests stream
- Pagecounts-raw
- Requests stream
- Zero requests stream

The cause was routing problems in the firewall, introduced by a change to the way iptables rules are created for udp2log. The problem was quickly resolved.

Here is the Phab task: https://phabricator.wikimedia.org/T86973

Apologies for the late email,
Marcel

Office Hours for EventLogging & Dashboarding
by Kevin Leduc 14 Jan '15

Please join the Analytics Engineering team for...

Office Hours: EventLogging & Dashboarding
Hosts: Dan and Nuria
Date: January 14
Time: 20:00 UTC - Convert to Local Time <http://www.timeanddate.com/worldclock/fixedtime.html?msg=EventLogging+and+D…>
Hangout: https://plus.google.com/hangouts/_/wikimedia.org/a-batcave
IRC: #wikimedia-analytics

Description: Teams need metrics on how their product or feature is performing, then they need to visualize those metrics. This is accomplished by instrumenting code with EventLogging, mashing data with some queries and setting up a Limn Dashboard. The Analytics Engineering team is open for office hours to answer questions about the process, help solve any issues and listen to feedback on the process. Feel free to drop in the Google Hangout linked above or ask questions on the IRC channel during our Office Hours.

How to get latest stats?
by రహ్మానుద్దీన్ షేక్ 14 Jan '15

Hello,

I have always seen a discrepancy of about 2000 in the total article count reported for Kannada Wikipedia (knwiki or kn.wikipedia.org). Which count is accurate?

Also, the stats at stats.wikimedia.org are always a month old or more for Indic languages: Telugu (tewiki), Odia (orwiki), Kannada (knwiki), Marathi (mrwiki).

How can I get the latest stats? Are there any API calls for total editors, new editors, active editors, very active editors, article count, new articles per day, edits per month and page views for Indic languages like Kannada, Telugu, Odia, Marathi, etc.?

--
With thanks & regards
*Rahimanuddin Shaik*
నాని

[image: http://upload.wikimedia.org/wikipedia/meta/0/08/Wikipedia-logo-v2_1x.png]
reachout

Imagine a world in which every single human being can freely share in the sum of all knowledge. *Establishing such a world is our commitment.*

Telugu Wikipedia : http://te.wikipedia.org
CIS-A2K Program : http://meta.wikimedia.org/wiki/India_Access_To_Knowledge
CIS-INDIA : http://cis-india.org
A new address for ebooks : http://kinige.com
*Technical help for Telugu speakers - http://techsetu.com*
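For the article count part of this question, one option is the MediaWiki action API, whose siteinfo module returns live site statistics (pages, articles, edits, users, activeusers and so on). The sketch below only builds the request URL and parses a response of that shape; the helper names are hypothetical, the sample numbers are invented, and it does not cover new editors, very active editors or page views.

```python
import json
from urllib.parse import urlencode


def statistics_url(wiki_host):
    """Build a siteinfo statistics request URL for one wiki."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    return "https://%s/w/api.php?%s" % (wiki_host, urlencode(params))


def article_count(response_text):
    """Extract the article count from a siteinfo statistics response."""
    return json.loads(response_text)["query"]["statistics"]["articles"]


url = statistics_url("kn.wikipedia.org")

# Invented sample response with the same structure as the API output.
sample_response = '{"query": {"statistics": {"pages": 50000, "articles": 12345, "activeusers": 150}}}'
count = article_count(sample_response)
```

Fetching the URL with any HTTP client and passing the body to article_count would give the count as the wiki itself currently reports it.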

Beta Labs EventLogging logs
by Ryan Kaldari 14 Jan '15

It seems the EventLogging logs have disappeared from /var/log/upstart/ on Beta Labs (deployment-bastion). Does anyone know where they are now? Kaldari

most clicked links in articles
by Amir E. Aharoni 13 Jan '15

Hi,

Are there metrics about which links in each article are the most clicked?

I think there's a lot to be learned from it:

* Data-driven suggestions for the manual of style about linking (too many and too few links are a perennial topic of argument).
* How people traverse between topics.
* Which terms in the article may need a short explanation in parentheses rather than just a link.
* How far down into the article people bother to read.

Anyway, I think that access to such data can optimize both readership and editing.

And maybe this can be taken right from the logs, without any additional EventLogging.

--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces, I want to live in peace.” – T. Moore
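One way the "taken right from the logs" idea could work is to count (referring article, requested article) pairs. The sketch below assumes the request logs have already been reduced to such pairs, which is an assumption about the data rather than a description of the actual log format; the function names and sample pairs are invented.

```python
from collections import Counter


def count_clicked_links(requests):
    """Count how often each (referring article, requested article) pair occurs.

    requests: iterable of (referer_title, requested_title) pairs, a
    hypothetical pre-processed form of the request logs.
    """
    counts = Counter()
    for referer, target in requests:
        if referer and referer != target:  # skip direct hits and reloads
            counts[(referer, target)] += 1
    return counts


def top_links(counts, article, n=3):
    """Return the n most clicked link targets from one article."""
    from_article = Counter(
        {target: c for (referer, target), c in counts.items() if referer == article}
    )
    return from_article.most_common(n)


# Invented sample data, for illustration only.
sample_requests = [
    ("Paris", "France"),
    ("Paris", "France"),
    ("Paris", "Seine"),
    ("Paris", "France"),
    ("France", "Paris"),
    (None, "Paris"),
]
counts = count_clicked_links(sample_requests)
paris_top = top_links(counts, "Paris")
```

A real analysis would also have to deal with sampling, missing referers and privacy constraints, none of which this sketch addresses.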

Adding “php” key to X-Analytics header
by Christian Aistleitner 13 Jan '15

Hi,

just a quick heads up, that Ops are about to add a “php” key to the X-Analytics header (i.e.: for sampled-1000 logs, hive, ...):

https://gerrit.wikimedia.org/r/#/c/156793/

This header will hold the used PHP implementation [1]. Planned deployment is between 2014-09-01 and 2014-09-02.

Have fun,
Christian

[1] https://wikitech.wikimedia.org/wiki/X-Analytics#Keys

--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3
4293 Gutau, Austria
Email: christian(a)quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------

January 2015 Wikimedia Research Showcase: Felipe Ortega and Benjamin Mako Hill
by Dario Taraborelli 13 Jan '15

The upcoming Wikimedia Research showcase <https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase> (Wednesday January 14, 11.30 PT) will host two guest speakers: Felipe Ortega <https://en.wikipedia.org/wiki/User:GlimmerPhoenix> (University of Madrid) and Benjamin Mako Hill <https://en.wikipedia.org/wiki/User:Benjamin_Mako_Hill> (University of Washington).

As usual, the showcase will be broadcast on YouTube (the livestream link will follow on the list) and we’ll host the Q&A on the #wikimedia-research IRC channel on freenode.

We look forward to seeing you there.

Dario

Functional roles and career paths in Wikipedia
By Felipe Ortega <https://www.mediawiki.org/wiki/User:GlimmerPhoenix>

An understanding of participation dynamics within online production communities requires an examination of the roles assumed by participants. Recent studies have established that the organizational structure of such communities is not flat; rather, participants can take on a variety of well-defined functional roles. What is the nature of functional roles? How have they evolved? And how do participants assume these functions? Prior studies focused primarily on participants' activities, rather than functional roles. Further, extant conceptualizations of role transitions in production communities, such as the Reader to Leader framework, emphasize a single dimension: organizational power, overlooking distinctions between functions. In contrast, in this paper we empirically study the nature and structure of functional roles within Wikipedia, seeking to validate existing theoretical frameworks. The analysis sheds new light on the nature of functional roles, revealing the intricate “career paths” resulting from participants' role transitions.

Free Knowledge Beyond Wikipedia
A conversation facilitated by Benjamin Mako Hill <https://www.mediawiki.org/wiki/User:Benjamin_Mako_Hill>

In some of my research with Leah Buechley <http://mako.cc/academic/buechley_hill_DIS_10.pdf>, I’ve explored the way that increasing engagement and diversity in technology communities often means not just attacking systematic barriers to participation but also designing for new genres and types of engagement. I hope to facilitate a conversation about how WMF might engage new readers by supporting more non-encyclopedic production. I'd like to call out some examples from the new Wikimedia project proposals list <https://meta.wikimedia.org/wiki/Proposals_for_new_projects>, encourage folks to share entirely new ideas, and ask for ideas about how we could dramatically better support Wikipedia's sister projects.
