Analytics January 2015

analytics@lists.wikimedia.org

48 participants
47 discussions

Relevant Content Availability
by Abdel Samad, Rawia 16 Oct '15

Hello,

I work for a consulting firm called Strategy&. We have been engaged by Facebook on behalf of Internet.org to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content, and we define this as 100K+ Wikipedia articles in one's primary language.

We have a few questions related to this analysis prior to publishing it:

* We are currently using the article count by language from the Wikimedia Foundation's public list (source: http://meta.wikimedia.org/wiki/List_of_Wikipedias). Is this a reliable source for article counts, and does it include stubs?
* Is it possible to get historic data for article counts? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to the ISO 639-3 language codes in Ethnologue (the source we are using for primary language data). The two-letter code for a Wikipedia language in the "List of Wikipedias" sometimes, but not always, matches the ISO 639-1 code. Is there an easy way to do the mapping?

Many thanks,
Rawia

Rawia Abdel Samad
Strategy& (formerly Booz & Company)
www.strategyand.com
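To make the metric and the code-mapping question concrete, here is a minimal Python sketch, not a definitive method. The override table is a small illustrative subset of Wikipedia subdomain codes that are not ISO 639-1 codes, the two-letter table is likewise a tiny subset, and the function names and any article counts passed in are hypothetical.

```python
# Threshold from the message: 100K+ articles counts as "relevant content".
ARTICLE_THRESHOLD = 100_000

# Wikipedia subdomain codes that are NOT ISO 639-1 codes.
# Illustrative subset only; a real table would need many more entries.
WIKI_TO_ISO639_3_OVERRIDES = {
    "als": "gsw",         # Alemannic Wikipedia (ISO 639-3 "als" is Tosk Albanian)
    "zh-yue": "yue",      # Cantonese
    "zh-min-nan": "nan",  # Min Nan
    "bat-smg": "sgs",     # Samogitian
}

# ISO 639-1 -> ISO 639-3 for ordinary two-letter codes (tiny subset).
ISO639_1_TO_3 = {"en": "eng", "de": "deu", "fr": "fra"}


def to_iso639_3(wiki_code):
    """Map a Wikipedia language code to ISO 639-3, or None if unknown."""
    if wiki_code in WIKI_TO_ISO639_3_OVERRIDES:
        return WIKI_TO_ISO639_3_OVERRIDES[wiki_code]
    return ISO639_1_TO_3.get(wiki_code)


def languages_meeting_threshold(article_counts, threshold=ARTICLE_THRESHOLD):
    """Return the set of ISO 639-3 codes whose wiki has >= threshold articles."""
    result = set()
    for wiki_code, count in article_counts.items():
        iso_code = to_iso639_3(wiki_code)
        if iso_code is not None and count >= threshold:
            result.add(iso_code)
    return result
```

Checking the overrides before the two-letter table matters because a code such as "als" is a valid ISO 639-3 identifier for a different language, so a naive lookup would silently mislabel it.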

5 6

Contribute
by Ron Baasland 17 Jun '15

Hello, My username is rbaasland and I would like to contribute to the analytics project. I was wondering if I could have access to the project, or how I go about contributing to this project? Thank you very much, Ron Baasland

3 3

Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal
by Dario Taraborelli 14 May '15

I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los Alamos National Laboratory recently submitted to the Wikimedia Analytics Team, aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview data dumps and making them available to the public and the research community. [1]

Reid and his team spearheaded the use of the public Wikipedia pageview dumps to monitor and forecast the spread of influenza and other diseases, using language as a proxy for location. [2] This proposal describes an aggregation strategy that adds a geographical dimension to the existing dumps.

Feedback on the proposal is welcome on the lists or on the project talk page on Meta. [3]

Dario

[1] https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pagev…
[2] http://dx.doi.org/10.1371/journal.pcbi.1003892
[3] https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_…

16 47

Re: [Analytics] Calculating interlinks between Wikipedias (Christy Okpo)
by E.C Okpo 03 Feb '15

Amir and Neta,

This is interesting research! What was the visualization decision process like? I've often seen large interconnections visualized using chord or network diagrams. Did you decide on a heat map because of some peculiarity of this dataset?

Regards,
Christy

2 1

Hive operator precedence
by Nuria Ruiz 31 Jan '15

Team,

Christian just let me know about operator precedence in Hive. Everyone writing queries should read about this, as the precedence is not what you might expect and your query might end up taking forrrr everrrr, making other users unhappy.

https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Queries#Avoiding…

Thanks,
Nuria
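As an illustration of the kind of pitfall this refers to, here is a small Python sketch. The linked section is truncated above, so this assumes the issue is the usual SQL rule that AND binds more tightly than OR; Python's `and`/`or` follow the same relative precedence, which lets the effect be modelled without a cluster. The partition layout (year/month/day) and the predicate are hypothetical.

```python
def unparenthesized(row):
    # Mirrors: WHERE year = 2015 AND month = 1 AND day = 30 OR day = 31
    # which is evaluated as (year AND month AND day = 30) OR (day = 31).
    return (row["year"] == 2015 and row["month"] == 1
            and row["day"] == 30 or row["day"] == 31)


def parenthesized(row):
    # Mirrors: WHERE year = 2015 AND month = 1 AND (day = 30 OR day = 31)
    return (row["year"] == 2015 and row["month"] == 1
            and (row["day"] == 30 or row["day"] == 31))


# Hypothetical set of partitions: 2 years x 2 months x 2 days = 8.
partitions = [
    {"year": y, "month": m, "day": d}
    for y in (2014, 2015)
    for m in (1, 12)
    for d in (30, 31)
]

scanned_without_parens = [p for p in partitions if unparenthesized(p)]
scanned_with_parens = [p for p in partitions if parenthesized(p)]
```

Without parentheses the filter selects every partition with day 31 in any month of any year, which on a real table means reading far more data than intended.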

4 3

webrequest_misc added to Kafka + Hive
by Andrew Otto 30 Jan '15

Hello!

Today qchris and I deployed some changes [1][2] to bring logs from the misc_web varnishes into HDFS and Hive via Kafka. This isn’t a huge deal, but it does mean that we are now collecting webrequest logs for things like phabricator, annual.wikimedia.org, graphite, stats.wikimedia.org, etc.

That is all, :)
-Ao

[1] https://gerrit.wikimedia.org/r/#/c/184183
[2] https://gerrit.wikimedia.org/r/#/c/184191

4 5

Something's up with EventLogging since Jan 7th
by Gilles Dubuc 30 Jan '15

Hi all,

I've tracked down an unexplained EL phenomenon that surfaced as a false trend in our global stats. The data I'm looking at specifically comes from Media Viewer's MultimediaViewerNetworkPerformance_* tables.

Have a look at this graph (the big change is on Jan 7th/8th): https://docs.google.com/spreadsheets/d/1PJsyzAyj74dctGCl4-09L7LS4AMZRh57G56…

It shows how many EL events we've recorded, per client-reported country, over the last 90 days. The sampling factor we use has been constant for each wiki over that period, so the distribution shouldn't evolve drastically, aside from seasonal/local trends. Besides the Ukraine spike on a particular date (probably related to world events), the graph before Jan 7th looks like what you would expect.

Then, following the outage that happened on Jan 7th, not only has the balance completely changed, but it keeps evolving over time: the US and China are holding "higher than normal" levels, while the rest seem to slide below their pre-7th quantities. This tells me that something strange is happening and is probably unresolved.

This balance shifting over time is really problematic for tracking Media Viewer client-side network performance, because Chinese traffic suddenly accounting for a bigger or smaller share of the overall recorded events creates big swings in the global averages/percentiles (since network performance in China is bad).
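The last point can be seen with a little arithmetic. The following Python sketch uses invented numbers (they are not taken from the EventLogging tables) to show how a change in the per-country share of recorded events moves the global mean even when each country's own performance stays exactly the same. The function name and the country keys are illustrative only.

```python
def global_mean(event_counts, mean_latency_ms):
    """Event-weighted mean latency across countries."""
    total_events = sum(event_counts.values())
    weighted_sum = sum(
        event_counts[country] * mean_latency_ms[country]
        for country in event_counts
    )
    return weighted_sum / total_events


# Invented per-country mean latencies, held constant in both scenarios.
latency_ms = {"US": 500, "CN": 2000}

# Invented event counts: only the mix between countries changes.
events_before = {"US": 900, "CN": 100}
events_after = {"US": 800, "CN": 200}

mean_before = global_mean(events_before, latency_ms)  # 650.0
mean_after = global_mean(events_after, latency_ms)    # 800.0
```

Here the global mean rises from 650 ms to 800 ms although nothing got slower for any individual user, which is why per-country breakdowns are safer to track while the event mix is unstable.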

3 7

LinkedIn + Kafka Ecosystem
by Andrew Otto 30 Jan '15

Nice concise article about Kafka usage and plans at LinkedIn: https://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future

1 0

Grafana production configs
by E.C Okpo 30 Jan '15

Hello,

I'm working on adding performance instrumentation to the Parsoid codebase with statsd/node-txstatsd, and then visualizing the metrics via Grafana. I'm at the stage where I'm looking to add the metrics' namespaces and schema to the WMF Grafana configs.

It looks like WMF has Grafana working with Graphite/Carbon as the metrics database and ElasticSearch as the dashboard database. Where can I find the production Carbon config files to input the settings for my metrics?

Also, from my research, WMF's Carbon data retention schema is set to '1m:1y, 10m:10y'. Should I default to this as my retention schema? Note that the metrics are fired off any time the Parsoid API is used, so each datapoint doesn't necessarily represent a minute/second/etc. of data.

Thanks,
Christy
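For reference, a retention string like the one quoted above can be unpacked with a few lines of Python. This is a rough sketch under stated assumptions: it reads each entry as precision:history, treats a year as 365 days, and does not consult any WMF configuration. The helper names are hypothetical.

```python
# Seconds per unit suffix; a year is assumed to be 365 days here.
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 604800,
    "y": 31536000,
}


def parse_duration(text):
    """Convert a string such as '10m' or '1y' into seconds."""
    text = text.strip()
    return int(text[:-1]) * UNIT_SECONDS[text[-1]]


def parse_retentions(spec):
    """Return a list of (seconds_per_point, number_of_points) per archive."""
    archives = []
    for part in spec.split(","):
        precision, history = part.strip().split(":")
        step = parse_duration(precision)
        span = parse_duration(history)
        archives.append((step, span // step))
    return archives


archives = parse_retentions("1m:1y, 10m:10y")
# -> [(60, 525600), (600, 525600)]
```

Each archive holds one value per time slot regardless of how many events fired in that slot, so events that arrive irregularly still end up as one aggregated point per minute in the first archive.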

3 2

pagecounts-raw news
by Andrew Otto 29 Jan '15

Hi all!

Some of you are probably aware of the pagecounts-raw dataset hosted at http://dumps.wikimedia.org/other/pagecounts-raw/. This week, we are making a change to how this dataset is generated. This should be mostly transparent, but an announcement is needed just in case anyone notices any differences.

pagecounts-raw has historically been generated by piping the udp2log webrequest logs into a C program called webstatscollector [1]. This code is fairly old, and the logic it uses to generate pagecounts is out of date. However, since this data has been public for so long, we made an effort to continue to support it as is.

We are still in the process of backfilling, but eventually all pagecounts-raw data after January 1 2015 will be generated from webrequest data stored in HDFS. This data is collected using Kafka, and pagecounts-raw is now generated by Hive. You may see a slight increase in article counts, because the webrequest data in HDFS is less lossy than the udp2log data.

By the way, do you know about the pagecounts-all-sites [2] dataset? pagecounts-all-sites is in a similar format to pagecounts-raw, but comes with more up-to-date pagecount logic. Most importantly, it includes mobile site pagecounts. Perhaps you should use pagecounts-all-sites instead of pagecounts-raw, eh? :)

-Andrew Otto

[1] https://github.com/wikimedia/analytics-webstatscollector
[2] http://dumps.wikimedia.org/other/pagecounts-all-sites/
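For anyone comparing counts before and after the change, a minimal Python sketch of reading such a file follows. It assumes each line has four space-separated fields (project code, page title, request count, bytes transferred); the type and function names are hypothetical and the sample lines in the usage note are made up.

```python
from typing import NamedTuple, Optional


class PageCount(NamedTuple):
    project: str
    title: str
    requests: int
    bytes_sent: int


def parse_line(line: str) -> Optional[PageCount]:
    """Parse one line; return None if it does not have the assumed shape."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None
    project, title, requests, size = parts
    try:
        return PageCount(project, title, int(requests), int(size))
    except ValueError:
        return None


def total_requests(lines, project: str) -> int:
    """Sum request counts for one project code, skipping malformed lines."""
    total = 0
    for line in lines:
        record = parse_line(line)
        if record is not None and record.project == project:
            total += record.requests
    return total
```

With made-up input such as `["en Main_Page 42 1000", "de Hauptseite 7 300"]`, `total_requests(lines, "en")` sums only the first line. Running the same totals over an hour generated by each pipeline is one way to quantify the expected slight increase.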

1 0
