We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes (a minimal sketch of the first follows the list):
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
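For instance, here is a minimal sketch of the first use case in Python. The file name and the column names (prev_title, curr_title, n) are assumptions based on a typical release layout; check the figshare page for the exact schema of the dump you download.

    import csv
    from collections import Counter

    clicks = Counter()
    # Assumed file name and columns; see the figshare page for the actual schema.
    with open("2015_01_clickstream.tsv", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["prev_title"] == "London":
                clicks[row["curr_title"]] += int(row["n"])

    # The ten links most frequently clicked from the article.
    for article, n in clicks.most_common(10):
        print(n, article)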
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hello,
I work for a consulting firm called Strategy&. We have been engaged by Facebook, on behalf of Internet.org, to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content, defined as 100K+ Wikipedia articles in one's primary language. We have a few questions related to this analysis prior to publishing it:
* We are currently using the article count by language from the Wikimedia Foundation's public list: http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article counts, and does it include stubs?
* Is it possible to get historical data for article counts? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to the ISO 639-3 language codes in Ethnologue (the source we are using for primary language data). The two-letter code for a Wikipedia in the "List of Wikipedias" sometimes, but not always, matches the ISO 639-1 code. Is there an easy way to do the mapping? (A rough sketch of what we mean follows below.)
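To illustrate the kind of mapping we mean, here is a rough Python sketch. The entries shown are only an illustrative subset; a full mapping would load the complete ISO 639 code tables, and the override list is what we would like to avoid maintaining by hand.

    # Illustrative subset of the ISO 639-1 -> 639-3 table; a real run
    # would load the full ISO 639 code tables instead.
    ISO_639_1_TO_3 = {"en": "eng", "de": "deu", "fr": "fra"}

    # Wikipedia codes that do not follow ISO 639-1 need explicit,
    # hand-curated overrides, e.g. 'simple' (Simple English) and
    # 'als' (Alemannic, not Tosk Albanian).
    WIKI_OVERRIDES = {"simple": "eng", "als": "gsw"}

    def wiki_to_iso639_3(wiki_code):
        if wiki_code in WIKI_OVERRIDES:
            return WIKI_OVERRIDES[wiki_code]
        if len(wiki_code) == 3:  # some wikis already use three-letter codes
            return wiki_code
        return ISO_639_1_TO_3.get(wiki_code)

    print(wiki_to_iso639_3("en"), wiki_to_iso639_3("als"))  # eng gsw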
Many Thanks,
Rawia
Formerly Booz & Company
Rawia Abdel Samad
Direct: +9611985655 | Mobile: +97455153807
Email: Rawia.AbdelSamad@strategyand.pwc.com
www.strategyand.com
Hello,
I'm inquiring about the delay in publishing the January compressed Wikistats files that are maintained by Erik Zachte. I'm guessing those processes are given a low priority compared to the content backups that need to run. More generally, I'm interested in finding new ways that I can help out. I'm an ex-Microsoftie who is now on the fraud analytics team at TD Bank. I've been involved with the Wikimedia group in Atlanta: I organize the picnic each summer, and helped get the rest of the historic buildings photographed. I've dabbled in reverting vandalism, and I contribute to articles when I actually have something to contribute. I don't feel like I've settled into a contributor role that really fits me yet, though.
I enjoy using a variety of the traffic data sets that Wikimedia publishes. It seems the traffic servers get bogged down sometimes though. Can I help? Should I try to get the Atlanta group to pool our donations this year for an extra computer?
Thanks,
Michael
Hello,
My username is rbaasland and I would like to contribute to the analytics
project. Could I have access to the project, and how do I go about
contributing?
Thank you very much,
Ron Baasland
I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los Alamos National Laboratory recently submitted to the Wikimedia Analytics Team, aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview data dumps and making them available to the public and the research community [1].
Reid and his team spearheaded the use of the public Wikipedia pageview dumps to monitor and forecast the spread of influenza and other diseases, using language as a proxy for location [2]. This proposal describes an aggregation strategy adding a geographical dimension to the existing dumps.
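To make the aggregation idea concrete, here is a toy Python sketch of one common privacy-preserving pattern: suppressing (article, country) cells below a minimum count. The threshold and the scheme itself are illustrative assumptions, not necessarily what the proposal specifies.

    from collections import Counter

    THRESHOLD = 100  # illustrative minimum count before a cell is released

    def geo_aggregate(requests):
        """requests: iterable of (article, country) pairs from raw logs."""
        counts = Counter(requests)
        # Release only cells large enough that no individual reader
        # can plausibly be singled out from the aggregate.
        return {cell: n for cell, n in counts.items() if n >= THRESHOLD}

    sample = [("Influenza", "US")] * 150 + [("Influenza", "AT")] * 3
    print(geo_aggregate(sample))  # the small AT cell is suppressed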
Feedback on the proposal is welcome on the lists or on the project talk page on Meta [3].
Dario
[1] https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pagev…
[2] http://dx.doi.org/10.1371/journal.pcbi.1003892
[3] https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_…
Thank you! Would you mind posting a note on Analytics@lists.wikimedia.org
when it is working normally again?
On Wed, Feb 11, 2015 at 1:36 PM, Henrik Abelsson <henrik@abelsson.com>
wrote:
> Hi Kevin,
>
> Looking into it!
>
> -henrik
>
>
> On 11/02/15 16:36, Kevin Leduc wrote:
>
> Hi Henrik,
>
> stats.grok.se has been missing data for the last week. Can you restart the
> service to see if that helps?
>
> Thanks!
> Kevin Leduc
> Analytics Product Manager
>
>
>
Hi,
TL;DR: If you think your Hive queries are currently taking longer than
usual, please find qchris in IRC, and if he is not responsive, kindly
ask someone with root on stat1002 (like Ops) to kill the process
java -Dproc_balancer -Xmx1000m [...]
-----------------------------------------------------
Data in the Analytics cluster is not evenly distributed. Some data
nodes are >90% full, while others are half empty.
Data nodes that are >90% full are considered unhealthy and no longer
contribute to the pool of available resources; in particular, they no
longer contribute to the total available memory in the cluster.
There are other motivations too, but that alone is reason enough
to keep the data nodes balanced and hence healthy.
Rebalancing has been running since 2015-02-26, but the situation is
getting worse faster than the rebalancer can catch up.
At one point we were up to 5 unhealthy nodes.
Since we're missing their memory, I decided that we should rebalance
more aggressively. Hence, I bumped the rebalancer's capacity, and
nodes are recovering and getting healthy again.
I am monitoring the increased-capacity rebalancer closely, but in case
you're getting blocked by it without me noticing, please find me in
IRC and let me know, so I can turn the rebalancer's capacity down.
Or if you find me unresponsive, please find someone with root on
stat1002 (like Ops) and ask them to kill the process
java -Dproc_balancer -Xmx1000m [...]
on stat1002.
Have fun,
Christian
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstraße 6a/3      Email: christian@quelltextlich.at
4293 Gutau, Austria          Phone: +43 7946 / 20 5 81
                             Fax: +43 7946 / 20 5 81
                             Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hi,
I'm a student of computational physics from the Czech Republic, and I have sometimes used the data displayed at http://stats.wikimedia.org/wikimedia/squids/SquidReportCountryData.htm for my personal analysis of Wikipedia, just to see how it is used and trending. But it went silent during January, and there are no updates for 2015. Do you plan to publish country data somewhere?
Thank you,
Jakub Havlik
Today WMF Analytics announces a new product: a daily feed of media file
request counts for all Wikimedia projects [1].
The counts are based on unsampled data, so any single request within the
defined scope [2] will contribute to the counts.
It can be seen as complementary to our page view counts files [5].
The file layout is documented on wikitech [3].
Daily counts have been backfilled from January 1, 2015 onwards.
Additionally there is a daily zip file which contains a small subset of
these raw counts: the top 1000 most requested media files, with one csv file
for each column [7]. As these csv files have headers (not so easy to add in
Hive), you may want to start with this file for a first impression (best
opened in a spreadsheet program).
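For a quick look at the raw daily files themselves, a minimal Python sketch follows. The column positions used (file path in the first column, total request count in the third) and the file name are assumptions; the authoritative layout is on the wikitech page [3].

    import bz2
    from collections import Counter

    totals = Counter()
    # Assumed daily file name; see [1] for the actual listing.
    with bz2.open("mediacounts.2015-01-01.v00.tsv.bz2", "rt",
                  encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            # Assumed positions: cols[0] = file path, cols[2] = total requests.
            totals[cols[0]] += int(cols[2])

    # Ten most requested media files for the day.
    for path, n in totals.most_common(10):
        print(n, path)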
The counts are collected from our Hadoop system, using a Hive query, with
data markup done in UDF scripts. This feed hopefully addresses a
long-standing request, expressed often and by many, which we regrettably couldn't
fulfil earlier, as our pre-Hadoop infrastructure and processing capacity
were not up to the task.
An initial draft design (RFC) was presented last November at the Amsterdam
Hackathon 2014 (GLAM and Wikidata).
Online consultation followed, leading to the current design [4].
This is a data feed with production status, but not the final release, as
there is one major issue that hasn't been addressed yet (but progress is
being made):
When using Media Viewer [8] to view images, some images are prefetched for a
better user experience, but these may never be shown to the user. Currently,
those prefetched images are counted, as there is no way to detect
whether an image was actually shown to the user or not.
Gilles Dubuc and other colleagues worked on a solution that would not hamper
performance (a tough challenge) and would help us discern viewed from
non-viewed files. A few days ago a patch was published! Adaptation of the
Hive query will follow later [6]. Relatedly, context tagging isn't
supported yet [9].
Huge thanks to all people who contributed to the process so far, and still
do.
Special thanks to Christian Aistleitner with whom I co-authored the design,
and who also wrote the Hive implementation.
Erik Zachte
[1] http://dumps.wikimedia.org/other/mediacounts/
[2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Filtering
[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mediacounts
[4] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts
[5] https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites
(a new version of this data feed is in the works)
[6] https://phabricator.wikimedia.org/T89088
[7] Before you ask: no plans yet for further aggregation into monthly or
yearly top ranking files. The current csv files are quick wins, using
standard Linux tools.
[8] https://www.mediawiki.org/wiki/Multimedia/Media_Viewer
[9] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#by_context