Hi all,
For those using Hive on stat1002/1004: you might have seen a deprecation
warning when you launch the hive client, saying that it is being replaced
by Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
script
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
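For the curious, the idea behind such a wrapper is simple: prepend the JDBC connection string so users just type `beeline`. A rough sketch in Python follows; the connection URL is a placeholder, not the actual cluster configuration, and the real wrapper may work differently.

```python
# Sketch of a beeline wrapper that fills in the connection string.
# JDBC_URL below is a placeholder, not the real cluster endpoint.
import sys

JDBC_URL = "jdbc:hive2://hive-server.example.org:10000/default"  # placeholder

def beeline_argv(extra_args):
    """Build the beeline command line with the connection string filled in."""
    return ["beeline", "-u", JDBC_URL] + list(extra_args)

# A real wrapper would exec beeline (os.execvp); here we just show
# the resulting command line.
print(" ".join(beeline_argv(sys.argv[1:])))
```

Running the wrapper with `-e "show tables;"` would then pass the query through to beeline along with the preset connection string.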
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770 <http://dx.doi.org/10.6084/m9.figshare.1305770>
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
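As a minimal sketch of the first two use cases above, here is how one might tally the most common referers to a given article from the released TSV. The column names (prev_title, curr_title, n) are assumptions based on the dataset description, not the authoritative schema, and the sample rows are illustrative.

```python
# Tally the most common referers to an article from clickstream TSV data.
# Column names and sample rows are illustrative assumptions.
import csv
import io
from collections import Counter

# In place of the real download, a few sample rows in the assumed format.
sample_tsv = """prev_title\tcurr_title\tn
other-google\tLondon\t1000
Hannibal\tLondon\t257
other-empty\tLondon\t734
London\tRiver_Thames\t500
"""

def top_referers(tsv_file, article, k=3):
    """Return the k most common (referer, count) pairs for an article."""
    counts = Counter()
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        if row["curr_title"] == article:
            counts[row["prev_title"]] += int(row["n"])
    return counts.most_common(k)

print(top_referers(io.StringIO(sample_tsv), "London"))
# [('other-google', 1000), ('other-empty', 734), ('Hannibal', 257)]
```

Swapping the filter to `prev_title` gives the reverse question (which links people click on from a given article), and the full (prev, curr, n) table is exactly the transition-count matrix needed for the Markov chain use case.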
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream <https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream>
Ellery and Dario
Hi all,
the webrequest and pageview_hourly tables on Hive contain the very
useful user_agent_map field, which stores the following data extracted
from the raw user agent (still available as a separate field):
device_family, browser_family, browser_major, os_family, os_major,
os_minor and wmf_app_version. (The Analytics Engineering team has
built a dashboard that uses this data and last month published a
popular blog post about it.) I understand it is mainly based on the
ua-parser library (http://www.uaparser.org/).
In contrast, the event capsule in our EventLogging tables only
contains the raw, unparsed user agent.
* Does anyone on this list have experience in parsing user agents in
EventLogging data for the purpose of detecting browser family, version
etc, and would like to share advice on how to do this most
efficiently? (In the past, I have written some expressions in MySQL to
extract the app version number for the Wikipedia apps. But it seems a
bit of a pain to do that for classifying browsers in general. One
option would be to export the data and use the Python version of
ua-parser, however doing it directly in MySQL would fit better into
existing workflows.)
* Assuming it is technically possible to add such a pre-parsed
user_agent_map field to the EventLogging tables, would other analysts
be interested in using it too?
This came up recently with the Reading web team, for the purpose of
investigating whether certain issues are caused by certain browsers
only. But I imagine it has arisen in other places as well.
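For anyone facing the same problem before a pre-parsed field exists, here is a simplified, regex-based sketch of classifying a raw user agent into a browser family and major version. It is similar in spirit to what ua-parser does but covers only a few common browsers; it is an illustration, not a substitute for the ua-parser rule set.

```python
# Simplified browser classification from a raw user agent string.
# Only a handful of families; not a replacement for ua-parser.
import re

# Order matters: Chrome's UA contains "Safari", and Edge's contains both.
BROWSER_PATTERNS = [
    ("Edge", re.compile(r"Edge/(\d+)")),
    ("Chrome", re.compile(r"Chrome/(\d+)")),
    ("Firefox", re.compile(r"Firefox/(\d+)")),
    ("Safari", re.compile(r"Version/(\d+).*Safari/")),
]

def classify_browser(user_agent):
    """Return (browser_family, browser_major) or ('Other', None)."""
    for family, pattern in BROWSER_PATTERNS:
        match = pattern.search(user_agent)
        if match:
            return family, int(match.group(1))
    return "Other", None

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36")
print(classify_browser(ua))  # ('Chrome', 52)
```

The same ordered-pattern approach could in principle be expressed as nested CASE/REGEXP expressions in MySQL, though as noted above that quickly becomes painful to maintain.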
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
Dear Analytics mailing list,
I am working, along with Issa Rice (cc'ed) on an analysis of changes to
Wikipedia pageviews since December 2007, when pageview statistics first
started being maintained. To help with our analysis, we collected key
events related to changes to user experience on the site as well as to
statistics availability and measurement. We've recorded our findings on this
page in Issa's userspace:
https://en.wikipedia.org/wiki/User:Riceissa/Timeline_of_Wikipedia_analytics
I'd appreciate it if you could highlight:
(a) Factual errors in the material currently in the timeline
(b) Missing events that you think should belong in the timeline, with
regards to the availability of statistics as well as any other events that
affected user experience significantly.
In addition, I had the following question: in the Wikimedia per-article
pageviews API
https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_per_article_project_access_agent_article_granularity_start_end,
where do Wikipedia Zero pageviews get recorded? Do they go under
mobile-web, or mobile-app, or neither?
https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageviews
<https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview> gives some
information on how the underlying pageviews are recorded in the pageviews
dataset, but I wasn't clear on how the pageview REST API processes that
data.
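For reference, the endpoint in question takes project, access, agent, article, granularity, and start/end as path segments. A small helper for building such a request URL is sketched below; the segment order follows the REST API documentation linked above, but please verify against the current docs before relying on it.

```python
# Build a per-article pageviews API request URL from its path segments.
# Segment order is taken from the REST API docs; verify before relying on it.
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(project, access, agent, article, granularity, start, end):
    """Build the request URL; the article title must be URL-encoded."""
    return "/".join([
        BASE, project, access, agent,
        quote(article, safe=""), granularity, start, end,
    ])

url = pageviews_url("en.wikipedia", "mobile-web", "user",
                    "Albert Einstein", "daily", "20160101", "20160131")
print(url)
```

The `access` segment is where the mobile-web / mobile-app distinction from the question is expressed, which is why knowing where Wikipedia Zero traffic lands matters for interpreting the results.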
Thank you!
Vipul
NOTE: We don't intend to move the page to Wikipedia's main space, as we
know it won't meet the notability criterion. The user space was just a
convenient place to store it while taking advantage of MediaWiki's syntax
and Wikipedia's templates.
Hello Analytics,
Wikipedia’s search function exposes several modifiers (
https://www.mediawiki.org/wiki/Help:CirrusSearch)
On the recent German Wikicon there was a workshop on search and several
community members seemed to be enthusiastic about these functions.
I wonder whether there is existing information about the current use of such
queries. I did some research, but could not find much.
Such information could help to improve the search function, since sometimes
a few modifiers are heavily used (despite them being hard to access) and
could e.g. be exposed via the user interface.
Jan
--
Jan Dittrich
UX Design/ User Research
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
http://wikimedia.de
Imagine a world, in which every single human being can freely share in the
sum of all knowledge. That‘s our commitment.
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/029/42207.
Hello,
First of all, many thanks for this wonderful project!
I am writing as I downloaded the July pagecounts data from:
https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-07/
As I was browsing it, I was surprised to notice that some entities, such as
the movie "Suicide Squad", only seem to have gotten very sparse views in
July - see below. In comparison, the view counts for Jason Bourne seem to
have been much higher over the same period. Below are lines from the logs.
Am I doing something wrong?
Many thanks in advance,
Gheorghe
*Suicide Squad*:
(pagecounts-20160727-020000.gz,en Suicide_squad_(film) 1 6614)
(pagecounts-20160727-160000.gz,en Suicide_squad_(film) 1 25599)
(pagecounts-20160728-220000.gz,en Suicide_squad_(film) 2 32210)
(pagecounts-20160731-210000.gz,en Suicide_squad_(film) 11 72721)
*Jason Bourne*:
(pagecounts-20160731-210000.gz,sv Jason_Bourne_(film) 12 124894)
(pagecounts-20160731-210000.gz,tr Jason_Bourne_(film) 78 1852192)
(pagecounts-20160731-220000.gz,en File:Jason_Bourne_(film).jpg 2 19067)
(pagecounts-20160731-220000.gz,en Jason_Bourne_(film) 2119 73275075)
(pagecounts-20160731-220000.gz,en Talk:Jason_Bourne_(film) 1 10059)
(pagecounts-20160731-220000.gz,fr Jason_Bourne_(film) 55 1226127)
(pagecounts-20160731-220000.gz,hu Jason_Bourne_(film) 3 34335)
(pagecounts-20160731-220000.gz,it Jason_Bourne_(film) 29 579129)
(pagecounts-20160731-220000.gz,nl Jason_Bourne_(film) 11 125928)
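One thing worth checking here: pagecounts entries are keyed by the exact title string, so "Suicide_squad_(film)" and "Suicide_Squad_(film)" are counted as separate pages. A minimal parsing sketch follows; each line has the format "project page_title count_of_requests total_bytes", and the count values in the sample below are illustrative, not real data.

```python
# Parse pagecounts-raw lines ("project title count bytes") and sum
# request counts per exact title. Sample counts are illustrative only.
from collections import Counter

sample_lines = [
    "en Suicide_squad_(film) 11 72721",
    "en Suicide_Squad_(film) 5012 61234567",   # note: different case, separate entry
    "sv Jason_Bourne_(film) 12 124894",
]

def aggregate_counts(lines, project="en"):
    """Sum request counts per title for one project."""
    totals = Counter()
    for line in lines:
        proj, title, count, _bytes = line.split(" ", 3)
        if proj == project:
            totals[title] += int(count)
    return totals

totals = aggregate_counts(sample_lines)
print(totals["Suicide_squad_(film)"])  # 11
```

Because the two capitalizations are distinct keys, querying only one spelling can make a heavily viewed article appear to have sparse traffic.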
Hi,
I was wondering where the famous (or rather infamous) 2010 survey is? This
one was conducted by the WMF and showed that women made up less than 13% of WP
contributors (mentioned here
<https://meta.wikimedia.org/wiki/Women_and_Wikimedia_Survey_2011>). I can't
find it anywhere. Help is much appreciated. :)
Best,
Reem
--
*Kind regards,*
*Reem Al-Kashif*
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, September
21, 2016 at 11:30 AM PST (18:30 UTC).
YouTube stream: https://www.youtube.com/watch?v=fTDkVeqjw80
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#September_2016>.
This month's showcase includes:
Finding News Citations for Wikipedia
By *Besnik Fetahu <http://www.l3s.de/~fetahu/> (Leibniz University of
Hannover)*
An important
editing policy in Wikipedia is to provide citations for added statements in
Wikipedia pages, where statements can be arbitrary pieces of text, ranging
from a sentence to a paragraph. In many cases citations are either outdated
or missing altogether. In this work we address the problem of finding and
updating news citations for statements in entity pages. We propose a two-
stage supervised approach for this problem. In the first step, we construct
a classifier to find out whether statements need a news citation or other
kinds of citations (web, book, journal, etc.). In the second step, we
develop a news citation algorithm for Wikipedia statements, which
recommends appropriate citations from a given news collection. Apart from
IR techniques that use the statement to query the news collection, we also
formalize three properties of an appropriate citation, namely: (i) the
citation should entail the Wikipedia statement, (ii) the statement should
be central to the citation, and (iii) the citation should be from an
authoritative source. We perform an extensive evaluation of both steps,
using 20 million articles from a real-world news collection. Our results
are quite promising, and show that we can perform this task with high
precision and at scale.
Designing and Building Online Discussion Systems
By *Amy X. Zhang <http://people.csail.mit.edu/axz/> (MIT)*
Today, conversations are
everywhere on the Internet and come in many different forms. However, there
are still many problems with discussion interfaces today. In my talk, I
will first give an overview of some of the problems with discussion
systems, including difficulty dealing with large scales, which exacerbates
additional problems with navigating deep threads containing lots of
back-and-forth and getting an overall summary of a discussion. Other
problems include dealing with moderation and harassment in discussion
systems and gaining control over filtering, customization, and means of
access. Then I will focus on a few projects I am working on in this space
now. The first is Wikum, a system I developed to allow users to
collaboratively generate a wiki-like summary from threaded discussion. The
second, which I have just begun, is exploring the design space of
presentation and navigation of threaded discussion. I will next discuss
Murmur, a mailing list hybrid system we have built to implement and test
ideas around filtering, customization, and flexibility of access, as well
as combating harassment. Finally, I'll wrap up with what I am working on at
Google Research this summer: developing a taxonomy to describe online forum
discussion and using this information to extract meaningful content useful
for search, summarization of discussions, and characterization of
communities.
Hope to see you there!
Sarah R. Rodlund
Senior Project Coordinator-Engineering, Wikimedia Foundation
srodlund(a)wikimedia.org
Hi everybody,
the Analytics team is going to reboot all the stat hosts (stat1002,
stat1003 and stat1004) and the Hadoop cluster nodes to install new kernels
(security upgrade required). The work will start tomorrow morning (Sep
22nd) at around 9:00 AM CEST.
This work might interfere with ongoing Hadoop jobs or with processes running
on the stat* hosts, so please let me know if there is any reason to postpone
the maintenance.
Please also feel free to reach out to the analytics IRC channel or to me
directly if you have more questions :)
Thanks!
Regards,
Luca