Hi all,
If you use Hive on stat1002/1004, you may have seen a deprecation warning
when you launch the hive client, saying that it is being replaced by
Beeline. The Beeline shell has always been available, but it required
supplying a database connection string every time, which was pretty
annoying. We now have a wrapper
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
script set up to make this easier. The old Hive CLI will continue to exist,
but we encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
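If you script your queries, something like the sketch below should also
work from Python on those boxes. It assumes the wrapper passes standard
Beeline flags such as -e straight through to Beeline, which is worth
double-checking with `beeline --help`:

    import subprocess

    # Run an illustrative HiveQL statement through the beeline wrapper and
    # capture the output; --silent=true trims Beeline's status chatter.
    result = subprocess.run(
        ['beeline', '--silent=true', '-e', 'SHOW DATABASES;'],
        capture_output=True, text=True, check=True)
    print(result.stdout)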
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
Hello!
The Analytics team would like to announce that we have migrated the
reportcard to a new domain:
https://analytics.wikimedia.org/dashboards/reportcard/#pageviews-july-2015-…
The migrated reportcard includes both legacy and current pageview data,
daily unique devices, and new editor data. Pageview and device data are
updated daily, but editor data is still updated ad hoc.
The team is currently revamping the way we compute edit data, and we hope
to provide monthly updates for the main edit metrics this quarter. Some of
those will be visible in the reportcard, but the new Wikistats will have
more detailed reports.
You can follow the new wikistats project here:
https://phabricator.wikimedia.org/T130256
Thanks,
Nuria
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia (a small sketch follows this list)
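Here is that sketch, in Python. It assumes a tab-separated file whose rows
hold a referer title, an article title, and a count; the exact columns and
file name vary by release, so check the header of the file you download.

    import csv
    from collections import Counter, defaultdict

    incoming = defaultdict(Counter)  # article -> Counter of referers
    outgoing = defaultdict(Counter)  # article -> Counter of clicked targets

    # 'clickstream.tsv' and the three-column layout are assumptions.
    with open('clickstream.tsv', encoding='utf-8') as f:
        reader = csv.reader(f, delimiter='\t')
        next(reader)  # skip the header row
        for row in reader:
            if len(row) != 3:
                continue  # skip rows that don't match the assumed layout
            referer, article, count = row
            n = int(count)
            incoming[article][referer] += n
            outgoing[referer][article] += n

    # Most common ways readers reached "London", and the links clicked from it.
    print(incoming['London'].most_common(10))
    print(outgoing['London'].most_common(10))

    # Normalizing the outgoing counts gives one row of a Markov chain's
    # transition matrix.
    total = sum(outgoing['London'].values())
    print({target: n / total for target, n in outgoing['London'].items()})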
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi everyone!
Wikimedia is releasing a new service today: EventStreams
<https://wikitech.wikimedia.org/wiki/EventStreams>. This service allows us
to publish arbitrary streams of JSON event data to the public. Initially,
the only stream available will be good ol’ RecentChanges
<https://www.mediawiki.org/wiki/Manual:RCFeed>. This event stream overlaps
functionality already provided by irc.wikimedia.org and RCStream
<https://wikitech.wikimedia.org/wiki/RCStream>. However, this new service
has advantages over these (now deprecated) services.
1. We can expose more than just RecentChanges.
2. Events are delivered over streaming HTTP (chunked transfer) instead of
IRC or socket.io. This requires less client side code and fewer special
routing cases on the server side.
3. Streams can be resumed from the past. By using EventSource, a
disconnected client will automatically resume the stream from where it left
off, as long as it resumes within one week. In the future, we would like
to allow users to specify historical timestamps from which they would like
to begin consuming, if this proves safe and tractable.
I did say deprecated! Okay okay, we may never be able to fully deprecate
irc.wikimedia.org. It’s used by too many (probably sentient by now) bots
out there. We do plan to obsolete RCStream, and to turn it off in a
reasonable amount of time. The deadline iiiiiis July 7th, 2017. All
services that rely on RCStream should migrate to the HTTP based
EventStreams service by this date. We are committed to assisting you in
this transition, so let us know how we can help.
Unfortunately, unlike RCStream, EventStreams does not have server-side
event filtering (e.g. by wiki) quite yet. Whether and how this should be
done is still under discussion <https://phabricator.wikimedia.org/T152731>.
The RecentChanges data you are used to remains the same, and is available
at https://stream.wikimedia.org/v2/stream/recentchange. However, we may
have something different for you, if you find it useful. We have been
internally producing new MediaWiki-specific events
<https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema…>
for a while now, and could expose these via EventStreams as well.
Take a look at these events, and tell us what you think. Would you find
them useful? How would you like to subscribe to them? Individually as
separate streams, or would you like to be able to compose multiple event
types into a single stream via an API? These things are all possible.
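If you want to try it out, here is a minimal sketch of consuming the
RecentChanges stream from Python. It uses the third-party sseclient
package as one example EventSource client (that particular library choice
is mine, for illustration only); filtering by wiki happens client side,
since server-side filtering isn't available yet.

    import json
    from sseclient import SSEClient  # pip install sseclient

    STREAM_URL = 'https://stream.wikimedia.org/v2/stream/recentchange'

    for event in SSEClient(STREAM_URL):
        # Keep-alive events arrive with empty data; skip them.
        if event.event != 'message' or not event.data:
            continue
        change = json.loads(event.data)
        # Filter client side, since the service can't filter by wiki yet.
        if change.get('wiki') == 'enwiki':
            print(change.get('timestamp'), change.get('title'), change.get('user'))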
I asked for a lot of feedback in the above paragraphs. Let’s try and
centralize this discussion over on the mediawiki.org EventStreams talk page
<https://www.mediawiki.org/wiki/Talk:EventStreams>. In summary, the
questions are:
- What RCStream clients do you maintain, and how can we help you migrate
to EventStreams? <https://www.mediawiki.org/wiki/Topic:Tkjkee2j684hkwc9>
- Is server side filtering, by wiki or arbitrary event field, useful to
you? <https://www.mediawiki.org/wiki/Topic:Tkjkabtyakpm967t>
- Would you like to consume streams other than RecentChanges?
<https://www.mediawiki.org/wiki/Topic:Tkjk4ezxb4u01a61> (Currently
available events are described here
<https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema…>.)
Thanks!
- Andrew Otto
Hello,
I recently started working at Wikimedia Germany. My focus is new editor
retention work in the German community.
In that role, my team created a tool to count views of the videos that
potential new editors watch on a wiki page. If you're interested, you can
find it here: https://tools.wmflabs.org/commons-video-clicks/ .
I have two questions about the data used in that tool and hope you can
help me with them.
First, the tool uses the following query to get the data in JSON from the
WMF database and display it in the tool:
https://tools.wmflabs.org/mediaplaycounts/api/1/FilePlaycount/date_range/How_Wikipedia_contributes_to_free_knowledge.webm/20170501/20170503
A colleague told me that the consecutive dates in the JSON output indicate
that the data is updated daily. The problem is that from June on there is
no data for the views. E.g., the following query returns no data:
https://tools.wmflabs.org/mediaplaycounts/api/1/FilePlaycount/date_range/How_Wikipedia_contributes_to_free_knowledge.webm/20170601/20170603
Do you have more information on the update rates or on the missing updates
of the data itself?
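For reference, this is roughly how we fetch the data from Python (a
simplified sketch, not the tool's actual code; the URL pattern is the one
shown above):

    import requests

    BASE = 'https://tools.wmflabs.org/mediaplaycounts/api/1/FilePlaycount/date_range'

    def play_counts(filename, start, end):
        """Fetch play counts for `filename` between start and end (YYYYMMDD)."""
        url = '{}/{}/{}/{}'.format(BASE, filename, start, end)
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.json()

    # The May range returns data; the June range comes back empty for us.
    print(play_counts('How_Wikipedia_contributes_to_free_knowledge.webm',
                      '20170501', '20170503'))
    print(play_counts('How_Wikipedia_contributes_to_free_knowledge.webm',
                      '20170601', '20170603'))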
Second, do you know which time zone the dates in the database are in?
I would love to get feedback on whether this message reached you and if or
when you could help me with this. Maybe you know someone else who can
support me here?
Many thanks in advance and best regards,
Stefan
--
Stefan Schneider
Project Assistant, New Volunteers
Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0
http://wikimedia.de
Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us make it happen!
http://spenden.wikimedia.de/
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as charitable by the
Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.
Hi Everybody,
I just ran an experiment that surprised me and I thought folks on this list
would find interesting.
*tl;dr* We found that navigation vector embeddings for articles (as
produced by Ellery Wulczyn
<https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors>)
outperform content-based vector embeddings (word2vec on article text), with
62% vs 37% accuracy in a task-based user study. I've volunteered to help with the
engineering to productionize navigation embedding
<https://phabricator.wikimedia.org/T158972> and this study reinforces my
eagerness to get navigation vectors out in the world!
*More detail: *The maps we use in Cartograph (cartograph.info) are almost
entirely built on "embedding" vectors for articles. We experimented with
two word2vec-based embeddings: *content vectors* mined from article text
and link structure, and *navigation vectors* mined from user browsing
sessions. For the latter, we used Ellery Wulczyn's navigation vectors
<https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors>.
From staring at maps, our intuition was that the navigation vectors seemed
better in "preference spaces" where the human taste space wasn't
necessarily easily encoded into Wikipedia text.
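(If you want to poke at the two spaces yourself, here is a minimal gensim
sketch. The file names, and the assumption that both embeddings are
available in word2vec text format keyed by article title, are illustrative;
this is not Cartograph's actual pipeline.)

    from gensim.models import KeyedVectors

    # Hypothetical file names; both embeddings assumed in word2vec text format.
    content_vecs = KeyedVectors.load_word2vec_format('content_vectors.txt')
    nav_vecs = KeyedVectors.load_word2vec_format('navigation_vectors.txt')

    # Compare the nearest neighbors of the same article in each space.
    title = 'Rocky_II'
    print('content neighbors:   ', content_vecs.most_similar(title, topn=5))
    print('navigation neighbors:', nav_vecs.most_similar(title, topn=5))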
Last weekend we ran a Mechanical Turk experiment to test this intuition. We
created two Cartograph maps of movies: one built on navigation vectors and
one built on content vectors. We identified 40 relatively popular movies
that were not close neighbors in either map (i.e. cities that were not too
close to each other) and ran a Mechanical Turk study using the maps.
For each Turker, we randomly selected 5 seen movies (out of the 30), and
asked them to evaluate maps for each movie. For each movie city, we showed
the map region around the city, but hid the city and asked them to guess
the city from a list of 12 movies they had seen (screenshot below). We
added trivial validation questions using sequels to ensure Turkers were
working in good faith (e.g. showing a map for "Rocky II" that had "Rocky"
at the center).
Result: Turkers exhibited 62% accuracy with the navigation vectors and 37%
accuracy with content vectors. We want to conduct several follow-up studies
to understand different subject areas and parameter settings and user
tasks, but the difference in performance was striking.
Our study shows the value of navigation vectors and makes me super excited
to contribute to the engineering needed to get them out to the world on a
regular basis. Imagine if every researcher and practitioner who uses
word2vec now on Wikipedia content switches to navigation vectors. That's a
huge audience!
Feedback and questions welcome!
-Shilad
--
Shilad W. Sen
Associate Professor
Mathematics, Statistics, and Computer Science Dept.
Macalester College
Senior Research Fellow, Target Corporation
ssen(a)macalester.edu
http://www.shilad.com
https://www.linkedin.com/in/shilad
651-696-6273
Hi everybody,
the Analytics team is working on some ALTER TABLE changes to the
EventLogging 'log' database on analytics-store (dbstore1002) and
analytics-slave (db1047) as part of https://phabricator.wikimedia.org/T167162.
The full list of alter tables is here:
https://phabricator.wikimedia.org/P5570
This should be a transparent change, but I thought it would be better to
keep all of you informed in case of unintended regressions or side effects.
The context for the alter tables is in T167162, but the TL;DR is that we
need nullable attributes across all the EL tables (except fields like id,
uuid and timestamp) to be able to sanitize data with our new
eventlogging_cleaner script (https://phabricator.wikimedia.org/T156933).
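As a rough illustration of what the nullable columns enable (a hedged
sketch of the idea only, not the eventlogging_cleaner implementation; host,
credentials, table, column and the retention window below are placeholders):

    import datetime
    import pymysql

    # Placeholder cutoff, 90 days back, in MediaWiki-style 14-character
    # timestamp format (an assumption about the EL timestamp columns).
    cutoff = (datetime.datetime.utcnow()
              - datetime.timedelta(days=90)).strftime('%Y%m%d%H%M%S')

    conn = pymysql.connect(host='example-host', user='example',
                           password='...', db='log')
    try:
        with conn.cursor() as cur:
            # Sanitizing means NULLing a sensitive field for old rows, which
            # is only possible once the column is nullable.
            cur.execute(
                "UPDATE ExampleSchema_12345678 SET userAgent = NULL "
                "WHERE timestamp < %s", (cutoff,))
        conn.commit()
    finally:
        conn.close()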
Please let me know if you encounter any issue with this change.
Thanks in advance!
Luca
Hi all,
*Who: *This mostly applies to people who have access to the stat1002 and
stat1003 statistics machines on the production cluster, and publish
datasets as static files.
*What:* We are no longer using datasets.wikimedia.org to serve static
datasets. We have set up a redirect, so requests like
https://datasets.wikimedia.org/$1 will be sent to
https://analytics.wikimedia.org/datasets/archive/$1. Most importantly,
publishing datasets is now much easier. Any files you put in
published-datasets on either machine:
stat1002:/a/published-datasets
stat1003:/srv/published-datasets
are going to be merged together and served at:
https://analytics.wikimedia.org/datasets/
One request as we all enjoy this much simpler process: let's use README
files in these directories to let future versions of us know what the
datasets are all about. That will make the repository more fun for others
to browse and ease future cleanups. Thank you!
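For example, publishing a small file from stat1003 can be as simple as the
sketch below (the subdirectory name and contents are made up; the path is
the one listed above):

    import csv
    import os

    base = '/srv/published-datasets/example-dataset'  # hypothetical subdirectory
    os.makedirs(base, exist_ok=True)

    # Write the dataset itself.
    with open(os.path.join(base, 'daily_counts.csv'), 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['date', 'count'])
        writer.writerow(['2017-06-01', 12345])

    # And a README so future readers know what it is and who to ask about it.
    with open(os.path.join(base, 'README'), 'w') as f:
        f.write('Example daily counts. Produced by <your job>; contact <you>.\n')

    # The files should then show up under
    # https://analytics.wikimedia.org/datasets/ once synced.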
*TODO*
If something of yours got lost, let us know; we have backups. If you had
stuff that we might have cleaned up, we put it in
/srv/otto-to-delete-datasets-cleanup and
/a/otto-to-delete-datasets-cleanup. Take a look there and move files into
published-datasets as you see fit.
*Context*
For a long time, publishing files from stat1002 and stat1003 was quite
painful. There were three folders, some on both boxes, some only on one
box, symlinks, rsyncs, it was bad. We talked to everyone who had files in
these folders and gathered consensus for this deprecation. If this message
catches you by surprise, please let us know what channel we should reach
you in next time and we'll add it to our communication plan.
This work is tracked in T159409 <https://phabricator.wikimedia.org/T159409>.
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, June 21,
2017, at 11:30 AM PDT / 18:30 UTC.
YouTube stream: https://www.youtube.com/watch?v=i2jpKRwPT-Q
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#June_2017>.
This month's presentations:
Title: Problematizing and Addressing the Article-as-Concept Assumption in
Wikipedia
By *Allen Yilun Lin*
Abstract: Wikipedia-based studies and systems frequently assume that each
article describes a separate concept. However, in this paper, we show that
this article-as-concept assumption is problematic due to editors’ tendency
to split articles into parent articles and sub-articles when articles get
too long for readers (e.g. “United States” and “American literature” in the
English Wikipedia). In this paper, we present evidence that this issue can
have significant impacts on Wikipedia-based studies and systems and
introduce the sub-article matching problem. The goal of the sub-article
matching problem is to automatically connect sub-articles to parent
articles to help Wikipedia-based studies and systems retrieve complete
information about a concept. We then describe the first system to address
the sub-article matching problem. We show that, using a diverse feature set
and standard machine learning techniques, our system can achieve good
performance on most of our ground truth datasets, significantly
outperforming baseline approaches.
Title: Understanding Wikidata Queries
By *Markus Kroetzsch*
Abstract: Wikimedia provides a public service that lets anyone answer
complex questions over the sum of all knowledge stored in Wikidata. These
questions are expressed in the query language SPARQL and range from the
most simple fact retrievals ("What is the birthday of Douglas Adams?") to
complex analytical queries ("Average lifespan of people by occupation").
The talk presents ongoing efforts to analyse the server logs of the
millions of queries that are answered each month. It is an important but
difficult challenge to draw meaningful conclusions from this dataset. One
might hope to learn relevant information about the usage of the service and
Wikidata in general, but at the same time one has to be careful not to be
misled by the data. Indeed, the dataset turned out to be highly
heterogeneous and unpredictable, with strongly varying usage patterns that
make it difficult to draw conclusions about "normal" usage. The talk will
give a status report, present preliminary results, and discuss possible
next steps.
--
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation
srodlund(a)wikimedia.org