We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
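The first two use cases above can be sketched in a few lines of Python. The rows and the (referer, article, count) field layout below are illustrative stand-ins, not the dataset's actual column order or headers:

```python
from collections import defaultdict

# Hypothetical sample of (referer, article, count) rows in the spirit of
# the clickstream dataset; these values are made up for illustration.
rows = [
    ("other-google", "London", 1000),
    ("Main_Page", "London", 300),
    ("United_Kingdom", "London", 250),
    ("London", "United_Kingdom", 180),
]

def top_referers(rows, article):
    """Most common referers that led readers to a given article."""
    counts = defaultdict(int)
    for referer, art, n in rows:
        if art == article:
            counts[referer] += n
    return sorted(counts.items(), key=lambda kv: -kv[1])

print(top_referers(rows, "London"))
```

The same grouping, run with articles as keys and referers as values, gives the most frequent links clicked *from* a given article.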
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi Analytics,
On ENWP, does the number of 26,163,773 users include IPs who have made
edits? Does it include editors on all Wikimedia projects or just those who
have registered and/or edited on ENWP?
Thanks,
Pine
Hello,
I work for a consulting firm called Strategy&. We have been engaged by Facebook, on behalf of Internet.org, to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content, which we define as 100K+ Wikipedia articles in one's primary language. We have a few questions related to this analysis prior to publishing it:
* We are currently using the article count by language from the Wikimedia Foundation's public page: http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article counts, and does it include stubs?
* Is it possible to get historical data for article counts? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to the ISO 639-3 language codes used in Ethnologue (the source we are using for primary-language data). The two-letter code for a Wikipedia language in the "List of Wikipedias" sometimes, but not always, matches the ISO 639-1 code. Is there an easy way to do the mapping?
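One way such a mapping is often sketched: a standard ISO 639-1 → 639-3 table plus a hand-curated override list for Wikipedia codes that deviate from ISO. The tables below are tiny illustrative fragments, not a complete or authoritative mapping:

```python
# Fragment of the standard ISO 639-1 -> ISO 639-3 correspondence.
ISO1_TO_ISO3 = {"en": "eng", "de": "deu", "fr": "fra", "sq": "sqi"}

# Wikipedia codes that do not follow ISO: als.wikipedia.org is Alemannic
# (ISO 639-3 gsw), whereas "als" in ISO 639-3 is Tosk Albanian; the
# Simple English Wikipedia maps to English.
WIKI_OVERRIDES = {"als": "gsw", "simple": "eng"}

def wiki_to_iso3(code):
    """Map a Wikipedia language code to ISO 639-3, overrides first."""
    if code in WIKI_OVERRIDES:
        return WIKI_OVERRIDES[code]
    return ISO1_TO_ISO3.get(code)

print(wiki_to_iso3("als"), wiki_to_iso3("en"))
```

The override table is the part that needs curation; a plain 639-1 lookup silently mislabels the exceptional wikis.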
Many Thanks,
Rawia
Formerly Booz & Company
Rawia Abdel Samad
Direct: +9611985655 | Mobile: +97455153807
Email: Rawia.AbdelSamad(a)strategyand.pwc.com
www.strategyand.com
You may have heard about the in-progress work on the Code of Conduct for
Wikimedia technical spaces
(https://www.mediawiki.org/wiki/Code_of_conduct_for_technical_spaces/Draft).
It is currently in draft form, and we are in the process of finalizing
the intro, "Principles", "Expected behavior" and "Unacceptable behavior"
sections.
An earlier version of these sections (except for "Expected behavior")
reached consensus.
However, there is now a new draft, and you can weigh in on whether to
use it instead:
https://www.mediawiki.org/wiki/Talk:Code_of_conduct_for_technical_spaces/Dr…
.
I will continue to ask for your feedback as we discuss the remaining
sections later.
Thanks,
Matt Flaschen
Last week we started up a new AB test[1] comparing the existing completion
suggestions against a new completion suggestion API. This very simply puts
1 in 10000 users into the test bucket, and a further 1 in 10000 users into
the control bucket like so:
    function oneIn( populationSize ) {
        return Math.floor( Math.random() * populationSize ) === 0;
    }

    if ( oneIn( 10000 ) ) {
        // test bucket
    } else if ( oneIn( 10000 ) ) {
        // control bucket
    } else {
        return; // rejected
    }
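One small aside: because the control check only runs when the test draw fails, the effective control probability is marginally below 1 in 10,000. A quick check (Python here, purely for illustration):

```python
# Effective bucket probabilities implied by the sequential checks above:
# the control branch is only evaluated when the test draw fails.
N = 10000
p_test = 1 / N
p_control = (1 - 1 / N) * (1 / N)
p_rejected = 1 - p_test - p_control

print(p_test, p_control, p_rejected)
```

The difference is negligible for analysis purposes, but worth stating so the sampling rates are not assumed to be exactly equal.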
On every page load we generate a random 64-bit number via
`mw.user.generateRandomSessionId()`. This is used to correlate events
performed by the same user on the same page, and is logged with all our
events as event_pageId. In an older test (turned off September 3rd) using
this same event_pageId scheme, roughly 0.3% of event_pageId values came
from multiple IP addresses, which seems sane and normal:
mysql:research@analytics-store.eqiad.wmnet [log]> select count, count(count)
    from (select count(distinct clientIp) as count
          from TestSearchSatisfaction_12423691
          group by event_pageId) x
    group by count;
+-------+--------------+
| count | count(count) |
+-------+--------------+
|     1 |       411104 |
|     2 |         1500 |
+-------+--------------+
2 rows in set (3.11 sec)
On the test we just started, though, we are seeing 48% of event_pageId
values being reported by multiple IP addresses. We can't find any way to
explain why this has changed so much, and as such are uncertain whether we
can rely on the other data collected by this test.
mysql:research@analytics-store.eqiad.wmnet [log]> select count, count(count)
    from (select count(distinct clientIp) as count
          from CompletionSuggestions_13424343
          group by event_pageId) x
    group by count;
+-------+--------------+
| count | count(count) |
+-------+--------------+
|     1 |         1176 |
|     2 |          243 |
|     3 |          254 |
|     4 |          212 |
|     5 |          143 |
|     6 |          102 |
|     7 |           64 |
|     8 |           36 |
|     9 |           16 |
|    10 |           14 |
|    11 |            8 |
|    12 |            5 |
+-------+--------------+
12 rows in set (0.03 sec)
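For what it's worth, genuine collisions of a random 64-bit id can't plausibly explain this. A rough birthday-problem estimate (assuming the id really is 64 uniform random bits, and using the standard approximation):

```python
import math

def collision_prob(n, bits=64):
    """Approximate probability of at least one collision among n
    uniformly random `bits`-bit identifiers (birthday approximation):
    P ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    return 1 - math.exp(-n * n / (2 * 2 ** bits))

# Even a million ids gives a vanishing collision probability, so real
# id collisions cannot account for 48% of values repeating.
print(collision_prob(1_000_000))
```

That points at something systematic (shared state, caching, or a code change) rather than random chance.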
We have a third schema in production that has been collecting events the
entire time. It seems to have started showing this issue on September 10th,
which lines up with a Thursday train deployment:
mysql:research@analytics-store.eqiad.wmnet [log]> select date, MAX(count)
    from (select substr(timestamp, 1, 8) as date,
                 count(distinct clientIp) as count
          from TestSearchSatisfaction2_13223897
          group by substr(timestamp, 1, 8), event_pageId) x
    group by date;
+----------+------------+
| date     | MAX(count) |
+----------+------------+
| 20150902 |          1 |
| 20150903 |          2 |
| 20150904 |          2 |
| 20150905 |          4 |
| 20150906 |          3 |
| 20150907 |          3 |
| 20150908 |          3 |
| 20150909 |          3 |
| 20150910 |         11 |
| 20150911 |         12 |
| 20150912 |         14 |
| 20150913 |         18 |
| 20150914 |         13 |
+----------+------------+
13 rows in set (1.74 sec)
Does anyone have any ideas for where this change could have come from?
[1]
https://gerrit.wikimedia.org/r/#/c/236937/1/modules/ext.wikimediaEvents.sea…
Hi,
I need to reboot stat1001, stat1002, stat1003 to update the running Linux
kernels on these hosts.
I'm planning to start the reboots tomorrow, 30th September, at 13:00
UTC (6am Pacific time).
If that is a bad time (e.g. because you have long-running or crucial
scripts running on one of them), please get in touch with me and we can
move it to another time.
Cheers,
Moritz
From the intertubes:
@tlipcon: Super excited to finally talk about what I've been working on the
last 3 years: Kudu! http://t.co/1W4sqFBcyH http://t.co/1mZCwgdOO5
Might be useful for the MediaWiki tables.
-Toby
Hi WikiMedia Analytics,
I'm a student who has been doing work with the pagecount files from
Wikimedia.
During the last few days, it looks like the latest pagecount files are
being published more slowly than before.
Usually, when I go to the following link:
http://dumps.wikimedia.org/other/pagecounts-all-sites/2015/2015-09/
the file for a given hour would appear within an hour or so afterwards.
Is this still going to be the case? It does not seem to hold for
9/16 and 9/17.
Also, pagecounts-20150916-090000.gz
<http://dumps.wikimedia.org/other/pagecounts-all-sites/2015/2015-09/pagecoun…>
does not seem to be the correct size.
Thanks,
Tony Ho
Hi Analytics,
your input on the analytics/wikistats task at
https://phabricator.wikimedia.org/T113695
is welcome, to help find the best way to move forward in the next few weeks.
Who could try to tackle this?
Thanks in advance for your help!
andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/