Analytics April 2016

analytics@lists.wikimedia.org

34 participants
26 discussions

Query Improvement Question
by Justin Clark 26 Apr '16

26 Apr '16

Hi all, I'm a researcher at the Berkman Center for Internet & Society at Harvard doing some work on anomaly detection against Wikipedia article request volumes. I'd like to create time series of request volumes for as many article-country pairs as is possible. I'm using a number of different data sets, but the most useful for our purposes is the pageview_hourly table. I understand that this is a tremendous amount of data, and we are in the process of prioritizing the article-country pairs, but my question is: what is the best/fastest way to query this data from hive? Writing a query that gets at the data is not a problem, but I'm curious about possible strategies that could speed up the process. Here is a reference query that shows the kind of data I'm looking for: SELECT view_count FROM pageview_hourly WHERE year = 2015 AND month = 1 AND page_title = 'World_War_II' AND country_code = 'CA' AND agent_type = 'user' ORDER BY day, hour; A couple options that come to mind: * a year > 0 query vs many yearly, monthly, daily, or hourly queries * batching articles with page_title IN (...) * dropping country_code to get all countries at once (or batch like above) * ordering posthoc to avoid the map-reduce overhead Because there's so much data and a query like the above takes ~10 minutes, experimenting with these is a long process. I was hoping someone more familiar could share any magic that might speed things up (or tell me there's no magic bullet and everything will take about as long as everything else). If no one can say quickly off the top of their head, I can just do that experimentation, but more options to try are totally welcome. Thanks, Justin

3 4

usage of mobile clients
by Amir E. Aharoni 25 Apr '16

25 Apr '16

Hi, Do we have a more detailed view of % of clients accessing Wikimedia sites? What I can currently find is: https://browser-reports.wmflabs.org/#all-sites-by-browser It doesn't give me a detailed enough answer to the question: "Which mobile browsers do people use to read Wikipedia". The built-in browser of an Android-based phone from 2012 is quite different from the one 2016, but if I read that chart correctly, it shows both as just "Android". -- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com ‪“We're living in pieces, I want to live in peace.” – T. Moore‬

3 2

Re: [Analytics] Query Improvement Question
by Justin Clark 25 Apr '16

25 Apr '16

Hi, Admin stuff: We are very aware of the sensitivity of the data. We've got a group working on improving privacy protection in data sharing (http://privacytools.seas.harvard.edu/), which is a tacit acknowledgement that data sharing as it exists often doesn't protect privacy. With that said, we've designed our workflow so that all the analysis happens on Wikimedia's servers, and everyone that needs access to that analysis right now has access to stat1002, so for the foreseeable future everything is staying at Wikimedia. I think it's a bit premature to talk about formats for publishing because we don't know what our findings are going to look like and many of these sensitivities depend on the social, political, numerical, etc. contexts. Once we reach that stage we'll absolutely make sure everyone is comfortable with what we'd be publishing. Tech stuff: This is incredibly helpful - thank you both very much. This is exactly what I was looking for, and way easier than starting from scratch. - Justin

2 1

Hive & Oozie downtime tomorrow
by Andrew Otto 20 Apr '16

20 Apr '16

Hi all! As part of https://phabricator.wikimedia.org/T130840, we need to schedule a short downtime for Hive and Oozie. I would like to proceed with this tomorrow if there are no objections. I’d like to schedule this downtime for an hour starting at 14:45 UTC (10:45 EST, 07:45 PST) Wednesday April 20th. The downtime will likely be less than an hour, but I’m blocking out an hour just in case. Sorry for the short notice! If this is going to cause anyone trouble let us know and we will reschedule with more notice. Thanks all! -Andrew

1 2

Problem in Pageviews Files
by hafez 20 Apr '16

20 Apr '16

Hi, thanks for the great work. I have a question about data files available at https://dumps.wikimedia.org/other/pageviews/ As I understood, in each line, the third column is number of pageviews and the fourth one is its total traffic. But the fourth column is always zero! Did I make a mistake in understanding the meaning of these numbers? For example in line 346 of file https://dumps.wikimedia.org/other/pageviews/2016/2016-04/projectviews-20160… we see "en - 4271631 0" which means there were 4271631 views for that project, with zero traffic! Thanks in advance, Hassan

2 1

Unique Devices data available on API
by Nuria Ruiz 19 Apr '16

19 Apr '16

Hello! The analytics team is happy to announce that the Unique Devices data is now available to be queried programmatically via an API. This means that getting the daily number of unique devices [1] for English Wikipedia for the month of February 2016, for all sites (desktop and mobile) is as easy as launching this query: https://wikimedia.org/api/rest_v1/metrics/unique-devices/en.wikipedia.org/a… You can get started by taking a look at our docs: https://wikitech.wikimedia.org/wiki/Analytics/Unique_Devices#Quick_Start If you are not familiar with the Unique Devices data the main thing you need to know is that is a good proxy metric to measure Unique Users, more info below. Since 2009, the Wikimedia Foundation used comScore to report data about unique web visitors. In January 2016, however, we decided to stop reporting comScore numbers [2] because of certain limitations in the methodology, these limitations translated into misreported mobile usage. We are now ready to replace comscore numbers with the Unique Devices Dataset . While unique devices does not equal unique visitors, it is a good proxy for that metric, meaning that a major increase in the number of unique devices is likely to come from an increase in distinct users. We understand that counting uniques raises fairly big privacy concerns and we use a very private conscious way to count unique devices, it does not include any cookie by which your browser history can be tracked [3]. [1] https://meta.wikimedia.org/wiki/Research:Unique_Devices [2] [https://meta.wikimedia.org/wiki/ComScore/Announcement [3] https://meta.wikimedia.org/wiki/Research:Unique_Devices#How_do_we_count_uni… devices.3F

5 7

stat1002, stat1003
by Jon Katz 18 Apr '16

18 Apr '16

Hey Folks, I am suddenly having trouble logging into stat1002 or stat1003 using my normal settings. Here is what I get: *Jon-Katzs-Air:~ jkatz$ ssh jkatz(a)stat1003.eqiad.wmnet* *@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@* *@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @* *@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@* *IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!* *Someone could be eavesdropping on you right now (man-in-the-middle attack)!* *It is also possible that a host key has just been changed.* *The fingerprint for the RSA key sent by the remote host is* *[REDACTED BY JK].* *Please contact your system administrator.* *Add correct host key in /Users/jkatz/.ssh/known_hosts to get rid of this message.* *Offending RSA key in /Users/jkatz/.ssh/known_hosts:1* *RSA host key for bast1001.wikimedia.org <http://bast1001.wikimedia.org> has changed and you have requested strict checking.* *Host key verification failed.* *ssh_exchange_identification: Connection closed by remote host* Here is my host.config file: *ForwardAgent no* *Host !bast1001.wikimedia.org <http://bast1001.wikimedia.org> *.wikimedia.org <http://wikimedia.org> *.wmnet* * ProxyCommand ssh -a -W %h:%p bast1001.wikimedia.org <http://bast1001.wikimedia.org>* Is someone doing something nasty? Any ideas? Thanks, J

3 3

Reports on Views and Edits per Country
by Andre Klapper 18 Apr '16

18 Apr '16

http://stats.wikimedia.org/wikimedia/squids/SquidReportsCountriesLanguagesV… says it's "Discontinued since June 2015" but does not tell me where to find recent view/edit reports per country. Anybody knows if recent data exists and where it's available? Thanks in advance!, andre (PS: I also filed this as T131596 in Phabricator two weeks ago) -- Andre Klapper | Wikimedia Bugwrangler http://blogs.gnome.org/aklapper/

3 3

Student
by Nima Dashtban 15 Apr '16

15 Apr '16

Hi there, Hope my email finds you well. I need to know how can I access to coordinates of specific geographical places within this log: https://dumps.wikimedia.org/other/pagecounts-raw/ Thanks in advance, Nima Dashtban

3 3

Demographics survey
by Peter Ekman 14 Apr '16

14 Apr '16

Hi, Jimbo, on his talk page, said I should ask somebody at the WMF about what's been going on and what might be happening soon re: Wikipedia demographic surveys Please see, https://en.wikipedia.org/wiki/User_talk:Jimbo_Wales#Updated_demographic_sta… My feelings are: *This is an important problem that should be easily dealt with and can be solved by a commitment from the top brass, and *we need to take a spectacularly simple approach to just get it done. Any feedback would be appreciated via my e-mail or even on Jimbo's talk. Sincerely, Pete Ekman aka User:Smallbones

6 12

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

Analytics April 2016