The Wikimedia Developer Summit starts this Monday, Jan. 4!
There will be an information and discussion session about the
in-progress Code of Conduct for technical spaces
(https://www.mediawiki.org/wiki/Code_of_Conduct/Draft) on Monday.
Thanks,
Matt Flaschen
Dear wiki analytics team
I am looking at your pagecounts as archived on
https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-12/
Can you tell me what timezone the timestamps are in?
The filename pagecounts-20151224-070000 indicates pagecounts from 7 AM to
8 AM. In what geographical timezone is that 7 AM to 8 AM period:
GMT, or UTC plus or minus how many hours?
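For concreteness, this is roughly how I am parsing these filenames at the
moment, assuming (as I suspect, but would like confirmed) that the
timestamps are UTC; the helper is my own, not part of any Wikimedia tooling:

from datetime import datetime, timezone

def parse_pagecounts_timestamp(filename):
    # e.g. "pagecounts-20151224-070000" -> 2015-12-24 07:00:00+00:00
    stamp = filename.split('pagecounts-')[1]
    dt = datetime.strptime(stamp, '%Y%m%d-%H%M%S')
    # Assumption to be confirmed: dump timestamps are UTC.
    return dt.replace(tzinfo=timezone.utc)

print(parse_pagecounts_timestamp('pagecounts-20151224-070000'))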
thanks very much
Maurice Vergeer
--
________________________________________________
Maurice Vergeer
To contact me, see http://mauricevergeer.nl/node/5
To see my publications, see http://mauricevergeer.nl/node/1
________________________________________________
Team:
As part of our effort to convert the EventLogging MySQL database to the
TokuDB engine, we need to stop EventLogging events from flowing into the
MobileWikiAppShareAFact table. We are using this one table to gauge how
long the conversion will take, in order to plan for a larger outage window.
Let us know if the data should be backfilled, as it can be; we anticipate
events will not flow into the table for the better part of a day.
Thanks,
Nuria
Dear All:
Maybe this question is a bit too simple, but I did not immediately
find the answer in the docs.
How does the API differentiate between the two agent types, spider and bot?
I'm asking because for some articles there seems to be no bot traffic at
all, including the Main Page in August:
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedi…
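For reference, this is roughly how I am querying each agent type (the
article and dates are just an example; the endpoint layout follows the
API docs):

import requests

URL = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/'
       'en.wikipedia/all-access/{agent}/Main_Page/daily/20150801/20150831')

for agent in ('spider', 'bot'):
    resp = requests.get(URL.format(agent=agent))
    # A 200 with an "items" list means data exists for that agent type;
    # a 404 appears to mean no rows at all for the period.
    print(agent, resp.status_code)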
---
Another, unrelated question:
As I recall, I read somewhere that the data available via the API
dates back to May 2015. However, when running queries today,
the API only returned data starting on August 1, 2015. Is that correct?
Best,
Felix
Hi Dan,
The aim of our project is to determine whether we can establish a prediction
technique for high-impact (not high-risk) species before they enter Australia
and New Zealand. We are using data for species across 18 industries that have
already entered and are classified as high- or low-impact (at this stage
excluding moderate-impact species). Monthly pageviews going back further than
May 2015 would be useful (or even daily pageviews, but monthly would suffice).
I should be able to use this response for my analysis. Pageview data may only
show us high-risk pest species, but it is well worth investigating for us.
Unfortunately I'm not familiar with the methods used to access the older data,
but the links you have all sent me will be useful, and if I decide I need more
data, I can try those methods myself.
Thank you all for your help! I should be okay from here.
Cheers,
Caitlin
-----Original Message-----
From: Analytics [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of analytics-request(a)lists.wikimedia.org
Sent: Saturday, 19 December 2015 5:46 AM
To: analytics(a)lists.wikimedia.org
Subject: Analytics Digest, Vol 46, Issue 38
Send Analytics mailing list submissions to
analytics(a)lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.wikimedia.org/mailman/listinfo/analytics
or, via email, send a message with subject or body 'help' to
analytics-request(a)lists.wikimedia.org
You can reach the person managing the list at
analytics-owner(a)lists.wikimedia.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of Analytics digest..."
Today's Topics:
1. Re: Data collection (Dan Andreescu)
----------------------------------------------------------------------
Message: 1
Date: Tue, 15 Dec 2015 08:50:11 -0500
From: Dan Andreescu <dandreescu(a)wikimedia.org>
To: "A mailing list for the Analytics Team at WMF and everybody who
has an interest in Wikipedia and analytics."
<analytics(a)lists.wikimedia.org>
Subject: Re: [Analytics] Data collection
Message-ID:
<CA+aepCRqZ9YwHPCdo-1F-2uF-Vy3OiBZj=PjCD-MturFy0qyVA(a)mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi Caitlin,
Using the Python client for the Pageview API
(https://github.com/mediawiki-utilities/python-mwviews), you could do:

from mwviews.api import PageviewsClient

p = PageviewsClient()
p.article_views(
    'en.wikipedia',
    ['Abacarus_hystrix', 'Acarus_siro', 'Aceria_tosichella',
     'Acyrthosiphon_pisum', 'Ahasverus_advena', 'Anthrenus_flavipes',
     'Aphis_craccivora', 'Arhopalus', 'Balaustium_medicagoense',
     'Bemisia_tabaci', 'Brevicoryne_brassicae', 'Bruchus',
     'Ceratitis_capitata', 'Cicadulina', 'Cryptolestes',
     'Daktulosphaira_vitifoliae', 'Delia', 'Ephestia_elutella',
     'Ephestia_kuehniella', 'Etiella_behrii', 'Frankliniella_occidentalis',
     'Frankliniella', 'Henosepilachna_vigintioctopunctata',
     'Heteronychus_arator', 'Lachesilla_quercus', 'Lasioderma_serricorne',
     'Liposcelis_bostrychophila', 'Macrosiphum_euphorbiae',
     'Marchalina_hellenica', 'Myzus_persicae', 'Naupactus',
     'Nezara_viridula', 'Oligonychus_ununguis', 'Oryzaephilus_surinamensis',
     'Panonychus_ulmi', 'Penthaleus', 'Pieris_rapae', 'Piezodorus',
     'Plodia_interpunctella', 'Plutella_xylostella', 'Rhopalosiphon',
     'rhopalosiphum_maidis', 'Rhopalosiphum_padi', 'Rhyzopertha_dominica',
     'Sirex_noctilio', 'Sitophilus_granarius', 'Sitophilus_oryzae',
     'Sitotroga_cerealella', 'Sminthurus_viridis', 'Spodoptera_exempta',
     'Stegobium_paniceum', 'Tetranychus', 'Thrips_palmi', 'Thrips',
     'Tribolium_castaneum', 'Tribolium_confusum', 'Trogoderma_granarium',
     'Trogoderma'],
    start='20150501')
Some of the articles in your list don't exist on en.wikipedia (like
'Frankliniella'), but for the ones that do, this returns the views as far
back as we have them. When we're done backfilling the API we'll have data
back to May 2015, but right now it only goes back to August. If you need
data from further back, you have to parse the dumps, as others have said.
I'm curious why you need the older data; it's interesting to us as we try
to figure out what else to expose through the API. Would monthly pageviews
be just as good?
I attached the result of that query here in JSON format.
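If monthly granularity would work, here is a quick sketch of rolling the
daily output up into months; it assumes article_views returns a mapping of
date to per-article counts (as the client's README describes), and the
helper itself is just an illustration, not part of the library:

from collections import defaultdict

def monthly_totals(daily):
    # daily: {datetime: {article: views-or-None}} from article_views
    totals = defaultdict(lambda: defaultdict(int))
    for day, counts in daily.items():
        for article, views in counts.items():
            totals[day.strftime('%Y-%m')][article] += views or 0
    return totals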
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pageviews.json
Type: application/json
Size: 460113 bytes
Desc: not available
URL: <https://lists.wikimedia.org/pipermail/analytics/attachments/20151215/3ee3dd…>
------------------------------
Subject: Digest Footer
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
------------------------------
End of Analytics Digest, Vol 46, Issue 38
*****************************************
Hi Analytics,
Yesterday, Dec 15, during the course of one hour (17:00 to 18:00 UTC), there
was an irrecoverable raw_webrequest data loss of ~30%: 25.6% (misc), 19.5%
(mobile), 19.1% (text), and 39.1% (upload). This represents around 1% of the
data for that day.
The loss was due to the enabling of IPsec, which encrypts varnishkafka
traffic between caches in remote datacenters and the Kafka brokers in
eqiad. For a period of about 40 minutes, no webrequest logs from
remote datacenters were successfully produced to Kafka.
Here's the outage note:
https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#Changes_and_k…
Sorry for the inconvenience.
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
Good afternoon,
I am a student at Berkeley and I am using the raw Wikipedia pagecounts data for a project.
I had one quick question: in the names of the files available for download, are you using London time?
Thank you
Pierre Alexis Panel
Sent from my iPhone
Dear all,
First and foremost, thanks for making the Wikimedia Pageviews API
available; your work is highly appreciated and super useful! As a
modest "thank you", I am happy to release the JavaScript client
library pageviews.js for Node.js and the browser to make working with
this API easy for JavaScript developers. Please find the code and all
instructions at [1]. The library adds some convenience functions
(getting batch pageviews and limiting the number of results) that were
inspired by Dan Andreescu's Python library [2] and is Promise-based:
===
var pageviews = require('pageviews');

// Getting pageviews for a single article
pageviews.getPerArticlePageviews({
  article: 'Berlin',
  project: 'en.wikipedia',
  start: '20151201',
  end: '20151202'
}).then(function(result) {
  console.log(JSON.stringify(result, null, 2));
}).catch(function(error) {
  console.log(error);
});

// Getting top-n items ranked by pageviews for multiple projects
pageviews.getTopPageviews({
  projects: ['en.wikipedia', 'de.wikipedia'], // Plural
  year: '2015',
  month: '12',
  day: '01',
  limit: 2 // Limit to the first n results
}).then(function(result) {
  console.log(JSON.stringify(result, null, 2));
}).catch(function(error) {
  console.log(error);
});
===
On a more technical note: trying to be a good citizen [3], the client
library sets an identifying User-Agent header in Node.js mode.
However, setting the corresponding X-User-Agent header (note the "X-")
from a browser context fails (XMLHttpRequest cannot override the
browser's intrinsic User-Agent header for security reasons) with the
error message "Request header field X-User-Agent is not allowed by
Access-Control-Allow-Headers in preflight response". Maybe you could
change your CORS settings to include X-User-Agent in your
Access-Control-Allow-Headers?
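For what it's worth, the preflight can be reproduced outside the browser;
this sketch (in Python just to keep it self-contained; the endpoint is only
an example from the docs) prints which request headers the API currently
allows:

import requests

resp = requests.options(
    'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/'
    'en.wikipedia/all-access/all-agents/Berlin/daily/20151201/20151202',
    headers={
        'Origin': 'https://example.org',
        'Access-Control-Request-Method': 'GET',
        'Access-Control-Request-Headers': 'X-User-Agent',
    })
# The CORS preflight response advertises the allowed request headers.
print(resp.headers.get('Access-Control-Allow-Headers'))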
Hope this is useful.
Thanks,
Tom
--
[1] pageviews.js: https://github.com/tomayac/pageviews.js
[2] python-mwviews: https://github.com/mediawiki-utilities/python-mwviews
[3] User-Agent requirement: https://wikimedia.org/api/rest_v1/?doc
--
Dr. Thomas Steiner, Employee (blog.tomayac.com, twitter.com/tomayac)
Google Germany GmbH, ABC-Str. 19, 20354 Hamburg
Geschäftsführer: Matthew Scott Sucherman, Paul Terence Manicle
Registergericht und -nummer: Hamburg, HRB 86891
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.29 (GNU/Linux)
iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck0fjumBl3DCharaCTersAttH3b0ttom.hTtP5://xKcd.c0m/1181/
-----END PGP SIGNATURE-----