Hi Analytics,
the Gerrit Cleanup Day on Wed 23rd is approaching fast - less than one
week left. More info: https://phabricator.wikimedia.org/T88531
Do you feel prepared for the day, and do all team members know what to
do?
If not, what are you missing and how can we help?
Some Gerrit queries for each team are listed under "Gerrit queries per
team/area" in https://phabricator.wikimedia.org/T88531
Are they helpful and a good start? Or do they miss some areas? Do you
have existing Gerrit team queries to use instead or to "integrate",
e.g. for parts of MediaWiki core that you might work on?
Also, which person will be the main team contact for the day (available
in #wikimedia-dev on IRC) and help organize review work in your areas,
so that other teams can easily reach out?
Some teams' plates are emptier than others, so they are wondering where
and how to lend a helping hand (and would like to find out in advance,
due to timezones).
Thanks for your help in making the Gerrit Cleanup Day a success!
andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/
Hi everyone. End of quarter is rapidly approaching and I wanted to ask a
quick question about one of the endpoints we want to push out. We want to
let you ask "what are the top articles" but we're not sure how to structure
the URL so it's most useful to you. Here are the choices:
Choice 1. /top/{project}/{access}/{days-in-the-past}
Example: top articles via all en.wikipedia sites for the past 30 days:
/top/en.wikipedia/all-access/30
Choice 2. /top/{project}/{access}/{start}/{end}
Example: top articles via all en.wikipedia sites from June 12th, 2014 to
August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30
(in both choices,
* {project} means en.wikipedia, commons.wikimedia, etc.
* {access} means access method as in desktop, mobile web, mobile app
)
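To make the two options concrete, here is a tiny client-side sketch of
how each style would be constructed (the base URL and helper names are
made up, purely for illustration):

    from datetime import date

    BASE = "https://example.org/metrics"  # hypothetical base URL

    def top_url_choice_1(project, access, days_in_past):
        # Choice 1: a relative window counting back from today
        return f"{BASE}/top/{project}/{access}/{days_in_past}"

    def top_url_choice_2(project, access, start, end):
        # Choice 2: an explicit, reproducible date range
        return f"{BASE}/top/{project}/{access}/{start:%Y-%m-%d}/{end:%Y-%m-%d}"

    print(top_url_choice_1("en.wikipedia", "all-access", 30))
    print(top_url_choice_2("en.wikipedia", "all-access",
                           date(2014, 6, 12), date(2015, 8, 30)))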
Which do you prefer? Would any other query style be useful?
Hello, lists.
You may have heard of projects <https://www.mediawiki.org/wiki/Wikimedia_Research#Highlights> such as revision scoring or article recommendations. We’re now looking for a full-stack software engineer to join Wikimedia Research and support and scale up these and similar projects.
Job description below; please help us find the best possible candidates.
Dario
Software Engineer - Research <http://grnh.se/b12qur>
Summary
Help us create a world in which every single human being can freely share in the sum of all knowledge.
We are a team of scientists and UX researchers at the Wikimedia Foundation using data to understand and empower millions of users – readers, contributors, and donors – who interact with Wikipedia and its sister projects on a daily basis. We turn research questions into publicly shared knowledge; we design and test new technology; we produce data-driven insights to support product and engineering decisions; and we publish research informing the organization’s and the movement’s strategy. We are strongly committed to the principles of transparency, privacy, and collaboration; we use free and open source technology; and we collaborate with researchers in industry and academia. As a full member of the Wikimedia Research department, you will help us build and scale the infrastructure our team needs for research and experimentation, implementing new technology and data-intensive applications.
Description
Collaborate with researchers to expose algorithms and machine learning systems through APIs and web applications.
Design, develop, test, and deploy new features, improvements and upgrades to the infrastructure that supports research and powers data-intensive applications.
Support our data science team in optimizing computationally intensive data processing.
Support our UX research capacity by growing, expanding and maintaining our user testing platform and instrumentation stack.
Work in coordination with other infrastructure teams such as Services and Analytics Engineering as well as Product teams to grow and scale research-driven services and applications.
Requirements
Real-world experience writing applications using both scripting languages (e.g. Python, JavaScript, PHP) and compiled languages (e.g. Java, Scala, C, C#)
Experience with MySQL/Postgres or similar database technology
Experience developing APIs for data retrieval
Understanding of basic statistical concepts
BS, MS, or PhD in Computer Science, Mathematics, or equivalent work experience
Pluses
Experience with high-traffic web architectures and operations
Production experience with Hadoop and ecosystem technology (Pig, Hive, streaming)
Experience with web UI design (JavaScript, HTML, CSS)
Familiarity with scientific computing libraries in Python and R
Experience working with volunteers
Big ups if you are a contributor to Wikipedia or other open collaboration projects
Show us your stuff! Please provide us with information you feel would be useful to us in gaining a better understanding of your technical background and accomplishments. Links to GitHub, your technical blogs, publications, personal projects, etc. are exceptionally useful. We especially appreciate pointers to your best contributions to open source projects.
About the Wikimedia Foundation
The Wikimedia Foundation is the non-profit organization that operates Wikipedia, the free encyclopedia. Wikipedia and the other projects operated by the Wikimedia Foundation receive more than 431 million unique visitors per month, making them the 5th most popular web property worldwide. Available in more than 287 languages, Wikipedia contains more than 32 million articles contributed by a global volunteer community of more than 100,000 people. Based in San Francisco, California, the Wikimedia Foundation is an audited, 501(c)(3) charity that is funded primarily through donations and grants. The Wikimedia Foundation was created in 2003 to manage the operation of Wikipedia and its sister projects. It currently employs over 208 staff members. Wikimedia is supported by local chapter organizations in 40 countries or regions.
The Wikimedia Foundation offers competitive benefits. Fully paid medical, dental, and vision coverage for employees and their eligible families (yes, fully paid premiums!). A Wellness Program which provides reimbursement for mind, body and soul activities such as fitness memberships, massages, cooking classes and much more. 401(k) retirement plan with matched contributions of 4% of annual salary.
More Information
http://wikimediafoundation.org
http://blog.wikimedia.org
http://wikimediafoundation.org/wiki/Vision
About Wikimedia Research
https://www.mediawiki.org/wiki/Wikimedia_Research
Examples of code
https://github.com/wiki-ai/revscoring
https://github.com/wiki-ai/ores
https://github.com/halfak/MediaWiki-Utilities
https://github.com/halfak/mwstreaming
Dario Taraborelli, Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
This app is really cool. I wonder if, besides future predictions, it
could be modified to support another use case: assessing the impact of
past events and software changes on our pageviews.
As many of us are aware, the Wikimedia movement has been struggling
for a long time to understand the effects of our work (and of outside
events) on our readership. And while WMF engineering teams are getting
better about doing, say, A/B tests, it's often not possible to provide
a controlled environment for such experiments.
There's an established statistical technique aimed at such situations,
called "intervention analysis"; see e.g. [1]. It requires modeling the
time series (here: monthly pageviews) with an ARIMA model, just as has
been done in the app. One then does a backdated forecast from the time
of the intervention, and uses the difference between that forecast and
the actual data to estimate the effect of the intervention. I've been
wondering recently if this has ever been used for Wikipedia pageviews;
yesterday, while attending Morten's research showcase talk about their
"misalignment" paper, I noticed that that paper has indeed applied it
(to views of individual articles, where it may be easier to isolate
effects).[2] Is anyone aware of other examples?
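For the curious, here is a rough sketch of what such a backdated
forecast could look like in Python with statsmodels (the data file,
intervention date, and ARIMA order below are all made up, just to show
the shape of the analysis):

    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Monthly pageviews indexed by month; file and column names are made up.
    views = pd.read_csv("monthly_pageviews.csv",
                        index_col="month", parse_dates=True)["views"]

    pre = views.loc[:"2015-05"]   # everything before the (made-up) intervention
    post = views.loc["2015-06":]  # everything from the intervention onward

    # Fit an ARIMA model on pre-intervention data only;
    # the (1, 1, 1) order is just a placeholder.
    fit = ARIMA(pre, order=(1, 1, 1)).fit()

    # Backdated forecast: what the model says would have happened
    # without the intervention.
    counterfactual = fit.forecast(steps=len(post))

    # The estimated effect of the intervention is the gap between the
    # actual data and the counterfactual forecast.
    effect = post.values - counterfactual.values
    print(f"Cumulative effect: {effect.sum():,.0f} pageviews")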
Would it be possible to modify the app to support such backdated
forecasts, as a first step, and also to calculate their difference
from the actual data?
[1] https://onlinecourses.science.psu.edu/stat510/node/76
[2] http://www-users.cs.umn.edu/~morten/publications/icwsm2015-popularity-quali…
(p.8)
On Tue, Sep 15, 2015 at 5:28 PM, Dario Taraborelli
<dtaraborelli(a)wikimedia.org> wrote:
>
> An updated version of a pageview forecasting application written by Ellery (Research & Data team) has just been released:
>
> https://ewulczyn.shinyapps.io/pageview_forecasting
> https://twitter.com/WikiResearch/status/643942154549592064
>
> The data is refreshed monthly and it includes breakdowns by country and platform.
>
> Dario
>
>
>
> Dario Taraborelli Head of Research, Wikimedia Foundation
> wikimediafoundation.org • nitens.org • @readermeter
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
This discussion is about needed updates to the definition and the Analytics
implementation of the mobile apps page view metrics. There is also an
associated Phab task[4]. Please add the proper Analytics project there.
Background / Changes
As you probably remember, the Android app splits a page view into two
requests: one for the lead section and metadata, plus another one for the
remainder.
The mobile apps are going to change the way they load pages in two
different ways:
1. We'll add a link preview when someone clicks on a link from a page.
2. We're planning on switching over to using RESTBase for loading pages
and also the link preview (initially just the Android beta, later more).
This will have implications for the pageviews definition and how we count
user engagement.
The big question is:
Should we count link previews as page views, since they are an indication
of user engagement? Or should there be a separate metric for link previews?
Counting page views
IIRC we currently count api.php requests with the query parameters
action=mobileview&sections=0 as a page view. When we release link previews
to all Android app users, we would want to either also count the calls to
action=query&prop=extracts as page views or add them to another metric.
Once the apps use RESTBase, the HTTPS requests will be very different:
- Page view: Instead of action=mobileview&sections=0, the app would call
the RESTBase lead-section endpoint[1] rather than the PHP API mentioned
above. Then it would call [2].
- Link preview: Instead of action=query&prop=extracts, it would call the
lead request[1], too, since there is a lot of overlap. At least that's our
current plan. The advantage is that the client doesn't need to execute
the lead request a second time if the user clicks on the link preview
(either through caching or app logic).
So, in the RESTBase case we want to count either the
mobile-html-sections-lead requests or the
mobile-html-sections-remaining requests,
depending on what our definition of page views actually is. We could also
add a query parameter or an extra HTTP header to the
mobile-html-sections-lead requests if we need to distinguish between
previews and page views.
Both the current PHP API-based and the RESTBase-based metrics would need
to be compatible and collected in parallel, since we cannot control when
users update their apps.
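To make the alternatives concrete, here is a rough sketch (not actual
Analytics code; the labels are made up) of how a request classifier
could tell these cases apart during the transition period:

    from urllib.parse import urlsplit, parse_qs

    def classify(uri):
        parts = urlsplit(uri)
        query = parse_qs(parts.query)

        # Current PHP API scheme
        if parts.path.endswith("/api.php"):
            if (query.get("action") == ["mobileview"]
                    and query.get("sections") == ["0"]):
                return "pageview"
            # prop may carry multiple pipe-separated values
            if (query.get("action") == ["query"]
                    and "extracts" in query.get("prop", [""])[0]):
                return "link-preview"

        # RESTBase scheme: a lead request alone is ambiguous, since it
        # would back both page views and link previews (hence the idea
        # of a distinguishing query parameter or header).
        if "/page/mobile-html-sections-lead/" in parts.path:
            return "pageview-or-preview"
        if "/page/mobile-html-sections-remaining/" in parts.path:
            return "pageview"

        return "other"

    print(classify("https://en.wikipedia.org/api/rest_v1/"
                   "page/mobile-html-sections-lead/Dilbert"))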
[1]
https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-lead/Dilbert
[2]
https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-remaining/Di…
[3]
https://www.mediawiki.org/wiki/Wikimedia_Apps/Team/RESTBase_services_for_ap…
[4] https://phabricator.wikimedia.org/T109383
Cheers,
Bernd
Hi Analytics listeners,
We are experiencing issues on the Hadoop cluster (more precisely, logs
don't flow from Kafka into HDFS; even more precisely, the Camus job seems
broken). The consequence is that data is late starting from yesterday,
2015-09-16, hour 21.
We are actively working on it and will let you know when it is solved.
--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
Hi,
My name is Diana. I am a developer at AOL.
We are partners of Wikipedia. Currently, we consume the pagecounts raw
files that you generate every day as an input to our Wikipedia
infrastructure.
http://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-09/
We have noticed that there are no new files since yesterday at 18:00, and
because of that we are having issues on our side.
I would like to know whether this issue has already been noticed on your
side and whether there is ongoing work to fix the generation of these
files.
Thanks in advance,
Diana
When we process Event Logging events, we hash the origin IP address and add
it to the event as part of the "capsule". We salt the hash function and
rotate the salt frequently for security, but within each salt period the
same IP would get hashed to the same value, and some people depended on
that.
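For context, the scheme is essentially this (a minimal sketch, not the
actual EventLogging code; the salt value and names are made up):

    import hashlib

    SALT = "rotated-every-period"  # hypothetical; the real salt and its rotation are internal

    def hash_client_ip(ip, salt=SALT):
        # Same IP + same salt -> same digest, so analysts can group
        # events by client within one salt period without ever seeing
        # the raw IP.
        return hashlib.sha256((salt + ip).encode("utf-8")).hexdigest()

    # The bug described below: each parallel processor instance ended up
    # with its own salt, so the same IP could map to different hashes
    # depending on which instance handled the event.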
We recently made the Event Logging processor parallel, but forgot to make
this hashing consistent across all the parallel instances.
So from September 10, 2015 until we fix the bug, client IPs will not be
hashed consistently.
We are tracking this issue here: https://phabricator.wikimedia.org/T112688
If you have some data crunching that's affected by this, come talk to us.
We are already adding a temporary fix to the scripts that generate the
edit-analysis dashboard [1].
[1] https://edit-analysis.wmflabs.org/compare/