Analytics February 2018

analytics@lists.wikimedia.org

31 participants
19 discussions

Beeline as Hive client
by Madhumitha Viswanathan 03 Oct '18

Hi all,

For all Hive users using stat1002/1004: you might have seen a deprecation warning when you launch the hive client, claiming that it's being replaced with Beeline. The Beeline shell has always been available to use, but it required supplying a database connection string every time, which was pretty annoying. We now have a wrapper script <https://github.com/wikimedia/operations-puppet/blob/production/modules/role…> set up to make this easier. The old Hive CLI will continue to exist, but we encourage moving over to Beeline. You can use it by logging into the stat1002/1004 boxes as usual and launching `beeline`. There is some documentation on this here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline

If you run into any issues using this interface, please ping us on the Analytics list or #wikimedia-analytics, or file a bug on Phabricator <http://phabricator.wikimedia.org/tag/analytics>.

(If you are wondering "stat1004 whaaat": there should be an announcement coming up about it soon!)

Best,
--Madhu :)

3 3

How best to accurately record page interactions in Page Previews
by Sam Smith 12 Apr '18

Hullo,

Page Previews is now fully deployed to all but 2 of the Wikipedias. In deploying it, we've created a new way to interact with pages without navigating to them. This impacts the overall and per-page pageviews metrics that are used in myriad reports, e.g. to editors about the readership of their articles and in monthly reports to the board. Consequently, we need to be able to report a user reading the preview of a page just like we do them navigating to it.

Readers Web are planning to instrument Page Previews such that when a preview is available and open for longer than X ms, a "page interaction" is recorded. We're aware of a couple of mechanisms for recording something like this from the client:

1. All files viewed with the media viewer are recorded by the client requesting the /beacon/media?duration=X&uri=Y URL at some point [0]; as Nuria points out in that thread, requests to /beacon/... are already filtered and a canned response is sent immediately by Varnish [1].
2. Requesting a URL with the X-Analytics header [2] set to "preview". In this context, we'd make a HEAD request to the URL of the page with the header set.

IMO #1 is preferable from the operations and performance perspectives as the response is always served from the edge and includes very few headers, whereas the request in #2 may be served by the application servers if the user is logged in (or in the mobile site's beta cohort). However, the requests in #2 are already

We're currently considering recording page interactions when previews are open for longer than 1000 ms. We estimate that this would increase overall web requests by 0.3% [3].

Are there other ways of recording this information? We're fairly confident that #1 seems like the best choice here but it's referred to as the "virtual file view hack". Is this really the case? Moreover, should we request a distinct URL, e.g. /beacon/preview?duration=X&uri=Y, or should we consolidate the URLs as both represent the same thing essentially?

Thanks,
-Sam

Timezone: GMT
IRC (Freenode): phuedx

[0] https://lists.wikimedia.org/pipermail/analytics/2015-March/003633.html
[1] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb;1bce79d58e03bd02888beef986c41989e8345037$269
[2] https://wikitech.wikimedia.org/wiki/X-Analytics
[3] https://phabricator.wikimedia.org/T184793#3901365
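
For illustration only, here is a minimal Python sketch of the decision described in option 1 above: a beacon URL is built only when a preview stayed open longer than the threshold. The function name `preview_beacon_url` and the constant are hypothetical, the `/beacon/preview` path is the one proposed in the message rather than an existing endpoint, and the real instrumentation would run in the browser, not in Python.

```python
from typing import Optional
from urllib.parse import urlencode

# Threshold under consideration in the message above (assumption: not final).
PREVIEW_THRESHOLD_MS = 1000


def preview_beacon_url(
    page_uri: str,
    open_ms: int,
    threshold_ms: int = PREVIEW_THRESHOLD_MS,
) -> Optional[str]:
    """Return the beacon URL to request, or None if the preview closed too soon.

    Hypothetical helper: the /beacon/preview path is the proposal from the
    thread, mirroring the existing /beacon/media?duration=X&uri=Y pattern.
    """
    # Only previews open for *longer than* the threshold count as interactions.
    if open_ms <= threshold_ms:
        return None
    # urlencode percent-encodes the page URI so it is safe as a query value.
    query = urlencode({"duration": open_ms, "uri": page_uri})
    return f"/beacon/preview?{query}"


if __name__ == "__main__":
    print(preview_beacon_url("/wiki/Cat", 1500))
    print(preview_beacon_url("/wiki/Cat", 400))
```

Keeping the threshold as a parameter makes it easy to try other values of X without touching the URL-building logic.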

8 43

Wikipedia internal search clickstream
by Georg Sorst 15 Mar '18

Hi list,

as part of a lecture on Information Retrieval I am giving, we work a lot with Simple Wikipedia articles. It's a great data set because it's comprehensive and not domain specific, so when building search on top of it humans can easily judge result quality, and it's still small enough to be handled by a regular computer.

This year I want to cover the topic of Machine Learning for search. The idea is to look at result clicks from an internal search engine, feed that into the Machine Learning and adjust search accordingly so that the top-clicked results actually rank best. We will be using Solr LTR for this purpose.

I would love to base this on Simple Wikipedia data since it would fit well into the rest of the lecture. Unfortunately, I could not find that data. The closest I came is https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream but this covers neither Simple Wikipedia nor internal search queries.

Did I miss something? Is this data available somewhere? Can I produce it myself from raw data? Ideally I would need (query, document) pairs with the number of occurrences.

Thank you!
Georg

-- 
Georg M. Sorst, CTO, FINDOLOGIC
E.: g.sorst(a)findologic.com
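
As a rough sketch of the data shape being asked for, the Python snippet below aggregates raw click events into (query, document) pairs with occurrence counts. The input format (an iterable of query/document tuples) and the function name are assumptions made for illustration; no such click log is described in this thread.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_query_document_pairs(clicks: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each (query, clicked document) pair occurs.

    Assumption: each click event is a (query, document_title) tuple.
    Queries are lower-cased and stripped so trivial variants are merged.
    """
    counts: Counter = Counter()
    for query, document in clicks:
        counts[(query.strip().lower(), document)] += 1
    return counts


if __name__ == "__main__":
    sample = [("Cat", "Cat"), ("cat ", "Cat"), ("cat", "Felidae")]
    for (query, document), n in count_query_document_pairs(sample).most_common():
        print(query, document, n)
```

The resulting counts could then serve as click-based relevance labels when training a learning-to-rank model.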

3 9

Migrated Reportcard with Updated Data
by Nuria Ruiz 12 Mar '18

Hello! The Analytics team would like to announce that we have migrated the reportcard to a new domain: https://analytics.wikimedia.org/dashboards/reportcard/#pageviews-july-2015-… The migrated reportcard includes both legacy and current pageview data, daily unique devices and new editors data. Pageview and devices data is updated daily but editor data is still updated ad-hoc. The team is working at this time on revamping the way we compute edit data and we hope to be able to provide monthly updates for the main edit metrics this quarter. Some of those will be visible in the reportcard but the new wikistats will have more detailed reports. You can follow the new wikistats project here: https://phabricator.wikimedia.org/T130256 Thanks, Nuria

4 6

PageView
by BTShasSTOLENmyHEART 28 Feb '18

Hello,

I recently spoke with "Next Big Sound", which is a company that tracks Wikipedia page views for certain artists. They informed me that they get the details of the views directly from Wikipedia (I had emailed them because the view counts shown on Wikipedia and on Next Big Sound show a major discrepancy). There are rumors flying about saying that the information gathered is only from desktop views, for which the counts are extremely similar. Is there any way you can confirm this as true? Or is there another counting method that is used for other companies that collect views? I know you have no idea of what Next Big Sound is presenting to its worldwide audience, but I wanted to know if you can explain what information is given to Next Big Sound in terms of data.

Thank you
Sincerely,
Angelina Zamora

1 0

Interesting (?) third party SEO study
by Jaime Crespo 26 Feb '18

I ran into this story about an SEO analysis of domains by chance; hopefully this is not off-topic here: https://blog.searchmetrics.com/us/2018/02/14/seo-world-rankings-2018/

Obviously I don't know how scientific the study is, but there seem to be two conclusions regarding Wikimedia: Brazil seems to be behind in Wikipedia adoption (or just Google SEO?), and wiktionary.org has seen big growth in the last year.

-- 
Jaime Crespo <http://wikimedia.org>

4 3

How to get old page views data?
by Lars Hillebrand 23 Feb '18

Dear Analytics Team,

I am an M.Sc. student at Copenhagen Business School. For my Master's thesis I would like to use page views data for certain Wikipedia articles. I found out that in July 2015 a new API was created which delivers this data. However, for my project I have to use data from before 2015. In my further search I found out that the old page views data exists (https://dumps.wikimedia.org/other/pagecounts-raw/) and that until March 2017 it could be queried using stats.grok.se. Unfortunately, this site no longer exists, and I cannot filter and query the raw data in .gz format on the webpage.

Is there any way to get the page views data for certain articles from before July 2015?

Thanks a lot and best regards,
Lars Hillebrand

PS: I am conducting my research in R, and for the post-2015 data the package “pageviews” works great.
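
One possible approach, sketched in Python under stated assumptions: download the hourly .gz files from the pagecounts-raw directory and filter them locally. The sketch assumes each line has four space-separated fields (project code, page title, request count, bytes transferred) and that `en` denotes English Wikipedia; the function name and the file name in the usage example are hypothetical. Titles in these files may be percent-encoded, so the titles passed in need to match the encoding used in the file.

```python
import gzip
from collections import defaultdict
from typing import Dict, Iterable


def count_views(gz_path: str, project: str, titles: Iterable[str]) -> Dict[str, int]:
    """Sum request counts for the given titles in one hourly pagecounts file.

    Assumed line format: "<project> <title> <count> <bytes>".
    Lines that do not have exactly four fields are skipped.
    """
    wanted = set(titles)
    totals: Dict[str, int] = defaultdict(int)
    # Open in text mode; replace undecodable bytes rather than failing.
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            fields = line.rstrip("\n").split(" ")
            if len(fields) != 4:
                continue
            line_project, title, count, _size = fields
            if line_project == project and title in wanted:
                totals[title] += int(count)
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical local file name; substitute a file you have downloaded.
    print(count_views("pagecounts-sample.gz", "en", ["Copenhagen"]))
```

Summing the per-file results over all hours of a day or month gives totals comparable to what a per-article query would return.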

7 12

Wikistats 2.0 - Now with Maps!
by Nuria Ruiz 22 Feb '18

Hello from the Analytics team,

Just a brief note to announce that Wikistats 2.0 now includes data about pageviews per project per country for the current month. Take a look at pageviews for Spanish Wikipedia this month: https://stats.wikimedia.org/v2/#/es.wikipedia.org/reading/pageviews-by-coun…

Data is also available programmatically via APIs: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Pageviews_split…

We will be deploying small UI tweaks during this week, but please explore and let us know what you think.

Thanks,
Nuria

3 3

Research Showcase Wednesday, February 21, 2018 [External]
by Sarah R 21 Feb '18

Hi Everyone,

The next Research Showcase will be live-streamed this Wednesday, February 21, 2018 at 11:30 AM (PST) 19:30 UTC.

YouTube stream: https://www.youtube.com/watch?v=fpmRWCE7F_I

As usual, you can join the conversation on IRC at #wikimedia-research. And, you can watch our past research showcases here <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase>.

This month's presentation:

*Visual enrichment of collaborative knowledge bases*
By Miriam Redi, Wikimedia Foundation

Images allow us to explain, enrich and complement knowledge without language barriers [1]. They can help illustrate the content of an item in a language-agnostic way to external data consumers. Images can be extremely helpful in multilingual collaborative knowledge bases such as Wikidata. However, a large proportion of Wikidata items lack images. More than 3.6M Wikidata items are about humans (Q5), but only 17% of them have an image associated with them. Only 2.2M of 40 Million Wikidata items have an image. A wider presence of images in such a rich, cross-lingual repository could enable a more complete representation of human knowledge. In this talk, we will discuss challenges and opportunities faced when using machine learning and computer vision tools for the visual enrichment of collaborative knowledge bases. We will share research to help Wikidata contributors make Wikidata more “visual” by recommending high-quality Commons images to Wikidata items. We will show the first results on free-licence image quality scoring and recommendation and discuss future work in this direction.

[1] Van Hook, Steven R. "Modes and models for transcending cultural differences in international classrooms." Journal of Research in International Education 10.1 (2011): 5-27. http://journals.sagepub.com/doi/abs/10.1177/1475240910395788

*Backlogs—backlogs everywhere: Using machine classification to clean up the new page backlog*
By Aaron Halfaker, Wikimedia Foundation

If there's one insight that I've had about the functioning of Wikipedia and other wiki-based online communities, it's that eventually self-directed work breaks down and some form of organization becomes important for task routing. In Wikipedia specifically, the notion of "backlogs" has become dominant. There are backlogs of articles to create, articles to clean up, articles to assess, new editor contributions to review, manual of style rules to apply, etc. To a community of people working on a backlog, the state of that backlog has deep effects on their emotional well-being. A backlog that only grows is frustrating and exhausting. Backlogs aren't inevitable though, and there are many shapes that backlogs can take. In my presentation, I'll tell a story about how English Wikipedia editors defined a process and set of roles that formed a backlog around new page creations. I'll make the argument that this formalization of quality control practices has created a choke point and that alternatives exist. Finally, I'll present a vision for such an alternative using models that we have developed for ORES, the open machine prediction service my team maintains.

-- 
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation
srodlund(a)wikimedia.org

1 2

Pageview dumps lagging behind
by Spinner Cat 20 Feb '18

Hi all, Noticed that we're not getting any new pageview dumps on https://dumps.wikimedia.org/other/pageviews/2018/2018-02/ since Feb 9th 17:08 UTC. Is this a known issue and when might we expect it to be resolved and files to catch up again? Thanks!

4 4
