Analytics March 2018

analytics@lists.wikimedia.org

24 participants
19 discussions

Beeline as Hive client
by Madhumitha Viswanathan 03 Oct '18

Hi all,

For all Hive users on stat1002/1004: you might have seen a deprecation warning when you launch the hive client, saying that it is being replaced by Beeline. The Beeline shell has always been available, but it required supplying a database connection string every time, which was pretty annoying. We now have a wrapper script <https://github.com/wikimedia/operations-puppet/blob/production/modules/role…> set up to make this easier.

The old Hive CLI will continue to exist, but we encourage moving over to Beeline. You can use it by logging into the stat1002/1004 boxes as usual and launching `beeline`. There is some documentation on this here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.

If you run into any issues using this interface, please ping us on the Analytics list or in #wikimedia-analytics, or file a bug on Phabricator <http://phabricator.wikimedia.org/tag/analytics>.

(If you are wondering "stat1004, whaaat?", there should be an announcement about it coming up soon!)

Best,
--Madhu :)
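For readers curious what "supplying a database connection string every time" looks like, here is a minimal sketch of the kind of command line a wrapper could assemble. It relies only on Beeline's `-u` option, which takes a JDBC URL. The function name, host name, and database are hypothetical placeholders, and the real wrapper script may pass further options.

```python
from typing import List


def build_beeline_command(host: str, port: int, database: str) -> List[str]:
    """Return the argv for launching Beeline against a HiveServer2 instance."""
    # Beeline's -u option takes a JDBC URL: jdbc:hive2://<host>:<port>/<database>
    jdbc_url = f"jdbc:hive2://{host}:{port}/{database}"
    return ["beeline", "-u", jdbc_url]


# Hypothetical values for illustration only; 10000 is the usual HiveServer2 port.
command = build_beeline_command("hive-server.example.org", 10000, "default")
print(" ".join(command))
```

With a wrapper in place, users only type `beeline` and the connection details are filled in for them.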

Re: [Analytics] Monitor the number of Wikipedia sites and the number of articles in each site
by Dan Andreescu 01 Aug '18

Forwarding this question to the public Analytics list, where it's good to have these kinds of discussions. If you're interested in this data and how it changes over time, do subscribe and watch for updates, notices of outages, etc.

Ok, so on to your question. You'd like the *total # of articles for each wiki*. I think the simplest way right now is to query the AQS (Analytics Query Service) API, documented here: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2

To get the # of articles for a wiki, let's say en.wikipedia.org, you can get the timeseries of new articles per month since the beginning of time:

https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900

And to get a list of all wikis, to plug into that URL instead of "en.wikipedia.org", the most up-to-date information is here: https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form, or via the MediaWiki API: https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&form…. Sometimes new sites won't have data in the AQS API for a month or two, until we add them and start crunching their stats.

The way I figured this out is to look at how our UI uses the API: https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/new-pages. So if you are interested in something else, you can browse around there and take a look at the XHR requests in the browser console. Have fun!

On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <zzn(a)google.com> wrote:

> Hi Dan,
>
> How are you! This is Victor. It's been a while since we met at the 2018 Wikimedia Dev Summit. I hope you are doing great.
>
> As I mentioned to you, my team works on extracting knowledge from Wikipedia. It is currently running a project to expand language coverage. My teammate Yuan Gao (cc'ed here) is the tech lead of this project. She plans to *monitor the list of all the currently available Wikipedia sites and the number of articles for each language*, so that we can compare with our extraction system's output to sanity-check whether there is a massive breakage of the extraction logic, or whether we need to add/remove languages in the event that a new Wikipedia site is introduced to / removed from the Wikipedia family.
>
> I think your team at Analytics at Wikimedia probably knows best where we can find this data. Here are 4 places we already know of, but they don't seem to have the data.
>
> - https://en.wikipedia.org/wiki/List_of_Wikipedias has the information we need, but the list is manually edited, not automatic
> - https://stats.wikimedia.org/EN/Sitemap.htm has the full list, but the information seems pretty out of date (last updated almost a month ago)
> - StatsV2 UI: https://stats.wikimedia.org/v2/#/all-projects, where I can't find the full list nor the number of articles
> - API https://wikimedia.org/api/rest_v1/, suggested by elukey on the #wikimedia-analytics channel; it doesn't seem to have # of articles information
>
> Do you know of a good place to find this information? Thank you!
>
> Victor
>
> Zainan Zhou (周载南) a.k.a. "Victor" <http://who/zzn>
> Software Engineer, Data Engine
> Google Inc.
> zzn(a)google.com <ecarmeli(a)google.com> - 650.336.5691
> 1600 Amphitheathre Pkwy, LDAP zzn, Mountain View 94043
>
> ---------- Forwarded message ----------
> From: Yuan Gao <gaoyuan(a)google.com>
> Date: Wed, Mar 28, 2018 at 4:15 PM
> Subject: Monitor the number of Wikipedia sites and the number of articles in each site
> To: Zainan Victor Zhou <zzn(a)google.com>
> Cc: Wenjie Song <wenjies(a)google.com>, WikiData <wikidata(a)google.com>
>
> Hi Victor,
> as we discussed in the meeting, I'd like to monitor:
> 1) the number of Wikipedia sites
> 2) the number of articles in each site
>
> Can you help us contact WMF to get a realtime or at least daily update of these numbers? What we can find now is https://en.wikipedia.org/wiki/List_of_Wikipedias, but the number of Wikipedia sites is manually updated, and possibly out-of-date.
>
> The monitor can help us catch such bugs.
>
> --
> Yuan Gao
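The URL pattern in the reply above can be parameterised per wiki. The sketch below builds that URL for any project and, separately, sums a monthly series into a running total. The URL layout is taken from the message. The response shape (`items` containing `results`, each with a `new_pages` count) is an assumption and should be checked against the AQS documentation before relying on it. Both function names are hypothetical.

```python
AQS_BASE = "https://wikimedia.org/api/rest_v1/metrics/edited-pages/new"


def new_pages_url(project: str, start: str, end: str) -> str:
    """Build the monthly new-pages URL for one wiki (timestamps are YYYYMMDDHH)."""
    return (
        f"{AQS_BASE}/{project}"
        f"/all-editor-types/all-page-types/monthly/{start}/{end}"
    )


def total_new_pages(response: dict, count_key: str = "new_pages") -> int:
    """Sum the monthly counts in an already-parsed JSON response.

    ASSUMPTION: the response looks like
    {"items": [{"results": [{count_key: <int>, ...}, ...]}]}.
    """
    return sum(
        result[count_key]
        for item in response.get("items", [])
        for result in item.get("results", [])
    )


print(new_pages_url("en.wikipedia.org", "2001010100", "2018032900"))
```

To cover every wiki, the same function could be called once per project name taken from the SiteMatrix list mentioned above.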

How best to accurately record page interactions in Page Previews
by Sam Smith 13 Apr '18

Hullo,

Page Previews is now fully deployed to all but 2 of the Wikipedias. In deploying it, we've created a new way to interact with pages without navigating to them. This impacts the overall and per-page pageviews metrics that are used in myriad reports, e.g. to editors about the readership of their articles and in monthly reports to the board. Consequently, we need to be able to report a user reading the preview of a page just as we report them navigating to it.

Readers Web are planning to instrument Page Previews such that when a preview is available and open for longer than X ms, a "page interaction" is recorded. We're aware of a couple of mechanisms for recording something like this from the client:

1. All files viewed with the media viewer are recorded by the client requesting the /beacon/media?duration=X&uri=Y URL at some point [0]. As Nuria points out in that thread, requests to /beacon/... are already filtered and a canned response is sent immediately by Varnish [1].
2. Requesting a URL with the X-Analytics header [2] set to "preview". In this context, we'd make a HEAD request to the URL of the page with the header set.

IMO #1 is preferable from the operations and performance perspectives, as the response is always served from the edge and includes very few headers, whereas the request in #2 may be served by the application servers if the user is logged in (or in the mobile site's beta cohort). However, the requests in #2 are already

We're currently considering recording page interactions when previews are open for longer than 1000 ms. We estimate that this would increase overall web requests by 0.3% [3].

Are there other ways of recording this information? We're fairly confident that #1 is the best choice here, but it's referred to as the "virtual file view hack". Is this really the case? Moreover, should we request a distinct URL, e.g. /beacon/preview?duration=X&uri=Y, or should we consolidate the URLs, as both essentially represent the same thing?

Thanks,
-Sam

Timezone: GMT
IRC (Freenode): phuedx

[0] https://lists.wikimedia.org/pipermail/analytics/2015-March/003633.html
[1] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb;1bce79d58e03bd02888beef986c41989e8345037$269
[2] https://wikitech.wikimedia.org/wiki/X-Analytics
[3] https://phabricator.wikimedia.org/T184793#3901365
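To make option #1 concrete, here is a small sketch of the decision logic only: a preview produces a beacon request when it has been open for longer than the threshold, and nothing otherwise. The /beacon/preview path and the 1000 ms threshold are the values floated in the message, not a deployed API. The real instrumentation would live in client-side JavaScript, and the function name here is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

# Threshold under consideration in the message above.
PREVIEW_THRESHOLD_MS = 1000


def preview_beacon_path(
    page_uri: str, open_ms: int, threshold_ms: int = PREVIEW_THRESHOLD_MS
) -> Optional[str]:
    """Return the beacon path to request, or None if the preview was too brief."""
    # "Longer than X ms" is read as a strict comparison.
    if open_ms <= threshold_ms:
        return None
    query = urlencode({"duration": open_ms, "uri": page_uri})
    return f"/beacon/preview?{query}"


print(preview_beacon_path("https://en.wikipedia.org/wiki/Cat", 1500))
```

Consolidating with the existing media beacon would only change the path prefix in this sketch, which is one way to see that the two options differ mainly in how the requests are told apart downstream.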

New SWAP (Jupyter Notebook) servers and updates!
by Andrew Otto 12 Apr '18

Hi everyone!

*tl;dr stop using notebook1001 by Monday April 2nd, use notebook1003 instead.*

*(If you don’t have production access, you can ignore this email.)*

As part of https://phabricator.wikimedia.org/T183145, we’ve ordered new hardware to replace the aging notebook1001. The new servers are ready to go, so we need to schedule a deprecation timeline for notebook1001. That timeline is Monday April 2nd. After that, your work on notebook1001 will no longer be accessible. Instead you should use notebook1003 (or notebook1004).

But there is good news too! Last week I rsynced everyone’s home directories from notebook1001 over to notebook1003. I also upgraded the default virtualenv your notebooks run from. Your notebook files should all be accessible on notebook1003. However, the version of Python 3 changed from 3.4 to 3.5 during this upgrade. Dependencies that your notebook uses and that you installed on notebook1001 may not be available at first. You might need to redo a pip install of those dependencies into the new notebook Python 3.5 virtualenv. (I can’t really give you explicit instructions to do that, as I don’t know what you use for your notebooks.)

I’ll do a final rsync of any newer files in home directories from notebook1001 on Monday April 2nd. If you’ve been working on notebook1001 since March 15th, this should get everything up to date on notebook1003 before notebook1001 goes away. BUT! *Do not work on both notebook1001 and notebook1003*! My final rsync will keep the most recently modified version of files from either server.

OOooOo and there’s even more good news! I’ve made the notebooks able to access system site packages, and installed a ton of useful packages <https://github.com/wikimedia/puppet/blob/production/modules/statistics/mani…> by default: pandas, scipy, requests, etc. If there’s something else you think you might need, let us know. Or just pip install it into your notebook.

Additionally, pyhive has been installed too, so you should be able to more easily access Hive directly from a Python notebook. I’ve updated the docs at https://wikitech.wikimedia.org/wiki/SWAP#Usage, please take a look.

If you have any questions, please don’t hesitate to ask, either here or on Phabricator: https://phabricator.wikimedia.org/T183145.

- Andrew Otto & Analytics Engineering
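Since the virtualenv moved from Python 3.4 to 3.5, one quick way to see which dependencies need reinstalling is to ask the interpreter which modules it can find. The sketch below uses only the standard library. The package list is just an example, and note that a module's import name can differ from its pip package name.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be imported here."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example list; replace with whatever your own notebooks import.
wanted = ["pandas", "scipy", "requests", "pyhive"]
print("Need to pip install:", missing_packages(wanted))
```

Run from a notebook cell on the new server, this reports what still has to be installed into the new virtualenv.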

WMF stat servers /mnt/data NFS server migration
by Madhumitha Viswanathan 02 Apr '18

Hey stat1005|6 users!

The underlying host currently providing all of your dumps and datasets needs over NFS (at /mnt/data) is being replaced soon. All datasets will continue to be accessible on the stat boxes at the current path, but there will be a transition time of a few hours. During that time, you may encounter stale data or the files may simply be inaccessible. Please schedule your work accordingly.

Dates: The migration is scheduled for April 2nd starting at 14:30 UTC, and is expected to last a few hours.

Thanks! We'll send more updates closer to the migration date. If you have any questions, just let us know.

-- Madhumitha Viswanathan & Ariel Glenn

Whitelisting the Pageviews API to avoid Content Security Policy warnings
by Leon Ziemba 26 Mar '18

Hello Analytics!

Recently, it seems browsers started throwing warnings when attempting to load resources via XHR, unless they are whitelisted with a meta tag (I think that is how it works). So for instance, in the JavaScript console, https://tools.wmflabs.org/pageviews now throws the warning:

[Report Only] Refused to connect to 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedi…' because it violates the following Content Security Policy directive: "default-src 'self' 'unsafe-eval' 'unsafe-inline' blob: data: filesystem: mediastream: *.wikibooks.org *.wikidata.org *.wikimedia.org *.wikinews.org *.wikipedia.org *.wikiquote.org *.wikisource.org *.wikiversity.org *.wikivoyage.org *.wiktionary.org *.wmflabs.org wikimediafoundation.org *.mediawiki.org". Note that 'connect-src' was not explicitly set, so 'default-src' is used as a fallback.

This is not an issue with the Pageviews API specifically, but it appears many of the tools using it are affected (Treeviews <https://tools.wmflabs.org/glamtools/treeviews/>, Wikistats <https://tools.wmflabs.org/wikistats/>, etc.). So I was hoping you kind folks would know of a solution? I've been trying to go by https://developers.google.com/web/fundamentals/security/csp/ for clues. I think we need something similar to:

<meta http-equiv="Content-Security-Policy" content="connect-src 'self' wikimedia.org;">

But this does not do the trick. Any ideas?

Many thanks,
~Leon
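One possible reading of the warning is that the policy lists `*.wikimedia.org` but the request goes to the bare host `wikimedia.org`. In CSP, a leading `*.` in a host source matches subdomains only, not the parent domain itself. The sketch below is a deliberately simplified version of that host comparison (it ignores schemes, ports, and paths) and is offered as an illustration of the rule, not as a diagnosis confirmed in this thread. The function name is hypothetical.

```python
def host_source_matches(pattern: str, host: str) -> bool:
    """Simplified CSP host-source check: compares host names only."""
    pattern, host = pattern.lower(), host.lower()
    if pattern.startswith("*."):
        # The host must end with ".example.org", so "example.org" itself fails.
        return host.endswith(pattern[1:])
    return host == pattern


print(host_source_matches("*.wikimedia.org", "wikimedia.org"))
print(host_source_matches("*.wikimedia.org", "stats.wikimedia.org"))
print(host_source_matches("wikimedia.org", "wikimedia.org"))
```

Under this reading, the first check fails and the other two succeed, which would explain why requests to subdomains pass the policy while requests to wikimedia.org are reported.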

Latency of hourly vs daily endpoints?
by Ahmed Fasih 26 Mar '18

Hello!

I have some questions about the latency of some Wikipedia REST endpoints from https://wikimedia.org/api/rest_v1

I see that I can get very recent pageviews data, e.g. https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/… accessed now, on 2018/03/22, at 0249 UTC, gives me hourly pageviews on the English Wikipedia at timestamp "2018032200", so with about 4 hours latency, very nice!

In contrast, asking for the daily number of edits via https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/en.wikipedia/all-… only gives me data up to the end of February, with no March data. This makes me think the daily datasets are generated only once a month? How might I gain access to more recent daily data from endpoints like "rest_v1/metrics/edits"?

Thanks!
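For anyone wanting to track this kind of lag programmatically, here is a minimal sketch that parses a `YYYYMMDDHH` timestamp like the one quoted above and measures how far behind a given moment it is. It treats the timestamp as the start of the hour it labels, in UTC, which is an assumption. Both function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_aqs_hour(timestamp: str) -> datetime:
    """Parse a YYYYMMDDHH timestamp as a timezone-aware UTC datetime."""
    return datetime.strptime(timestamp, "%Y%m%d%H").replace(tzinfo=timezone.utc)


def hours_behind(latest_timestamp: str, now: datetime) -> float:
    """Hours elapsed between the start of the latest data hour and `now`."""
    return (now - parse_aqs_hour(latest_timestamp)).total_seconds() / 3600


# Example with made-up times: latest data hour vs. a later moment the same day.
example_now = datetime(2018, 3, 22, 6, 0, tzinfo=timezone.utc)
print(hours_behind("2018032200", example_now))
```

Passing `datetime.now(timezone.utc)` as `now` gives the current lag for whatever timestamp the endpoint last returned.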

Oversampled data within the NavigationTiming EL stream
by Ian Marlier 23 Mar '18

General notification in case there are others consuming from eventlogging_NavigationTiming: The performance team recently instituted oversampling of data based on configurable criteria. This means that in some cases, the data stream on this topic may not be representative of wiki users generally. If you wish to parse NavigationTiming data in a representative way, you should check the attribute 'is_oversample' in the event object, and filter out the message if true. (In the event that a single page load is part of the regular sample as well, two messages will be emitted with the same data, but with different values for the is_oversample parameter.) Please let me know if you have any questions. - Ian
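The filtering described above can be sketched as a small generator. The `is_oversample` attribute name comes from the message. The surrounding message layout (a dict with a nested `event` dict) is an assumption about how the consumer has already decoded each message, and the function name is hypothetical.

```python
from typing import Iterable, Iterator


def representative_events(messages: Iterable[dict]) -> Iterator[dict]:
    """Yield only messages that are not part of an oversample."""
    for message in messages:
        # ASSUMPTION: each decoded message has an "event" dict holding the flag.
        event = message.get("event", {})
        if event.get("is_oversample"):
            continue  # oversampled copy: skip to keep the sample representative
        yield message


sample = [
    {"event": {"is_oversample": False, "responseStart": 120}},
    {"event": {"is_oversample": True, "responseStart": 120}},
]
print(list(representative_events(sample)))
```

Because a page load in both samples is emitted twice, dropping the `is_oversample` copies keeps exactly one message per regularly sampled page load.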

Research Showcase March 21, 2018 (11:30 AM PDT | 18:30 UTC)
by Sarah R 21 Mar '18

Hi Everyone,

The next Research Showcase will be live-streamed this Wednesday, March 21, 2018 at 11:30 AM PDT (18:30 UTC).

YouTube stream: https://www.youtube.com/watch?v=ACevHs0sMMw

As usual, you can join the conversation on IRC at #wikimedia-research. And, you can watch our past research showcases here <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#March_2018>.

Over the past years, the Research team at the Wikimedia Foundation and some of our formal collaborators have been focused on doing research and building technologies that can help editors across Wikimedia languages find tasks for contributions. While the early effort was heavily focused on article recommendation for creation (horizontal expansion), in 2016 we started a new direction of research with a focus on vertical expansion of Wikipedia articles. The two talks in the March 2018 Research Showcase will share some of what we have learned from this research. More specifically, we will talk about the Wikipedia category network as a great signal for creating templates/structures for Wikipedia articles, as well as ongoing research to learn what content (sections) is missing from Wikipedia across its many languages. The two corresponding abstracts with more details are below. Join us! :)

Using Wikipedia categories for research: opportunities, challenges, and solutions
By *Tiziano Piccardi, EPFL*

The category network in Wikipedia is used by editors as a way to label articles and organize them in a hierarchical structure. This manually created and curated network of 1.6 million nodes in English Wikipedia, generated by arranging the categories in a child-parent relation (e.g., Scientists-People, Cities-Human Settlement), allows researchers to infer valuable relations between concepts. A clean structure in this format would be a valuable resource for a variety of tools and applications, including automatic reasoning tools. Unfortunately, the Wikipedia category network contains some "noise", since in many cases the association as subcategory does not define an is-a relation (Scientists is-a People vs. Billionaires is-a Wealth). Inspired to develop a model for recommending sections to be added to already existing Wikipedia articles, we developed a method to clean this network and to keep only the categories that have a high chance of being associated with their children by an is-a relation. The strategy is based on the concept of "pure" categories, and the algorithm uses the types of the attached articles to determine how homogeneous the category is. The approach does not rely on any linguistic feature and therefore is suitable for all Wikipedia languages. In this talk, we will discuss the high-level overview of the algorithm and some of the possible applications for the generated network beyond article section recommendations.

Beyond Automatic Translation: Aligning Wikipedia sections across multiple languages
By *Diego Saez-Trumper*

Sections are the building blocks of Wikipedia articles. For editors, they can be used as an entry point for creating and expanding articles. For readers, they enhance the readability of Wikipedia content. In this talk, we present ongoing research to align article sections across Wikipedia languages. We show how the available technology for automatic translation is not good enough for translating section titles. We then show a complementary approach for section alignment, using Wikidata and cross-lingual word embeddings. We will present some of the use cases of a methodology for aligning sections across languages, including improved section recommendation, especially in medium to smaller size languages where the language itself may not contain enough signal about the structure of the articles and signals can be inferred from other, larger Wikipedia languages.

Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation
srodlund(a)wikimedia.org

Wikipedia internal search clickstream
by Georg Sorst 15 Mar '18

Hi list,

as part of a lecture on Information Retrieval that I am giving, we work a lot with Simple Wikipedia articles. It's a great data set because it's comprehensive and not domain specific, so when building search on top of it humans can easily judge result quality, and it's still small enough to be handled by a regular computer.

This year I want to cover the topic of Machine Learning for search. The idea is to look at result clicks from an internal search engine, feed them into the machine learning model, and adjust search accordingly so that the top-clicked results actually rank best. We will be using Solr LTR for this purpose.

I would love to base this on Simple Wikipedia data since it would fit well into the rest of the lecture. Unfortunately, I could not find that data. The closest I came is https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream but this neither covers Simple Wikipedia nor specifies internal search queries.

Did I miss something? Is this data available somewhere? Can I produce it myself from raw data? Ideally I would need (query-document) pairs with the number of occurrences.

Thank you!
Georg

--
*Georg M. Sorst | CTO*
Jakob-Haringer-Str. 5a | 5020 Salzburg <https://maps.google.com/?q=Jakob-Haringer-Str.+5a+%7C+5020%C2%A0Salzburg&en…>
T.: +43 662 456708 <+43%20662%20456708>
E.: g.sorst(a)findologic.com
www.findologic.com

Follow us on: XING <https://www.xing.com/profile/Georg_Sorst> facebook <http://www.facebook.com/Findologic/> Twitter <https://twitter.com/findologic>

See you at *Internet World* on 6 and 7 March 2018 in *Hall A6, Stand E130 in Munich*! Book an appointment here <beratung(a)findologic.com?subject=Internet%20World%20M%C3%BCnchen>!
See you at *SHOPTALK* from 18 to 21 March in *Las Vegas*! Book an appointment here <beratung(a)findologic.com?subject=SHOPTALK%20Las%20Vegas>!
See you at *SOM* on 18 and 19 April 2018 in *Hall 7, Stand G.17 in Zurich*! Book an appointment here <beratung(a)findologic.com?subject=SOM%20Z%C3%BCrich>!
Visit our *homepage* here <http://www.findologic.com>!
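As an illustration of the requested data shape, "(query-document) pairs with the number of occurrences", here is a minimal sketch that aggregates raw click records into such counts. The input format (an iterable of query and clicked-page tuples) is hypothetical, since no such public log is identified in this message, and the light query normalisation is a design choice rather than a requirement.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_query_document_pairs(clicks: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each (query, clicked document) pair occurs."""
    # Normalise queries lightly so "Cat" and " cat " count as the same query.
    return Counter((query.strip().lower(), doc) for query, doc in clicks)


# Made-up click records, for illustration only.
clicks = [("Cat", "Cat"), (" cat ", "Cat"), ("cat", "Felidae"), ("dog", "Dog")]
for (query, doc), occurrences in count_query_document_pairs(clicks).most_common():
    print(query, doc, occurrences)
```

Counts in this form could then be turned into relevance labels for a learning-to-rank setup such as the Solr LTR one mentioned above.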
