The English Wikipedia is showing a pattern that I don't see on several
other wikis. If I'm not mistaken, in April 2020 monthly active editors
passed 43k for the first time since 2011 (the year MobileFrontend was
created).
<https://stats.wikimedia.org/#/en.wikipedia.org/contributing/active-editors/…>
(As usual, the number will deflate somewhat in a few months, after the
deletions have run their course; the 43k threshold may still hold.)
The April peak looks like it continued and reinforced one of the
now-usual October/January/March peaks. Do we know how much of this
growth is organic or across the board, and how much is amplification of
already-known seasonal patterns (WikiEdu?)?
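For anyone who wants to pull the underlying numbers themselves, here is a
minimal sketch against the public AQS editors endpoint. The endpoint path,
parameter values, and response fields are my reading of the API docs
(verify against https://wikimedia.org/api/rest_v1/); "active editors" is
conventionally registered, non-bot editors with 5+ edits in the month,
hence the sum over the three activity buckets at or above that threshold.

    # Hedged sketch: monthly active editors via the public AQS editors API.
    # Endpoint path, parameters and the "editors" field are assumptions
    # from the docs -- verify before relying on them.
    import requests

    BASE = "https://wikimedia.org/api/rest_v1/metrics/editors/aggregate"
    BUCKETS = ["5..24-edits", "25..99-edits", "100..-edits"]  # 5+ edits/month

    def monthly_active_editors(project, start, end):
        totals = {}
        for bucket in BUCKETS:
            url = f"{BASE}/{project}/user/content/{bucket}/monthly/{start}/{end}"
            resp = requests.get(url, headers={"User-Agent": "active-editors-check"})
            resp.raise_for_status()
            for row in resp.json()["items"][0]["results"]:
                month = row["timestamp"][:7]  # e.g. "2020-04"
                totals[month] = totals.get(month, 0) + row["editors"]
        return totals

    print(monthly_active_editors("en.wikipedia.org", "20200101", "20200501"))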
Federico
Hi everybody,
Superset has been upgraded to 0.36 (latest upstream)!
<https://phabricator.wikimedia.org/T249495>
Please note: if I understood correctly, some charts now require an
explicit time window to work correctly; a -infinite to +infinite range no
longer seems to be supported. So if your charts look strange (e.g. all
datapoints collapsed), please check their time ranges :)
If you see any issues or regressions, please report them at
https://phabricator.wikimedia.org/T249495
Thanks!
Luca (on behalf of the Analytics team)
The WMF Research team has published a new pageview report of inbound
traffic coming from Facebook, Twitter, YouTube, and Reddit.[1]
The report contains a list of all articles that received at least 500 views
from one or more of these platforms (i.e. someone clicked a link on Twitter
that sent them directly to a Wikipedia article). The report is available
on-wiki and will be updated daily at around 14:00 UTC with traffic counts
from the previous calendar day.
We believe this report provides editors with a valuable new source of
information. Daily inbound social media traffic stats can help editors
monitor edits to articles that are going viral on social media and/or are
being linked to by the platforms themselves in order to fact-check
disinformation and other controversial content [2][3].
The social media traffic report also contains additional public article
metadata that may be useful when monitoring articles that are receiving
unexpected attention from social media sites (a query sketch follows the
list), such as:
- the total number of pageviews (from all sources) that article received
in the same period of time
- the number of pageviews the article received from the same platform
(e.g. Facebook) on the previous day, i.e. two days before the report date
- the number of editors who have the page on their watchlist
- the number of editors who have watchlisted the page AND recently
visited it
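As a rough illustration of where some of these numbers can be checked
independently, here is a sketch combining the public Pageviews REST API
with the MediaWiki Action API's watcher counts. The article title and date
are examples, and the watcher fields can be suppressed by the API for
lightly-watched pages.

    # Hedged sketch: cross-check an article's total pageviews plus its
    # watcher counts using public APIs. Article and date are examples.
    import requests

    HEADERS = {"User-Agent": "smt-report-crosscheck"}
    article = "Coronavirus"  # example article

    # Total pageviews ('user' agent, all access methods) for one day.
    pv_url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"en.wikipedia/all-access/user/{article}/daily/20200501/20200501"
    )
    views = requests.get(pv_url, headers=HEADERS).json()["items"][0]["views"]

    # Watchers and recently-visiting watchers via the Action API.
    params = {
        "action": "query", "format": "json", "titles": article,
        "prop": "info", "inprop": "watchers|visitingwatchers",
    }
    resp = requests.get("https://en.wikipedia.org/w/api.php",
                        params=params, headers=HEADERS).json()
    page = next(iter(resp["query"]["pages"].values()))
    print(views, page.get("watchers"), page.get("visitingwatchers"))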
We want your feedback! We have some ideas of our own for how to improve
the report, but we want to hear yours. If you have feature suggestions,
please add them here [4]. We intend to maintain this daily report for at
least the next two months. If we receive feedback that the report is
useful, we will consider making it available indefinitely.
If you have other questions about the report, please first check out our
(still growing) FAQ [5]. All questions, comments, concerns, ideas, etc.
are welcome on the project talk page on Meta [4].
1. https://en.wikipedia.org/wiki/User:HostBot/Social_media_traffic_report
2. https://www.engadget.com/2018/03/15/wikipedia-unaware-would-be-youtube-fact…
3. https://mashable.com/2017/10/05/facebook-wikipedia-context-articles-news-fe…
4. https://meta.wikimedia.org/wiki/Research_talk:Social_media_traffic_report_p…
5. https://meta.wikimedia.org/wiki/Research:Social_media_traffic_report_pilot/…
Cheers,
Jonathan
--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
(Uses He/Him)
*Please note that I do not expect a response from you on evenings or
weekends*
Hi all,
The next Research Showcase will be live-streamed on Wednesday, May 20, at
9:30 AM PDT/16:30 UTC.
This month we will learn about recent research on machine learning systems
that rely on human supervision for their learning and optimization -- a
research area commonly referred to as Human-in-the-Loop ML. In the first
talk, Jie Yang will present a computational framework that relies on
crowdsourcing to identify influencers in social networks (Twitter) by
selectively obtaining labeled data. In the second talk, Estelle Smith will
discuss the role of the community in maintaining ORES, the machine
learning system that predicts content quality and is used in numerous
Wikipedia applications.
YouTube stream: https://www.youtube.com/watch?v=8nDiu2ebdOI
As usual, you can join the conversation on IRC at #wikimedia-research. You
can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
This month's presentations:
*OpenCrowd: A Human-AI Collaborative Approach for Finding Social
Influencers via Open-Ended Answers Aggregation*
By: Jie Yang, Amazon (current), Delft University of Technology (starting
soon)
Finding social influencers is a fundamental task in many online
applications ranging from brand marketing to opinion mining. Existing
methods heavily rely on the availability of expert labels, whose collection
is usually a laborious process even for domain experts. Using open-ended
questions, crowdsourcing provides a cost-effective way to find a large
number of social influencers in a short time. Individual crowd workers,
however, only possess fragmented knowledge that is often of low quality. To
tackle those issues, we present OpenCrowd, a unified Bayesian framework
that seamlessly incorporates machine learning and crowdsourcing for
effectively finding social influencers. To infer a set of influencers,
OpenCrowd bootstraps the learning process using a small number of expert
labels and then jointly learns a feature-based answer quality model and the
reliability of the workers. Model parameters and worker reliability are
updated iteratively, allowing their learning processes to benefit from each
other until an agreement on the quality of the answers is reached. We
derive a principled optimization algorithm based on variational inference
with efficient updating rules for learning OpenCrowd parameters.
Experimental results on finding social influencers in different domains
show that our approach substantially improves the state of the art by 11.5%
AUC. Moreover, we empirically show that our approach is particularly useful
in finding micro-influencers, who are very directly engaged with smaller
audiences.
Paper: https://dl.acm.org/doi/fullHtml/10.1145/3366423.3380254
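For intuition only, here is a toy sketch of the alternating-update idea
from the abstract (weight answers by worker reliability, then re-score
workers by agreement with the current quality estimates). The real
OpenCrowd is a feature-based Bayesian model fit with variational
inference, so treat this as a cartoon, not the paper's method.

    # Toy illustration (not OpenCrowd): jointly estimate answer quality
    # and worker reliability by iterating until they agree.
    import numpy as np

    def joint_estimate(answers, n_iters=50):
        """answers: (workers x candidates) binary matrix; answers[w, c] = 1
        if worker w nominated candidate c as an influencer."""
        n_workers, _ = answers.shape
        reliability = np.full(n_workers, 0.8)  # prior worker reliability
        for _ in range(n_iters):
            # Answer quality: reliability-weighted vote share per candidate.
            quality = reliability @ answers / reliability.sum()
            # Worker reliability: average agreement with current quality.
            agreement = answers * quality + (1 - answers) * (1 - quality)
            reliability = agreement.mean(axis=1)
        return quality, reliability

    votes = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1]])  # 3 workers, 3 candidates
    print(joint_estimate(votes))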
*Keeping Community in the Machine-Learning Loop*
By: C. Estelle Smith, MS, PhD Candidate, GroupLens Research Lab at the
University of Minnesota
On Wikipedia, sophisticated algorithmic tools are used to assess the
quality of edits and take corrective actions. However, algorithms can fail
to solve the problems they were designed for if they conflict with the
values of communities who use them. In this study, we take a
Value-Sensitive Algorithm Design approach to understanding a
community-created and -maintained machine learning-based algorithm called
the Objective Revision Evaluation System (ORES)—a quality prediction system
used in numerous Wikipedia applications and contexts. Five major values
converged across stakeholder groups that ORES (and its dependent
applications) should: (1) reduce the effort of community maintenance, (2)
maintain human judgement as the final authority, (3) support differing
peoples’ differing workflows, (4) encourage positive engagement with
diverse editor groups, and (5) establish trustworthiness of people and
algorithms within the community. We reveal tensions between these values
and discuss implications for future research to improve algorithms like
ORES.
Paper:
https://commons.wikimedia.org/wiki/File:Keeping_Community_in_the_Loop-_Unde…
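For anyone who has not used ORES directly, scores can be fetched from its
public HTTP API; a minimal sketch follows (the model name and response
layout are as I recall them from the docs, so verify against
https://ores.wikimedia.org).

    # Hedged sketch: fetch an ORES 'damaging' prediction for a revision.
    # Response structure from memory of the v3 API docs -- verify first.
    import requests

    rev_id = 963110117  # example revision ID
    url = f"https://ores.wikimedia.org/v3/scores/enwiki/{rev_id}/damaging"
    resp = requests.get(url, headers={"User-Agent": "ores-example"})
    resp.raise_for_status()

    score = resp.json()["enwiki"]["scores"][str(rev_id)]["damaging"]["score"]
    print(score["prediction"], score["probability"])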
--
Janna Layton (she, her)
Administrative Assistant - Product & Technology
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello:
We have added an 'automated' marker to Wikimedia's pageview data. Up to
now, pageview agents were classified as either 'spider' (self-reported
bots such as Googlebot or Bingbot) or 'user'.
We have known for a while that some requests classified as 'user' were, in
fact, coming from automated agents not disclosed as such. This has been
well known in our community: for a couple of years now, editors have been
applying filtering rules to every "Top X" list they compile [1]. We have
incorporated some of these filters (and others) into our automated traffic
detection and, as of this week, traffic that meets the filtering criteria
is automatically excluded from the "top" lists reported by the pageview
API.
Removing pageviews marked as 'automated' from overall user traffic reduces
pageviews labeled as 'user' by about 5.6% [2] over the course of a month.
Not all projects are affected equally. The biggest effect is on the
English Wikipedia (8-10%), while projects like the Japanese Wikipedia are
only mildly affected (< 1%).
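These percentages can be reproduced from the aggregate pageview endpoint,
which now accepts 'automated' as an agent type; here is a sketch (the date
range is an example).

    # Hedged sketch: share of formerly-"user" traffic now reclassified as
    # 'automated', via the aggregate pageview API. Date range is an example.
    import requests

    BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate"
    HEADERS = {"User-Agent": "automated-share-check"}

    def monthly_views(project, agent):
        url = f"{BASE}/{project}/all-access/{agent}/monthly/2020040100/2020043000"
        resp = requests.get(url, headers=HEADERS)
        resp.raise_for_status()
        return resp.json()["items"][0]["views"]

    for project in ("en.wikipedia.org", "ja.wikipedia.org"):
        user = monthly_views(project, "user")
        automated = monthly_views(project, "automated")
        print(project, f"{automated / (user + automated):.1%}")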
If you are curious as to what problems this type of traffic causes in the
data, this ticket about the Hungarian Wikipedia is a good example of the
issues inflicted by what we call "bot vandalism" or "bot spam":
https://phabricator.wikimedia.org/T237282
Given the delicate nature of this data, we have spent many months vetting
the algorithms we are using. We would appreciate reports, via Phabricator
ticket, of any issues you might find.
Thanks,
Nuria
[1] https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetectio…
Hello everybody,
I am pretty sure that your first thought on reading this email's subject
was "whatttttttttt!!!!???? Are you crazy, Luca?". There is a valid reason,
so please keep reading :D
In T243934 we worked a lot to reduce the complexity of our client nodes,
and as a result all the stat100x hosts now have the same configuration
(see https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients), which
includes JupyterHub. SSH access to the nodes was also simplified a lot;
see https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups.
The notebook100x hosts have the downside of offering little disk space for
home directories, and it would be great to reduce the number of hosts
handled by Analytics. This is why I am announcing that at the beginning of
June I will start decommissioning the notebook100x nodes. I have added
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients#Rsync_between…
as a reference for a quick and easy way to transfer data between notebook
and stat boxes, but please feel free to reach out to us at
https://phabricator.wikimedia.org/T249752 with any issues or doubts.
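If it helps while migrating, the transfer boils down to an rsync from your
notebook home to a stat box. A hypothetical sketch follows (hostname and
paths are examples only; the wikitech page above describes the supported
procedure):

    # Hypothetical sketch of a notebook -> stat transfer. Hostname and
    # paths are examples; follow the wikitech page for the supported steps.
    import subprocess

    src = "/home/myuser/notebooks/"                        # example source
    dest = "stat1007.eqiad.wmnet:/home/myuser/notebooks/"  # example stat box

    subprocess.run(["rsync", "-av", "--progress", src, dest], check=True)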
Thanks in advance!
Luca (on behalf of the Analytics team)