Hi everyone,
We are delighted to announce that Wiki Workshop 2020 will be held in
Taipei on April 20 or 21, 2020 (the exact date will be finalized soon) as
part of the Web Conference 2020 [1]. In past years, Wiki Workshop
has traveled to Oxford, Montreal, Cologne, Perth, Lyon, and San
Francisco.
You can read more about the call for papers and the workshop at
http://wikiworkshop.org/2020/#call. Please note that the deadline for
submissions to be considered for the proceedings is January 17. All
other submissions should be received by February 21.
If you have questions about the workshop, please let us know on this
list or at wikiworkshop(a)googlegroups.com.
Looking forward to seeing you in Taipei.
Best,
Miriam Redi, Wikimedia Foundation
Bob West, EPFL
Leila Zia, Wikimedia Foundation
[1] https://www2020.thewebconf.org/
Hello everyone - apologies for cross-posting! *TL;DR*: We would like your
feedback on our Metrics Kit project. Please have a look and comment on
Meta-Wiki:
https://meta.wikimedia.org/wiki/Community_health_initiative/Metrics_kit
The Wikimedia Foundation's Trust and Safety team, in collaboration with the
Community Health Initiative, is working on a Metrics Kit designed to
measure the relative "health"[1] of various communities that make up the
Wikimedia movement:
https://meta.wikimedia.org/wiki/Community_health_initiative/Metrics_kit
The ultimate outcome will be a public suite of statistics and data looking
at various aspects of Wikimedia project communities. This could be used both
by community members, to make decisions about their community's direction,
and by Wikimedia Foundation staff, to point anti-harassment tool development
in the right direction.
We have a set of metrics we are thinking about including in the kit,
ranging from the ratio of active users to active administrators and
administrator confidence levels to off-wiki factors such as freedom to
participate. It's ambitious, and our methods of collecting such data will
vary.
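As an illustration of the first example, the active-users-to-administrators
ratio for a single wiki can be pulled from the MediaWiki Action API's
siteinfo statistics, roughly as sketched below in Python. Which counts and
definitions the kit will actually use is still open, so treat the field
choices here purely as an illustration.

    import requests

    # Sketch: ratio of active users to administrators for one wiki, using
    # the Action API's siteinfo statistics. Which counts the Metrics Kit
    # will actually use is still an open question.
    API = "https://en.wikipedia.org/w/api.php"

    resp = requests.get(API, params={
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }, headers={"User-Agent": "metrics-kit-sketch/0.1 (example)"})
    stats = resp.json()["query"]["statistics"]

    active_users = stats["activeusers"]  # users active in the last 30 days
    admins = stats["admins"]             # users in the sysop group
    print(f"Active users per administrator: {active_users / admins:.1f}")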
Right now, we'd like to know:
* Which metrics make sense to collect? Which don't? What are we missing?
* Where would such a tool ideally be hosted? Where would you normally look
for statistics like these?
* We are aware of the overlap in scope between this and Wikistats <
https://stats.wikimedia.org/v2/#/all-projects> — how might these tools
coexist?
Your opinions will help to guide this project going forward. We'll be
reaching out at different stages of the project, so if you'd like to receive
direct messages as it progresses, please feel free to indicate your
interest by signing up on the consultation page.
Looking forward to reading your thoughts.
best,
Joe
P.S.: Please feel free to CC me in conversations that might happen on this
list!
[1] What do we mean by "health"? There is no standard definition of what
makes a Wikimedia community "healthy", but there are many indicators that
highlight where a wiki is doing well and where it could improve. This
project aims to provide a variety of useful data points to inform
community decisions that would benefit from objective data.
--
*Joe Sutherland* (he/him or they/them)
Trust and Safety Specialist
Wikimedia Foundation
joesutherland.rocks
Hi everybody!
I created the following doc:
https://wikitech.wikimedia.org/wiki/Analytics/Tutorials/Analytics_Client_No…
It contains two FAQs:
- How do I ensure that there is enough space on disk before storing big
datasets/files?
- How do I check the space used by my files/data on stat/notebook hosts?
Please read them and let me know if anything is unclear or missing. We
have plenty of space on stat100X hosts, but we tend to cluster on single
machines like stat1007 for some reason, ending up fighting for resources.
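In case it's useful, here is a small Python equivalent of those two checks
(just a sketch; the wikitech page above is the authoritative reference, and
plain df/du on the host do the same job):

    import shutil
    from pathlib import Path

    # Quick checks before writing a big dataset on a stat/notebook host.
    home = Path.home()

    # Free space on the filesystem holding your home directory.
    usage = shutil.disk_usage(home)
    print(f"Free: {usage.free / 2**30:.1f} GiB of {usage.total / 2**30:.1f} GiB")

    # Rough size of what you already store under your home directory
    # (this walks every file, so it can take a while on large trees).
    my_bytes = sum(f.stat().st_size for f in home.rglob("*") if f.is_file())
    print(f"My files: {my_bytes / 2**30:.1f} GiB")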
On a related note, we are going to work on unifying stat/notebook puppet
configs in https://phabricator.wikimedia.org/T243934, so eventually all
Analytics clients will be exactly the same.
Thanks!
Luca (on behalf of the Analytics team)
Hey there!
I was running SQL queries via PySpark (using the wmfdata package
<https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py>) on
SWAP when one of my queries failed with "java.lang.OutOfMemoryError: Java
heap space".
After that, when I tried to call the spark.sql function again (via
wmfdata.hive.run), it failed with "java.lang.IllegalStateException: Cannot
call methods on a stopped SparkContext."
When I tried to create a new Spark session using
SparkSession.builder.getOrCreate (whether using wmfdata.spark.get_session
or directly), it returned a SparkSession object properly, but calling the
object's sql function still gave the "stopped SparkContext" error.
Any idea what's going on? I assume restarting the notebook kernel would
take care of the problem, but it seems like there has to be a better way to
recover.
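One thing I haven't tried yet is explicitly stopping the stale session
before rebuilding it, roughly as below; I'm not sure it reliably clears the
stopped context on the PySpark version we run, hence the question:

    from pyspark.sql import SparkSession

    # After the OutOfMemoryError the underlying SparkContext is stopped, but
    # builder.getOrCreate() can still hand back the old, dead session.
    # Explicitly stopping it first should clear the cached singleton so that
    # the builder creates a new one (behavior may vary by PySpark version).
    stale = SparkSession.builder.getOrCreate()
    stale.stop()

    spark = (
        SparkSession.builder
        .appName("recovered-session")
        # Bumping driver memory is an assumption about what caused the OOM.
        .config("spark.driver.memory", "4g")
        .getOrCreate()
    )
    spark.sql("SELECT 1").show()  # sanity check that the new context works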
Thank you!
Hello colleagues,
I'm forwarding this announcement to additional email lists.
Most public WMF meetings that are livestreamed on YouTube remain
available for replay after the meeting, and I'm guessing that this one
will be as well.
Pine
( https://meta.wikimedia.org/wiki/User:Pine )
---------- Forwarded message ---------
From: Srishti Sethi <ssethi(a)wikimedia.org>
Date: Mon, Feb 24, 2020 at 8:59 PM
Subject: Re: [Wikitech-l] [Wikimedia Technical Talks] Data and
Decision Science at Wikimedia with Kate Zimmerman, 26 February 2020 @
6PM UTC
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Hello folks,
Just a reminder that this talk will take place Wednesday 26 February 2020
at 6 PM UTC.
Hope to see you there!
Cheers,
Srishti
*Srishti Sethi*
Developer Advocate
Wikimedia Foundation <https://wikimediafoundation.org/>
On Tue, Feb 18, 2020 at 2:57 PM Sarah R <srodlund(a)wikimedia.org> wrote:
> Hello Everyone,
>
> It's time for Wikimedia Tech Talks 2020 Episode 1! This talk will take
> place on *26 February 2020 at 6 PM UTC*.
>
> This month's talk will be in an interview format. You are invited to send
> questions ahead of time by replying to this email, or you can ask during the
> Q&A section of the live talk through IRC or the YouTube livestream.
>
> Title: Data and Decision Science at Wikimedia
>
> Speaker: Kate Zimmerman, Head of Product Analytics at Wikimedia
>
> Summary:
>
> How do teams at the Foundation use data to inform decisions?
>
> Sarah R. Rodlund talks with Kate Zimmerman, Head of Product Analytics at
> Wikimedia, about what sorts of data her team uses and how insights from
> their analysis have shaped product decisions.
>
> Kate Zimmerman holds an MS in Psychology & Behavioral Decision Research
> from Carnegie Mellon University and has over 15 years of experience in
> quantitative and experimental methods. Before joining Wikimedia, she built
> data teams from scratch at ModCloth and SmugMug, evolving their data
> capabilities from basic reports to strategic analysis, automated
> dashboards, and advanced modeling.
>
> The link to the YouTube livestream can be found here:
> https://www.youtube.com/watch?v=J-CRsiwYM9w
>
> During the live talk, you are invited to join the discussion on IRC at
> #wikimedia-office
>
> You can watch past Tech Talks here:
> https://www.mediawiki.org/wiki/Tech_talks
>
> If you are interested in giving your own tech talk, you can learn more
> here:
>
> https://www.mediawiki.org/wiki/Project:Calendar/How_to_schedule_an_event#Te…
>
> Note: This is a public talk. Feel free to distribute through appropriate
> email and social channels!
>
> Many kindnesses,
>
> Sarah R. Rodlund
> Technical Writer, Developer Advocacy
> srodlund(a)wikimedia.org
Hi all,
join us for our monthly Analytics/Research Office hours on 2020-02-26 at
17:00-18:00 (UTC). Bring all your research questions and ideas to discuss
projects, data, analysis, etc.
To participate, please join the IRC channel: #wikimedia-research [1].
More detailed information can be found here [2] or on the etherpad [3] if
you would like to add items to the agenda or check notes from previous meetings.
Best,
Martin
[1] irc://chat.freenode.net:6667/wikimedia-research
[2] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
[3] https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours
--
Martin Gerlach
Research Scientist
Wikimedia Foundation
If your team uses mw.user.sessionId() for instrumentation, a recent change
to MediaWiki could impact your numbers.
The new patch <https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/572011/>
changes
the way that session IDs work, bringing their behavior closer to other
platforms that many of us are familiar with.
The value returned from mw.user.sessionId() will now:
- be the same in different tabs of the same browser process
- be the same in different windows of the same browser process
- be forgotten once the browser process ends
Since 2017, values returned from mw.user.sessionId() have only been
constant within the same browser tab, and only lasted until the tab was
closed. This had gone unnoticed until recently. See T223931
<https://phabricator.wikimedia.org/T223931> for more details. This patch
restores pre-2017 behavior.
If you have any questions about the change, or if you notice any
irregularities in your data or instrumentation, reach out or tag jlinehan,
mpopov, or the Better Use of Data topic on Phabricator.
-Jason
Hi all,
The next Research Showcase will be live-streamed on Wednesday, February 19,
at 9:30 AM PST/17:30 UTC. We’ll have presentations from Jeffrey V.
Nickerson on human/machine collaboration on Wikipedia, and Lucie-Aimée
Kaffee on human/machine collaboration on Wikidata. A question-and-answer
session will follow.
YouTube stream: https://www.youtube.com/watch?v=fj0z20PuGIk
As usual, you can join the conversation on IRC at #wikimedia-research. You
can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
This month's presentations:
Autonomous tools and the design of work
By Jeffrey V. Nickerson, Stevens Institute of Technology
Bots and other software tools that exhibit autonomy can appear in an
organization to be more like employees than commodities. As a result,
humans delegate to machines. Sometimes the machines turn and delegate part
of the work back to humans. This talk will discuss how the design of human
work is changing, drawing on a recent study of editors and bots in
Wikipedia, as well as a study of game and chip designers. The Wikipedia bot
ecosystem, and how bots evolve, will be discussed. Humans are working
together with machines in complex configurations; this puts constraints on
not only the machines but also the humans. Both software and human skills
change as a result. Paper
<https://dl.acm.org/doi/pdf/10.1145/3359317?download=true>
When Humans and Machines Collaborate: Cross-lingual Label Editing in
Wikidata
By Lucie-Aimée Kaffee, University of Southampton
The quality and maintainability of any knowledge graph are strongly
influenced by the way it is created. In the case of Wikidata, the knowledge
graph is created and maintained by a hybrid approach of human editing
supported by automated tools. We analyse the editing of natural language
data, i.e. labels. Labels are the entry point for humans to understand the
information, and therefore need to be carefully maintained. Wikidata is a
good example of a hybrid multilingual knowledge graph, as it has a large
and active community of humans and bots working together, covering over 300
languages. In this work, we analyse the different editor groups and how
they interact with the different language data to understand the provenance
of the current label data. This presentation is based on the paper “When
Humans and Machines Collaborate: Cross-lingual Label Editing in Wikidata”,
published at OpenSym 2019 in collaboration with Kemele M. Endris and Elena
Simperl. Paper
<https://opensym.org/wp-content/uploads/2019/08/os19-paper-A16-kaffee.pdf>
--
Janna Layton (she, her)
Administrative Assistant - Product & Technology
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi Giovanni,
The pagelinks table is great for temporal snapshots: you know about links
between pages at the time of the query. Parsing the wikitext is needed to
provide a historical view of the links :)
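To make the trade-off concrete, a current snapshot is a single query against
the replicas, roughly along these lines in Python (the pagelinks and page
columns are the standard MediaWiki schema, but the replica host name here is
an assumption; adjust it to wherever you normally connect from):

    import os
    import pymysql

    # Sketch: current outgoing links for one article, from the live
    # pagelinks table. This is a snapshot as of query time only.
    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.wikimedia.cloud",  # assumed replica host
        database="enwiki_p",
        read_default_file=os.path.expanduser("~/.my.cnf"),  # replica credentials
    )

    query = """
        SELECT pl_namespace, pl_title
        FROM pagelinks
        JOIN page ON pl_from = page_id
        WHERE page_namespace = 0
          AND page_title = 'Wikipedia'
    """
    with conn.cursor() as cur:
        cur.execute(query)
        links = cur.fetchall()
    print(f"{len(links)} current outgoing links")

The historical view, by contrast, needs the wikitext of every revision
parsed, which is why it is not available yet.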
Cheers
Joseph
On Tue, Feb 18, 2020 at 12:22 AM Giovanni Luca Ciampaglia <glc3(a)mail.usf.edu>
wrote:
> Thank you Joseph; great to hear there is interest in building such a
> dataset. You say that the link information would need to be parsed from
> wikitext, which is complicated; would the pagelinks table help as an
> alternative source of data?
>
> *Giovanni Luca Ciampaglia* ∙ glciampaglia.com
> Assistant Professor
> Computer Science and Engineering
> <https://www.usf.edu/engineering/cse/> ∙ University
> of South Florida <https://www.usf.edu/>
>
> *Due to Florida’s broad open records law, email to or from university
> employees is public record, available to the public and the media upon
> request.*
>
>
> On Thu, Feb 13, 2020 at 9:27 AM Joseph Allemandou <
> jallemandou(a)wikimedia.org>
> wrote:
>
> > Hi Giovanni,
> > Thank you for your message :)
> > You are correct: there is no information on page-to-page links as of
> > today, and also no information on, for instance, historical values of
> > revisions being redirects.
> > We agree that such information is extremely valuable, and
> > we intend to extract it at some point.
> > The reason it has not yet been done is that those pieces
> > of information are only available by parsing the wikitext of every
> > revision, which is not only resource-intensive but also complicated
> > technically (templates, version changes, etc.).
> > You can be sure we will send another announcement when we release that
> > data :)
> > Best,
> >
> > On Tue, Feb 11, 2020 at 10:30 PM Giovanni Luca Ciampaglia <
> > glc3(a)mail.usf.edu>
> > wrote:
> >
> > > Hi Joseph,
> > >
> > > Thanks a lot for creating and sharing such a valuable resource. I went
> > > through the schema and from what I understand there is no information
> > about
> > > page-to-page links, correct? Are there any resources that would provide
> > > such historical data?
> > >
> > > Best,
> > >
> > > *Giovanni Luca Ciampaglia* ∙ glciampaglia.com
> > > Assistant Professor
> > > Computer Science and Engineering
> > > <https://www.usf.edu/engineering/cse/> ∙ University
> > > of South Florida <https://www.usf.edu/>
> > >
> > > *Due to Florida’s broad open records law, email to or from university
> > > employees is public record, available to the public and the media upon
> > > request.*
> > >
> > >
> > > On Mon, Feb 10, 2020 at 11:28 AM Joseph Allemandou <
> > > jallemandou(a)wikimedia.org> wrote:
> > >
> > > > Hi Analytics People,
> > > >
> > > > The Wikimedia Analytics Team is pleased to announce the release of
> the
> > > most
> > > > complete dataset we have to date to analyze content and contributors
> > > > metadata: Mediawiki History [1] [2].
> > > >
> > > > Data is in TSV format, released monthly around the 3rd of the month
> > > > usually, and every new release contains the full history of metadata.
> > > >
> > > > The dataset contains an enhanced [3] and historified [4] version of
> > > > user, page, and revision metadata, and serves as the base for the
> > > > Wikistats API on edits, users, and pages [5] [6].
> > > >
> > > > We hope you will have as much fun playing with the data as we have
> > > building
> > > > it, and we're eager to hear from you [7], whether for issues, ideas
> or
> > > > usage of the data.
> > > >
> > > > Analytically yours,
> > > >
> > > > --
> > > > Joseph Allemandou (joal) (he / him)
> > > > Sr Data Engineer
> > > > Wikimedia Foundation
> > > >
> > > > [1] https://dumps.wikimedia.org/other/mediawiki_history/readme.html
> > > > [2]
> > > >
> > > >
> > >
> >
> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_his…
> > > > [3] Many pre-computed fields are present in the dataset, from
> > edit-counts
> > > > by user and page to reverts and reverted information, as well as time
> > > > between events.
> > > > [4] Historical usernames and page titles (as well as user groups and
> > > > blocks), as accurate as possible, are available in addition to current
> > > > values, and are provided in a denormalized way for every event in the
> > > > dataset.
> > > > [5] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
> > > > [6] https://wikimedia.org/api/rest_v1/
> > > > [7]
> > > >
> > > >
> > >
> >
> https://phabricator.wikimedia.org/maniphest/task/edit/?title=Mediawiki%20Hi…
> > --
> > Joseph Allemandou (joal) (he / him)
> > Sr Data Engineer
> > Wikimedia Foundation
--
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation
Hi Analytics People,
The Wikimedia Analytics Team is pleased to announce the release of the most
complete dataset we have to date to analyze content and contributors
metadata: Mediawiki History [1] [2].
Data is in TSV format, released monthly (usually around the 3rd of the
month), and every new release contains the full history of metadata.
The dataset contains an enhanced [3] and historified [4] version of user,
page, and revision metadata, and serves as the base for the Wikistats API
on edits, users, and pages [5] [6].
We hope you will have as much fun playing with the data as we had building
it, and we're eager to hear from you [7], whether about issues, ideas, or
uses of the data.
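If you want to peek at a dump file right away, a minimal pandas sketch
looks roughly like this (the file name is a placeholder, and you should
check the readme [1] and the field documentation [2] for the exact file
layout and column order before relying on it):

    import csv
    import pandas as pd

    # Sketch: sample one slice of the Mediawiki History dump.
    # The file name is a placeholder; header=None assumes no header row,
    # and column names should be assigned from the schema docs as a first step.
    df = pd.read_csv(
        "2020-01.enwiki.2019.tsv.bz2",  # placeholder file name
        sep="\t",
        header=None,
        quoting=csv.QUOTE_NONE,  # assume unquoted fields; check the readme
        nrows=100_000,           # sample only; the full files are large
    )
    print(df.shape)
    print(df.head())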
Analytically yours,
--
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation
[1] https://dumps.wikimedia.org/other/mediawiki_history/readme.html
[2]
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_his…
[3] Many pre-computed fields are present in the dataset, from edit-counts
by user and page to reverts and reverted information, as well as time
between events.
[4] Historical usernames and page titles (as well as user groups and
blocks), as accurate as possible, are available in addition to current
values, and are provided in a denormalized way for every event in the
dataset.
[5] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
[6] https://wikimedia.org/api/rest_v1/
[7]
https://phabricator.wikimedia.org/maniphest/task/edit/?title=Mediawiki%20Hi…