Hi all,

The next Research Showcase will be live-streamed Wednesday, March 16 at 6:30AM PT / 13:30 UTC. Find your local time here: https://zonestamp.toolforge.org/1647437436.

The theme is: Patterns and dynamics of article quality.

YouTube stream: https://www.youtube.com/watch?v=o5e6S7ac4q4

You can join the conversation on IRC at #wikimedia-research. You can also watch our past research showcases here: https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase.

The Showcase will feature the following talks:

Quality monitoring in Wikipedia - A computational perspective

By Animesh Mukherjee (Indian Institute of Technology, Kharagpur)

In this talk, I shall summarize the highlights of our five years of research on Wikipedia. In particular, I shall dive deep into two of our recent works: the first attempts to understand the early indications of which editors will soon go "missing" (aka missing editors) [1], while the second investigates how the quality of a Wikipedia article transitions over time and whether computational models can be built to understand the characteristics of future transitions [2]. In each case, I will present a suite of key results and the main insights we obtained from them.

[1] When expertise gone missing: Uncovering the loss of prolific contributors in Wikipedia, ICADL 2021 (pdf)

[2] Quality Change: norm or exception? Measurement, Analysis and Detection of Quality Change in Wikipedia, CSCW 2022 (pdf)

Automatically Labeling Low Quality Content on Wikipedia by Leveraging Editing Behaviors

By Sumit Asthana (University of Michigan, Ann Arbor)

Wikipedia articles aim to be definitive sources of encyclopedic content. Yet only 0.6% of Wikipedia articles are rated high quality on its quality scale, owing to an insufficient number of Wikipedia editors and an enormous number of articles. Supervised Machine Learning (ML) quality improvement approaches that can automatically identify and fix content issues rely on manual labels of individual Wikipedia sentence quality. However, current labeling approaches are tedious and produce noisy labels. In this talk, I will discuss an automated labeling approach that identifies the semantic category (e.g., adding citations, clarifications) of historic Wikipedia edits and uses the modified sentences, as they stood prior to the edit, as examples that require that semantic improvement. Sentences from the highest-rated articles serve as examples that no longer need semantic improvements. I will discuss the performance of models trained with this labeling approach compared with models trained with existing labeling approaches, and also the implications of such a large-scale semi-supervised labeling approach for capturing the editing practices of Wikipedia editors and helping them improve articles faster.

Emily Lescak (she / her)

Senior Research Community Officer

The Wikimedia Foundation