Hello everyone,
The next Research Showcase, focused on improving knowledge integrity in Wikimedia projects, will be live-streamed Wednesday, July 19, at 9:30 AM PDT / 16:30 UTC. Find your local time here.
The event is on the WMF Staff Calendar.
YouTube stream: https://youtube.com/live/_8DevIsi44s?feature=share
You can join the conversation on IRC at #wikimedia-research. You can also watch our past research showcases here: https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
This month's presentations:
- Assessment of Reference Quality on Wikipedia
- By Aitolkyn Baigutanova, KAIST
- In this talk, I will present our research on the reliability of Wikipedia through the lens of its references. I will primarily discuss our paper on the longitudinal assessment of reference quality on English Wikipedia, where we operationalize the notion of reference quality by defining reference need (RN), i.e., the percentage of sentences missing a citation, and reference risk (RR), i.e., the proportion of non-authoritative references. I will share our research findings on two key aspects: (1) the evolution of reference quality over a 10-year period and (2) factors that affect reference quality. We discover that the RN score has dropped by 20 percentage points, with more than half of verifiable statements now accompanied by references. The RR score has remained below 1% over the years as a result of the community's efforts to eliminate unreliable references. As an extension of this work, we explore how community initiatives, such as the perennial source list, help maintain reference quality across multiple language editions of Wikipedia. We hope our work encourages more active discussions within Wikipedia communities to improve the reference quality of their content.
- Paper: Aitolkyn Baigutanova, Jaehyeon Myung, Diego Saez-Trumper, Ai-Jou Chou, Miriam Redi, Changwook Jung, and Meeyoung Cha. 2023. Longitudinal Assessment of Reference Quality on Wikipedia. In Proceedings of the ACM Web Conference 2023 (WWW '23). Association for Computing Machinery, New York, NY, USA, 2831–2839.
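The two metrics in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation; the function names and the toy counts are hypothetical, and the real paper derives these quantities from parsed revision histories rather than raw counts.

```python
def reference_need(missing_citation: int, total_sentences: int) -> float:
    """RN: percentage of (citation-worthy) sentences missing a citation.
    Hypothetical helper illustrating the metric's definition only."""
    if total_sentences == 0:
        return 0.0
    return 100.0 * missing_citation / total_sentences


def reference_risk(non_authoritative: int, total_references: int) -> float:
    """RR: proportion of references drawn from non-authoritative sources.
    Hypothetical helper illustrating the metric's definition only."""
    if total_references == 0:
        return 0.0
    return non_authoritative / total_references


# Toy example: 40 of 200 sentences lack a citation,
# and 3 of 500 references are non-authoritative.
print(reference_need(40, 200))   # 20.0 (percent)
print(reference_risk(3, 500))    # 0.006
```

Under these definitions, a falling RN and a near-zero RR together indicate improving referencing, which is the trend the talk reports for English Wikipedia.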
- Multilingual approaches to support knowledge integrity in Wikipedia
- By Diego Saez-Trumper & Pablo Aragón, Wikimedia Foundation
- Knowledge integrity in Wikipedia is key to ensuring the quality and reliability of information. For that reason, editors devote a substantial amount of their time to patrolling tasks in order to detect low-quality or misleading content. In this talk we will cover recent multilingual approaches to support knowledge integrity. First, we will present a novel design of a system aimed at assisting Wikipedia communities in addressing vandalism. This system was built by collecting a massive dataset covering multiple languages and then applying advanced filtering and feature engineering techniques, including multilingual masked language modeling, to build the training dataset from human-generated data. Second, we will showcase the Wikipedia Knowledge Integrity Risk Observatory, a dashboard that relies on a language-agnostic version of the former system to monitor high-risk content in hundreds of Wikipedia language editions. We will conclude with a discussion of different challenges to be addressed in future work.
- Trokhymovych, M., Aslam, M., Chou, A. J., Baeza-Yates, R., & Saez-Trumper, D. (2023). Fair multilingual vandalism detection system for Wikipedia. arXiv preprint arXiv:2306.01650. https://arxiv.org/pdf/2306.01650.pdf
- Aragón, P., & Sáez-Trumper, D. (2021). A preliminary approach to knowledge integrity risk assessment in Wikipedia projects. arXiv preprint arXiv:2106.15940.
Best,
Kinneret
--