Hi all,

The next Research Showcase, featuring the recipients of this year's Wikimedia Foundation Research Awards of the Year, will be live-streamed Wednesday, July 20, at 9:30 AM PDT/16:30 UTC. Find your local time here

YouTube stream: https://www.youtube.com/watch?v=KMvXOQU5fX4 

You are welcome to ask questions via YouTube chat or on IRC at #wikimedia-research. 

This month's presentations: 

Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
By Krishna Srinivasan (Google)
The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large high-quality visio-linguistic datasets for learning complementary information across image and text modalities. In this talk, I introduce the Wikipedia-based Image Text (WIT) Dataset to better facilitate multimodal, multilingual learning. WIT is composed of a curated set of 37.5 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages.

WIT’s unique advantages include the following. It is the largest multimodal dataset by number of image-text examples, by a factor of 3 (at the time of writing). It is massively multilingual (the first of its kind), covering more than 100 languages. It represents a more diverse set of concepts and real-world entities than previous datasets cover.

The WIT dataset is available for download and use under a Creative Commons license here: https://github.com/google-research-datasets/wit

I conclude the talk with future directions to expand and extend the WIT dataset. Link to paper: https://arxiv.org/pdf/2103.01913.pdf

Assessing the Quality of Sources in Wikidata Across Languages
By Gabriel Amaral (King's College London)
Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important as Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata’s ability to systematically assess and assure the quality of its references remains limited. To this end, we carry out a mixed-methods study to determine the relevance, ease of access, and authoritativeness of Wikidata references, at scale and in different languages, using online crowdsourcing, descriptive statistics, and machine learning. The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web. Link to paper: https://dl.acm.org/doi/abs/10.1145/3484828

You can also watch our past research showcases here: https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase


Emily, on behalf of the Research team


--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation