Wikimedia Research Showcase July 20 - Wiki-research-l

14 Jul 2022


Hi all,
The next Research Showcase, featuring the recipients of this year's
Wikimedia Foundation Research Awards of the Year, will be live-streamed
Wednesday, July 20, at 9:30 AM PDT/16:30 UTC. Find your local time here:
https://zonestamp.toolforge.org/1658334607
YouTube stream: https://www.youtube.com/watch?v=KMvXOQU5fX4
You are welcome to ask questions via YouTube chat or on IRC at
#wikimedia-research.
This month's presentations:
Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
By *Krishna Srinivasan (Google)*
The milestone improvements brought
about by deep representation learning and pre-training techniques have led
to large performance gains across downstream NLP, IR and Vision tasks.
Multimodal modeling techniques aim to leverage large high-quality
visio-linguistic datasets for learning complementary information across
image and text modalities. In this talk, I introduce the Wikipedia-based
Image Text (WIT) Dataset to better facilitate multimodal, multilingual
learning. WIT is composed of a curated set of 37.5 million entity-rich
image-text examples with 11.5 million unique images across 108 Wikipedia
languages.
WIT’s unique advantages include:
- WIT is the largest multimodal dataset by number of image-text examples,
  by a factor of 3 (at the time of writing).
- WIT is massively multilingual (first of its kind), covering more than
  100 languages.
- WIT represents a more diverse set of concepts and real-world entities
  than previous datasets cover.
The WIT dataset is available for download and use under a Creative Commons
license here: https://github.com/google-research-datasets/wit
I conclude the talk with future directions to expand and extend the WIT
dataset.
Link to paper: https://arxiv.org/pdf/2103.01913.pdf
Assessing the Quality of Sources in Wikidata Across Languages
By *Gabriel Amaral (King's College London)*
Wikidata is one of the most important
sources of structured data on the web, built by a worldwide community of
volunteers. As a secondary source, its contents must be backed by credible
references; this is particularly important as Wikidata explicitly
encourages editors to add claims for which there is no broad consensus, as
long as they are corroborated by references. Nevertheless, despite this
essential link between content and references, Wikidata’s ability to
systematically assess and assure the quality of its references remains
limited. To this end, we carry out a mixed-methods study to determine the
relevance, ease of access, and authoritativeness of Wikidata references, at
scale and in different languages, using online crowdsourcing, descriptive
statistics, and machine learning. The findings help us ascertain the
quality of references in Wikidata, and identify common challenges in
defining and capturing the quality of user-generated multilingual
structured data on the web.
Link to paper: https://dl.acm.org/doi/abs/10.1145/3484828
You can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Emily, on behalf of the Research team
-- 
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation