Hi all,
The next Research Showcase, featuring the recipients of this year's
Wikimedia Foundation Research Awards of the Year, will be live-streamed
Wednesday, July 20, at 9:30 AM PDT/16:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1658334607>.
YouTube stream: https://www.youtube.com/watch?v=KMvXOQU5fX4
You are welcome to ask questions via YouTube chat or on IRC at
#wikimedia-research.
This month's presentations:
Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine
Learning
By *Krishna Srinivasan (Google)*

The milestone improvements brought
about by deep representation learning and pre-training techniques have led
to large performance gains across downstream NLP, IR and Vision tasks.
Multimodal modeling techniques aim to leverage large high-quality
visio-linguistic datasets for learning complementary information across
image and text modalities. In this talk, I introduce the Wikipedia-based
Image Text (WIT) Dataset to better facilitate multimodal, multilingual
learning. WIT is composed of a curated set of 37.5 million entity-rich
image-text examples with 11.5 million unique images across 108 Wikipedia
languages.
WIT’s unique advantages include:
- WIT is the largest multimodal dataset by number of image-text examples,
3x larger than the next largest (at the time of writing).
- WIT is massively multilingual, the first of its kind, with coverage of
100+ languages.
- WIT represents a more diverse set of concepts and real-world entities
than previous datasets cover.
WIT Dataset is available for download and use via a Creative Commons
license here: https://github.com/google-research-datasets/wit
I conclude the talk with future directions to expand and extend the WIT
dataset.

Link to paper: https://arxiv.org/pdf/2103.01913.pdf
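If you would like to poke at the data before the talk, here is a minimal
Python sketch for exploring one shard. It assumes the gzipped TSV layout
documented in the GitHub repo; the shard file name below is illustrative,
not verified here.

    # Count examples per language in one WIT shard.
    # Assumes the gzipped TSV layout from the repo's documentation; the
    # file name and the "language" column are taken from that
    # documentation, not verified here.
    import csv
    import gzip
    from collections import Counter

    # WIT rows contain long text fields, so raise the csv field limit.
    csv.field_size_limit(10 ** 7)

    counts = Counter()
    with gzip.open("wit_v1.train.all-00000-of-00010.tsv.gz", "rt",
                   encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            counts[row["language"]] += 1

    # Print the ten best-covered languages in this shard.
    for lang, n in counts.most_common(10):
        print(lang, n)
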
Assessing the Quality of Sources in Wikidata Across Languages
By *Gabriel Amaral (King's College London)*

Wikidata is one of the most important
sources of structured data on the web, built by a worldwide community of
volunteers. As a secondary source, its contents must be backed by credible
references; this is particularly important as Wikidata explicitly
encourages editors to add claims for which there is no broad consensus, as
long as they are corroborated by references. Nevertheless, despite this
essential link between content and references, Wikidata’s ability to
systematically assess and assure the quality of its references remains
limited. To address this, we carry out a mixed-methods study to determine the
relevance, ease of access, and authoritativeness of Wikidata references, at
scale and in different languages, using online crowdsourcing, descriptive
statistics, and machine learning. The findings help us ascertain the
quality of references in Wikidata, and identify common challenges in
defining and capturing the quality of user-generated multilingual
structured data on the web. Link to paper:
https://dl.acm.org/doi/abs/10.1145/3484828
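For anyone curious about the underlying data model ahead of the talk, here
is a small, hypothetical Python sketch (not code from the paper) that
samples reference URLs from the public Wikidata Query Service, using the
standard prov:wasDerivedFrom / pr:P854 reference structure:

    # Sample a few statement references that carry a reference URL
    # (pr:P854) from the public Wikidata Query Service. The query and
    # limit are illustrative only.
    import requests

    QUERY = """
    SELECT ?statement ?refURL WHERE {
      ?statement prov:wasDerivedFrom ?ref .
      ?ref pr:P854 ?refURL .
    }
    LIMIT 20
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "reference-sample-demo/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    for b in resp.json()["results"]["bindings"]:
        print(b["statement"]["value"], b["refURL"]["value"])
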
You can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Emily, on behalf of the Research team
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation
Hello,
I am one of the test engineers on the QTE team.
There is a plan to migrate the MediaWiki software on production to
Kubernetes.
In preparation for this, we will be migrating test2wiki to Kubernetes
first so that QTE can test it and catch any bugs before the wider
roll-out.
I am trying to identify areas of our software for which the migration to
Kubernetes might pose a risk.
I wonder whether any of the software you are responsible for falls
into that category. In particular, I am thinking about the places where
MediaWiki interacts with other services in our ecosystem; I don't know
enough about this area to make an informed judgement myself.
Any ideas about what might be risky and in need of testing, and how one
might go about testing it on test2wiki
(https://test2.wikipedia.org/wiki/Main_Page), would be of great help to
me.
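
To make the request concrete, here is the kind of quick smoke test I have
in mind; the specific API checks below are just my assumption of a
starting point, not an agreed test plan:

    # Rough smoke-test sketch against test2wiki's action API: confirm
    # the basic request path and the parser respond after migration.
    import requests

    API = "https://test2.wikipedia.org/w/api.php"
    HEADERS = {"User-Agent": "k8s-migration-smoketest/0.1"}

    # meta=siteinfo exercises the basic MediaWiki request path.
    r = requests.get(API, params={"action": "query", "meta": "siteinfo",
                                  "format": "json"},
                     headers=HEADERS, timeout=30)
    r.raise_for_status()
    print("siteinfo:", r.json()["query"]["general"]["sitename"])

    # action=parse exercises the parser and the services behind it.
    r = requests.get(API, params={"action": "parse", "page": "Main Page",
                                  "format": "json"},
                     headers=HEADERS, timeout=30)
    r.raise_for_status()
    print("parse:", "parse" in r.json())
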
Let me know if you have any questions.
Thank you,
Dom