[Wikimedia Research Showcase] Machine Translation on Wikipedia - July 24 at 16:30 UTC - Analytics

19 Jul 2024


Hi all,
The next Research Showcase will be live-streamed next Wednesday, July 24,
at 9:30 AM PDT / 16:30 UTC. Find your local time here:
https://zonestamp.toolforge.org/1721838600. The theme for this showcase is
*Machine Translation on Wikipedia*.
You are welcome to watch via the YouTube stream:
https://www.youtube.com/live/O7AqvHgqUVk. As usual, you can join the
conversation in the YouTube chat as soon as the showcase goes live.
This month's presentations:

The Promise and Pitfalls of AI Technology in Bridging Digital Language Divide
By *Kai Zhu, Bocconi University*

Machine translation technologies have
the potential to bridge knowledge gaps across languages, promoting more
inclusive access to information regardless of native languages. This study
examines the impact of integrating Google Translate into Wikipedia's
Content Translation system in January 2019. Employing a natural experiment
design and difference-in-differences strategy, we analyze how this
translation technology shock influenced the dynamics of content production
and accessibility on Wikipedia across over a hundred languages. We find
that this technology integration leads to a 149% increase in content
production through translation, driven by existing editors becoming more
productive as well as an expansion of the editor base. Moreover, we observe
that machine translation enhances the propagation of biographical and
geographical information, helping to close these knowledge gaps in the
multilingual context. However, our findings also underscore the need for
continued efforts to mitigate the preexisting systemic barriers. Our study
contributes to our knowledge on the evolving role of artificial
intelligence in shaping knowledge dissemination through enhanced language
translation capabilities.

Implications of Using Inorganic Content in Arabic Wikipedia Editions
By *Saied Alshahrani and Jeanna Matthews, Clarkson University*

Wikipedia articles (content pages) are one of the widely
utilized training corpora for NLP tasks and systems, yet these articles are
not always created, generated, or even edited organically by native
speakers; some are automatically created, generated, or translated using
Wikipedia bots or off-the-shelf translation tools like Google Translate
without human revision or supervision. We first analyzed the three Arabic
Wikipedia editions, Arabic (AR), Egyptian Arabic (ARZ), and Moroccan Arabic
(ARY), and found that these Arabic Wikipedia editions suffer from a few
serious issues, like large-scale automatic creations and translations from
English to Arabic, all without human involvement, generating content
(articles) that lacks not only linguistic richness and diversity but also
cultural richness and meaningful representation of the
Arabic language and its native speakers. Second, we studied the performance
implications of using such inorganic, unrepresentative articles to train
NLP tasks or systems, where we intrinsically evaluated the performance of
two main NLP upstream tasks, namely word representation and language
modeling, using word analogy and fill-mask evaluations. We found that most
of the models trained on the organic and representative content
outperformed or, at worst, performed on par with the models trained with
inorganic content (generated using bots or translated using templates)
included, demonstrating that training on unrepresentative content not only
impacts the representation of native speakers but also impacts the
performance of NLP tasks or systems. We recommend against using
automatically created, generated, or translated articles on Wikipedia when
the task is a representation-based task, like measuring opinions,
sentiments, or perspectives of native speakers, and also suggest that when
registered users employ automated creation or translation, their
contributions should be marked differently than “registered user” for
better transparency; perhaps “registered user (automation-assisted)”.

Best,
Kinneret