Hi all,
The next Research Showcase will be live-streamed next Wednesday, July 24, at 9:30 AM PDT / 16:30 UTC. Find your local time here: https://zonestamp.toolforge.org/1721838600. The theme for this showcase is *Machine Translation on Wikipedia*.
You are welcome to watch via the YouTube stream: https://www.youtube.com/live/O7AqvHgqUVk. As usual, you can join the conversation in the YouTube chat as soon as the showcase goes live.
This month's presentations:

The Promise and Pitfalls of AI Technology in Bridging Digital Language Divide
By *Kai Zhu, Bocconi University*
Machine translation technologies have the potential to bridge knowledge gaps across languages, promoting more inclusive access to information regardless of native languages. This study examines the impact of integrating Google Translate into Wikipedia's Content Translation system in January 2019. Employing a natural experiment design and difference-in-differences strategy, we analyze how this translation technology shock influenced the dynamics of content production and accessibility on Wikipedia across over a hundred languages. We find that this technology integration leads to a 149% increase in content production through translation, driven by existing editors becoming more productive as well as an expansion of the editor base. Moreover, we observe that machine translation enhances the propagation of biographical and geographical information, helping to close these knowledge gaps in the multilingual context. However, our findings also underscore the need for continued efforts to mitigate the preexisting systemic barriers. Our study contributes to our knowledge on the evolving role of artificial intelligence in shaping knowledge dissemination through enhanced language translation capabilities.

Implications of Using Inorganic Content in Arabic Wikipedia Editions
By *Saied Alshahrani and Jeanna Matthews, Clarkson University*
Wikipedia articles (content pages) are one of the widely utilized training corpora for NLP tasks and systems, yet these articles are not always created, generated, or even edited organically by native speakers; some are automatically created, generated, or translated using Wikipedia bots or off-the-shelf translation tools like Google Translate without human revision or supervision.
We first analyzed the three Arabic Wikipedia editions, Arabic (AR), Egyptian Arabic (ARZ), and Moroccan Arabic (ARY), and found that these Arabic Wikipedia editions suffer from a few serious issues, like large-scale automatic creations and translations from English to Arabic, all without human involvement, generating content (articles) that lack not only linguistic richness and diversity but also content that lacks cultural richness and meaningful representation of the Arabic language and its native speakers. We second studied the performance implications of using such inorganic, unrepresentative articles to train NLP tasks or systems, where we intrinsically evaluated the performance of two main NLP upstream tasks, namely word representation and language modeling, using word analogy and fill-mask evaluations. We found that most of the models trained on the organic and representative content outperformed or, at worst, performed on par with the models trained with inorganic content generated using bots or translated using templates included, demonstrating that training on unrepresentative content not only impacts the representation of native speakers but also impacts the performance of NLP tasks or systems. We recommend avoiding utilizing the automatically created, generated, or translated articles on Wikipedia when the task is a representation-based task, like measuring opinions, sentiments, or perspectives of native speakers, and also suggest that when registered users employ automated creation or translation, their contributions should be marked differently than “registered user” for better transparency; perhaps “registered user (automation-assisted)”.

Best,
Kinneret
Hello all,
Quick reminder that we will be starting our monthly Research Showcase on *Machine Translation on Wikipedia* in 30 minutes. Join us at https://www.youtube.com/live/O7AqvHgqUVk.
Best, Kinneret
On Fri, Jul 19, 2024 at 3:12 PM Kinneret Gordon kgordon@wikimedia.org wrote: