Languages March 2024

languages@lists.wikimedia.org

2 participants
1 discussions

Natural Language Processing Project using Wikidata for Minority Languages
by Sophie Fitzpatrick 27 Mar '24

Hello Wikimedians,

My name is Sophie and I am the Project and Communications Manager at Wikimedia Community Ireland. I am reaching out to draw your attention to an interesting Natural Language Processing project at DCU, here in Ireland, that uses Wikidata to generate content in different languages. We have recently been collaborating with Simon Mille from the Adapt Centre <https://www.adaptcentre.ie/>, and I thought it would be good to make some connections with members of the wider Wiki community who are interested in or involved with AI.

Please let me introduce the project below. If you would like to learn more or connect with Simon, I would be delighted to introduce you.

Kind regards,
Sophie Fitzpatrick

*Project description:*

At DCU-NLG <https://dcu-nlg.github.io/>, one of our main research topics is the automatic generation of text from structured data. We work with structured repositories such as DBpedia and Wikidata (among other resources), which contain millions of triples that can be used to generate texts about targeted entities in a particular language.

Many techniques exist for generating text from triple sets; the most famous (and probably the best) is prompting a GPT model. However, closed-source models such as the GPT series have some important drawbacks: they are very resource-hungry, they are not easily controllable, and they do not give researchers access to their code.

At DCU-NLG, we develop open-source systems that aim to address these issues in the domain of Natural Language Generation. We build (i) generators based on Large Language Models (LLMs), which can achieve very high-quality results but still require a large amount of energy to work, (ii) fully rule-based systems, which are extremely energy-efficient but struggle to reach the quality level of LLMs, and (iii) hybrid systems, which aim to combine the strengths of LLMs, rule-based systems and neural systems.
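
As a rough illustration of the rule-based approach in (ii), the sketch below turns a small set of triples into a short text using hand-written sentence templates. It is not the DCU-NLG implementation: the `Triple` type, the `TEMPLATES` table, the predicate names and the `generate_seed_text` function are all hypothetical, and a real system would also need to handle morphology, localised entity labels and sentence aggregation.

```python
from typing import Iterable, NamedTuple


class Triple(NamedTuple):
    """One (subject, predicate, object) statement, as found in DBpedia or Wikidata."""
    subject: str
    predicate: str
    obj: str


# Hypothetical hand-written rules: one sentence template per (language, predicate).
TEMPLATES = {
    ("en", "birthPlace"): "{subject} was born in {obj}.",
    ("es", "birthPlace"): "{subject} nació en {obj}.",
}


def generate_seed_text(triples: Iterable[Triple], language: str) -> str:
    """Verbalise every triple that has a template for the requested language."""
    sentences = []
    for triple in triples:
        template = TEMPLATES.get((language, triple.predicate))
        if template is None:
            # No rule for this predicate/language pair: skip it rather than guess.
            continue
        sentences.append(template.format(subject=triple.subject, obj=triple.obj))
    return " ".join(sentences)


# Illustrative input; entity labels are not localised in this sketch.
triples = [
    Triple("Samuel Beckett", "birthPlace", "Dublin"),
    Triple("Samuel Beckett", "award", "Nobel Prize in Literature"),
]
print(generate_seed_text(triples, "en"))
```

Because every sentence comes from exactly one input triple, each statement in the output can be traced back to its source, and triples without a matching rule are simply left out.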
We are also interested in the real-world use of these systems, and are currently building a tool that could help people write Wikipedia articles: we are designing an interface that, given an entity and a language, returns small seed texts generated using several of the techniques mentioned above, always using DBpedia or Wikidata information to ensure the traceability of the source. People can then use these seed texts as a starting point for editing a new Wikipedia page.

*Some resources:*

- RTE brainstorm article <https://www.rte.ie/brainstorm/2023/1206/1420417-gaeilge-irish-translation-a…> (by the way, it's funny how they use the word "translate" in their title, knowing the time I spend talking about how NLG is not translation xD)
- Papers from our group about using GPT <https://aclanthology.org/2023.mmnlg-1.9/> and a rule-based system <https://aclanthology.org/2023.pandl-1.4/> for the generation of Irish text from DBpedia.
- The GEM shared task <https://gem-benchmark.com/shared_task> about generation from DBpedia and Wikidata, which I co-organise.

-- 
Sophie Fitzpatrick
*Project and Communications Manager*
Pobal Wikimedia na hÉireann | Wikimedia Community Ireland
https://wikimedia.ie/
