[Wiki-research-l] Generation of Wikipedia Summaries from Wikidata in Underserved Languages using Deep Learning

4 Apr 2018


Wikimedia as a movement has over the years given consideration to small
language Wikipedias.
I would like to point you to a recent study that I have been pursuing
together with Hady Elsahar of the Université de Lyon and Pavlos
Vougiouklis of the University of Southampton, and which has recently
resulted in accepted publications.
My research interest involves mainly underserved languages on Wikidata and
Wikipedia, and how we can support them better.
One of the ways to support small Wikipedias was the ArticlePlaceholder [1].
The idea is to use the existing multilingual information in Wikidata [2]
and display it in a reader-friendly way on Wikipedia in the respective
language (if a Wikidata label exists in this language).
However, at the moment the data is shown only in tabular form, which is
not very reader-friendly and might not be the ideal way to engage editors
to work on the articles.
Therefore, we worked on producing sentences from the information on
Wikidata in the given language. We trained a neural network model; the
details can be found in the preprint of the NAACL paper here:
https://arxiv.org/abs/1803.07116
Given the promising results of the approach using our neural network, we
extended the work to see how we could fit this text generation into the
existing ArticlePlaceholder and tested it with the Esperanto and Arabic
Wikipedia communities. The ESWC paper preprint for this work can be found
here:
https://2018.eswc-conferences.org/wp-content/uploads/2018/02/ESWC2018_paper_...
We show that our approach is feasible for generating text from Wikidata for
Wikipedia. Editors tend to reuse the sentences, which shows it can be a
good encouragement to create full articles from those summaries.
We would like to implement the work in a test Wikipedia to see if
communities are interested in adopting the technology on a large scale in
their Wikipedias.
Furthermore, we would love to hear your input: Do you believe one-sentence
summaries are enough, or can we serve the communities' needs better with
more than one sentence? Is this still true if longer abstracts would be of
lower text quality? What other interesting use cases for such a technology
in the Wikimedia world can you imagine? And especially if you are part of
an underserved language Wikipedia community, what is your opinion of the
project?
[1] https://www.mediawiki.org/wiki/Extension:ArticlePlaceholder and
https://commons.wikimedia.org/wiki/File:Generating_Article_Placeholders_from...
[2]
https://eprints.soton.ac.uk/413433/1/Open_Sym_Short_Paper_Wikidata_Multiling...
-- 
Lucie-Aimée Kaffee
Web and Internet Science Group
School of Electronics and Computer Science
University of Southampton
