Dear Leila,
==Question==
Do you know of a dataset we can use as ground truth for aligning
sections of one article in two languages?
This question is super interesting to me. I am not aware of any ground
truth data, but could imagine trying to build some from
[[Template:Translated_page]]. At least on enwiki it has a "section"
parameter that is to be set:
If the inserted translation is contained in one section of the target
page, insert its name here. (A direct link to that section will be created.)
It also has a "version" parameter, and it might be possible to identify
cases where a section was added to the source after the translation was
made. This could then become a corpus to "learn the missing section". I
guess something similar could be done with articles created with the
Content Translation tool where a section was later added to the source.
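As a rough illustration of the first step, here is a minimal Python sketch that pulls the "version" and "section" parameters out of talk-page wikitext. The function name parse_translated_page is hypothetical, and I am assuming the template is written with the source language code and source title as the first two positional parameters; the regular expression does not handle nested templates or titles containing "=", so a proper wikitext parser would be safer in practice.

```python
import re

# Matches {{Translated page|...}} with no nested templates inside
# (an assumption made to keep the sketch short).
TEMPLATE_RE = re.compile(
    r"\{\{\s*Translated[ _]page\s*\|([^{}]*)\}\}", re.IGNORECASE
)


def parse_translated_page(wikitext):
    """Return one record per Translated page template found in wikitext."""
    records = []
    for match in TEMPLATE_RE.finditer(wikitext):
        positional, named = [], {}
        for part in match.group(1).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                named[key.strip()] = value.strip()
            else:
                positional.append(part.strip())
        records.append({
            # Assumed positional layout: language code, then source title.
            "source_lang": positional[0] if len(positional) > 0 else None,
            "source_title": positional[1] if len(positional) > 1 else None,
            # Empty parameters are treated as missing.
            "version": named.get("version") or None,
            "section": named.get("section") or None,
        })
    return records


sample = "{{Translated page|fr|Exemple|version=12345|section=History}}"
records = parse_translated_page(sample)
```

Records with both a "version" and a "section" would be the candidates for the corpus: the source revision history after that version could then be checked for sections added later.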
==Context==
As part of the research we are doing to build recommendation systems
that can recommend sections (or templates) for already existing
Wikipedia articles, we are looking at the problem of section alignment
between languages, i.e., given two languages x and y and two versions
of article a in these two languages, can an algorithm (with relatively
high accuracy) tell us which section in the article in language x
corresponds to which other section in the article in language y?
While I am not aware of research on Wikipedia section alignment per se,
there is a good amount of work on sentence alignment and building
parallel/bilingual corpora that seems relevant to this [1-4]. I can
imagine an approach that would look for near matches across two Wikipedia
articles in different languages and then examine the distribution of these
sentences within sections to see if one or more sections looked to be
omitted. One challenge is the sub-article problem [5], with which of course
you are already familiar. I wonder whether computing the overlap in article
links a la Omnipedia [6] and then examining the distribution of these
between sections would work and be much less computationally intensive. I
fear, however, that this could over-identify sections further down an
article as missing given (I believe) that article links are often
concentrated towards the beginning of an article.
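To make the link-overlap idea concrete, here is a minimal Python sketch. It assumes the links in both articles have already been mapped to language-independent concept identifiers (the Q-numbers below are made-up placeholders), and the function name section_link_coverage is hypothetical. For each section of the article in language x, it reports the share of that section's links that also appear anywhere in the article in language y.

```python
def section_link_coverage(sections_x, links_y):
    """Share of each section's links that also occur in the other article.

    sections_x maps a section name to the set of concept ids linked from it;
    links_y is the set of concept ids linked anywhere in the other article.
    """
    coverage = {}
    for name, links in sections_x.items():
        if not links:
            # A section without links gives no evidence either way.
            coverage[name] = None
            continue
        coverage[name] = len(links & links_y) / len(links)
    return coverage


# Placeholder data for illustration only.
sections_x = {
    "Lead": {"Q1", "Q2"},
    "History": {"Q3", "Q4"},
    "Notes": set(),
}
links_y = {"Q1", "Q2", "Q3"}
coverage = section_link_coverage(sections_x, links_y)
```

Sections with low coverage would be the candidates for "missing" in language y. Returning None for link-free sections is one way to avoid the bias mentioned above: a section far down the article with few or no links is flagged as undecided rather than as missing.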
[1] Learning Joint Multilingual Sentence Representations with Neural
Machine Translation. 2017.
https://arxiv.org/abs/1704.04154
[2] Fast and Accurate Sentence Alignment of Bilingual Corpora. 2002.
https://www.microsoft.com/en-us/research/publication/fast-and-accurate-sent…
[3] Large scale parallel document mining for machine translation. 2010.
http://www.aclweb.org/anthology/C/C10/C10-1124.pdf
[4] Building Bilingual Parallel Corpora Based on Wikipedia. 2010.
http://www.academia.edu/download/39073036/building_bilingual_parallel_corpo…
[5] Problematizing and Addressing the Article-as-Concept Assumption in
Wikipedia. 2017.
http://www.brenthecht.com/publications/cscw17_subarticles.pdf
[6] Omnipedia: Bridging the Wikipedia Language Gap. 2012.
http://www.brenthecht.com/papers/bhecht_CHI2012_omnipedia.pdf
Best wishes,
Scott
--
Dr Scott Hale
Senior Data Scientist
Oxford Internet Institute, University of Oxford
Turing Fellow, Alan Turing Institute
http://www.scotthale.net/
scott.hale(a)oii.ox.ac.uk