ground truth for section alignment across languages - Wiki-research-l

24 Aug 2017

Hi all,

==Question==
Do you know of a dataset we can use as ground truth for aligning
sections of one article in two languages? I'm thinking a tool such as
Content Translation may capture this data somewhere, or there may be
some other community initiative that has matched a subset of the
sections between two versions of one article in two languages. Any
insights or directions are appreciated. :) I'm not going to worry yet
about which language pairs such a dataset covers; the first
question is: do we have anything? :)

==Context==
As part of the research we are doing to build recommendation systems
that can recommend sections (or templates) for already existing
Wikipedia articles, we are looking at the problem of section alignment
between languages, i.e., given two languages x and y and two versions
of article a in these two languages, can an algorithm (with relatively
high accuracy) tell us which section of the article in language x
corresponds to which section of the article in language y?
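
To make the intended use concrete: the ground truth could be as simple as
a set of matched section pairs for one article in two languages. The
sketch below is only an illustration of how such pairs might be used to
score an algorithm's output. The function name `evaluate_alignment`, the
pair format, and the section titles are all assumptions made up for this
example, not taken from any existing dataset or tool.

```python
def evaluate_alignment(predicted, ground_truth):
    """Score predicted section pairs against ground-truth pairs.

    Both arguments are iterables of (section_in_x, section_in_y) tuples.
    Returns (precision, recall, f1).
    """
    predicted = set(predicted)
    ground_truth = set(ground_truth)
    correct = len(predicted & ground_truth)  # pairs the algorithm got right

    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data for one article in languages x (en) and y (es).
ground_truth = {
    ("History", "Historia"),
    ("Geography", "Geografía"),
    ("Economy", "Economía"),
}
predicted = {
    ("History", "Historia"),
    ("Geography", "Economía"),  # a wrong match
}

precision, recall, f1 = evaluate_alignment(predicted, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact pair matching is the simplest possible choice here; a real
evaluation might need to handle one-to-many matches or sections that
have no counterpart in the other language.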

Thanks,
Leila

--
Leila Zia
Senior Research Scientist
Wikimedia Foundation