Hi all,
==Question==
Do you know of a dataset we can use as ground truth for aligning sections of one article across two languages? I'm thinking a tool such as Content Translation may capture this data somewhere, or there may be some other community initiative that has matched a subset of the sections between two language versions of the same article. Any insights or directions are appreciated. :) I'm not going to worry right now about which language pairs such a dataset covers; the first question is: do we have anything? :)
==Context==
As part of our research on building recommendation systems that can recommend sections (or templates) for already existing Wikipedia articles, we are looking at the problem of section alignment between languages, i.e., given two languages x and y and two versions of article a in these two languages, can an algorithm tell us (with relatively high accuracy) which section of the article in language x corresponds to which section of the article in language y?
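To make the task concrete, here is a minimal sketch of how a ground-truth dataset of this kind could be used to score an alignment algorithm. The data format (a mapping from section titles in language x to section titles in language y), the example section titles, and the function name `alignment_accuracy` are all illustrative assumptions, not an existing dataset or tool.

```python
def alignment_accuracy(predicted, ground_truth):
    """Fraction of ground-truth sections in language x whose predicted
    counterpart in language y matches the ground-truth counterpart."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(
        1
        for section_x, section_y in ground_truth.items()
        if predicted.get(section_x) == section_y
    )
    return correct / len(ground_truth)


# Hypothetical ground truth for one article a: English section -> Spanish section.
ground_truth = {
    "History": "Historia",
    "Geography": "Geografía",
    "Economy": "Economía",
    "Culture": "Cultura",
}

# Hypothetical output of an alignment algorithm for the same article.
predicted = {
    "History": "Historia",
    "Geography": "Geografía",
    "Economy": "Cultura",  # wrong match
    "Culture": "Cultura",
}

accuracy = alignment_accuracy(predicted, ground_truth)
print(accuracy)  # 0.75
```

A one-to-one mapping is the simplest possible format; a real dataset might also need to represent sections with no counterpart, or one section that maps to several.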
Thanks,
Leila

--
Leila Zia
Senior Research Scientist
Wikimedia Foundation