Hoi, Sorry to state the obvious (for me). We datamine Wikipedias for statements in Wikidata. Consequently, much information that could or should be in an article (in any and all languages) is reflected in Wikidata. There is much that is not found in every language, and information on some subjects can easily be provided from Wikidata as a list (think awards, books published, etc.). The good news is that Wikidata will provide lists for this purpose. For all other topics, such as date and place of birth or death, where people studied, etc., you have the benefit of existing articles in a Wikipedia and the work done at Wikidata.
Hope this helps. Thanks, GerardM
On 24 August 2017 at 19:56, Leila Zia leila@wikimedia.org wrote:
Hi all,
==Question== Do you know of a dataset we can use as ground truth for aligning sections of one article in two languages? I'm thinking a tool such as Content Translation may capture this data somewhere, or there may be some other community initiative that has matched a subset of the sections between two versions of one article in two languages. Any insights or directions are appreciated. :) I'm not going to worry right now about which language pairs we have this dataset for; the first question is: do we have anything? :)
==Context== As part of the research we are doing to build recommendation systems that can recommend sections (or templates) for already existing Wikipedia articles, we are looking at the problem of section alignment between languages, i.e., given two languages x and y and two versions of article a in these two languages, can an algorithm (with relatively high accuracy) tell us which section of the article in language x corresponds to which section of the article in language y?
Thanks, Leila
-- Leila Zia Senior Research Scientist Wikimedia Foundation
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l