Hi Scott,
On Mon, Aug 28, 2017 at 2:01 AM, Scott Hale <computermacgyver@gmail.com> wrote:
Dear Leila,
==Question==
Do you know of a dataset we can use as ground truth for aligning sections of one article in two languages?
This question is super interesting to me. I am not aware of any ground truth data, but could imagine trying to build some from [[Template:Translated_page]]. At least on enwiki it has a "section" parameter that is to be set:
Nice! :) Thanks for sharing it. It is definitely worth looking into. I did a quick search across a few languages and the usage is limited (in es, around 600, for example), and once you start slicing and dicing it, the labels become too few. Still, we may be able to use it now or in the future.
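In case it is useful later, here is a rough, untested sketch of how one could pull the transclusions and their section parameters via the standard MediaWiki API (list=embeddedin plus prop=revisions) and mwparserfromhell; the template name differs per wiki and the helper functions are only illustrative:

# Rough sketch (untested): list pages that transclude Template:Translated page
# on enwiki and pull the value of the "section" parameter where it is set.
# Uses the standard API modules list=embeddedin and prop=revisions plus the
# mwparserfromhell library; the helper names are only illustrative.
import requests
import mwparserfromhell

API = "https://en.wikipedia.org/w/api.php"  # swap in the eswiki endpoint, etc.
TEMPLATE = "Template:Translated page"       # template names differ per wiki

def pages_transcluding(template, limit=50):
    """Yield titles of pages that transclude the given template."""
    params = {"action": "query", "format": "json",
              "list": "embeddedin", "eititle": template, "eilimit": limit}
    data = requests.get(API, params=params).json()
    for page in data["query"]["embeddedin"]:
        yield page["title"]

def section_parameter(title):
    """Return the template's 'section' parameter on this page, if any."""
    params = {"action": "query", "format": "json", "titles": title,
              "prop": "revisions", "rvprop": "content", "rvslots": "main"}
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    wikitext = page["revisions"][0]["slots"]["main"]["*"]
    for tpl in mwparserfromhell.parse(wikitext).filter_templates():
        if str(tpl.name).strip().lower() == "translated page" and tpl.has("section"):
            return str(tpl.get("section").value).strip()
    return None

for title in pages_transcluding(TEMPLATE, limit=10):
    print(title, "->", section_parameter(title))

Pointed at the eswiki endpoint and its local template name, the same loop would give a quick count of how many of those ~600 transclusions actually set the section parameter.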
==Context==
As part of the research we are doing to build recommendation systems that can recommend sections (or templates) for already existing Wikipedia articles, we are looking at the problem of section alignment between languages, i.e., given two languages x and y and two versions of article a in these two languages, can an algorithm (with relatively high accuracy) tell us which section of the article in language x corresponds to which section of the article in language y?
While I am not aware of research on Wikipedia section alignment per se, there is a good amount of work on sentence alignment and on building parallel/bilingual corpora that seems relevant to this [1-4]. I can imagine an approach that would look for near matches across two Wikipedia articles in different languages and then examine the distribution of these sentences within sections to see if one or more sections appear to be omitted. One challenge is the sub-article problem [5], with which of course you are already familiar. I wonder whether computing the overlap in article links a la Omnipedia [6] and then examining the distribution of these between sections would work and be much less computationally intensive. I fear, however, that this could over-identify sections further down an article as missing, given (I believe) that article links are often concentrated towards the beginning of an article.
exactly.
A side note: we are trying to stay away, as much as possible, from research/results that rely on NLP techniques, since introducing NLP usually translates relatively quickly into limits on which languages our methodologies can scale to. (A quick sketch of the link-overlap idea, which stays language-agnostic, is below.)
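To make that concrete, here is a minimal, untested sketch of the link-overlap idea: represent each section by the set of Wikidata items its wikilinks resolve to (so the comparison works across languages without NLP), then pair sections by Jaccard overlap. All names and the toy data are illustrative; extracting the per-section link sets is left out.

# A minimal, untested sketch of the link-overlap idea: each section is
# represented by the set of Wikidata items its wikilinks resolve to (how those
# sets are extracted is left out), and sections are paired by Jaccard overlap.
# All names and the toy data are illustrative.

def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def align_sections(sections_x, sections_y, threshold=0.1):
    """For each section in language x, pick the best-overlapping section in
    language y, or None if no candidate clears the threshold."""
    alignment = {}
    for name_x, items_x in sections_x.items():
        best, best_score = None, threshold
        for name_y, items_y in sections_y.items():
            score = jaccard(items_x, items_y)
            if score > best_score:
                best, best_score = name_y, score
        alignment[name_x] = best  # None = candidate for "missing in y"
    return alignment

# Toy data: section title -> Wikidata items of the links in that section.
en = {"History": {"Q1", "Q2", "Q3"}, "Geography": {"Q4", "Q5"}, "Economy": {"Q6", "Q7"}}
es = {"Historia": {"Q1", "Q2"}, "Geografía": {"Q4", "Q5", "Q8"}}

print(align_sections(en, es))
# {'History': 'Historia', 'Geography': 'Geografía', 'Economy': None}

Some normalization for sections with very few links would probably be needed, given the concentration issue you raise.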
Thanks again! :)
Leila
[1] Learning Joint Multilingual Sentence Representations with Neural Machine Translation. 2017. https://arxiv.org/abs/1704.04154
[2] Fast and Accurate Sentence Alignment of Bilingual Corpora. 2002. https://www.microsoft.com/en-us/research/publication/fast-and-accurate-sente...
[3] Large scale parallel document mining for machine translation. 2010. http://www.aclweb.org/anthology/C/C10/C10-1124.pdf
[4] Building Bilingual Parallel Corpora Based on Wikipedia. 2010. http://www.academia.edu/download/39073036/building_bilingual_parallel_corpor...
[5] Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia. 2017. http://www.brenthecht.com/publications/cscw17_subarticles.pdf
[6] Omnipedia: Bridging the Wikipedia Language Gap. 2012. http://www.brenthecht.com/papers/bhecht_CHI2012_omnipedia.pdf
Best wishes, Scott
--
Dr Scott Hale
Senior Data Scientist
Oxford Internet Institute, University of Oxford
Turing Fellow, Alan Turing Institute
http://www.scotthale.net/
scott.hale@oii.ox.ac.uk