Re: [Wiki-research-l] ground truth for section alignment across languages - Wiki-research-l

30 Aug 2017

Hi Leila,

...
 From the top of my head, I can think of this paper only
I've read a while ago:
https://eprints.soton.ac.uk/403386/1/tweb_gottschalk_demidova_multiwiki.pdf

I assume what is to be considered is the (lack of) content overlap of
articles in different languages in general as of, for example,
http://dl.acm.org/citation.cfm?id=1753370 which also compares different
language Wikipedias but more in the sense of completeness.

Sounds like interesting work, looking forward to seeing what you come up
with!

All the best,

Lucie

On 30 August 2017 at 00:13, Leila Zia &lt;leila(a)wikimedia.org&gt; wrote:

...
  Hi Scott,

 On Mon, Aug 28, 2017 at 2:01 AM, Scott Hale &lt;computermacgyver(a)gmail.com&gt;
 wrote:
  Dear Leila,

 ==Question==
  Do you know of a dataset we can use as ground
truth for aligning
 sections of one article in two languages?

 This question is super interesting to me. I am not aware of any ground
 truth data, but could imagine trying to build some from
 [[Template:Translated_page]]. At least on enwiki it has a "section"
 parameter that is to be set: 
 nice! :) Thanks for sharing it. It is definitely worth looking into
 it. I did some search across a few languages and the usage of it is
 limited, in es, around 600, for example and once you start slice and
 dicing it, the labels become too few. but still, we may be able to use
 it now or in the future.

==Context==
 As part of the research we are doing to build recommendation systems
 that can recommend sections (or templates) for already existing
 Wikipedia articles, we are looking at the problem of section alignment
 between languages, i.e., given two languages x and y and two version
 of article a in these two languages, can an algorithm (with relatively
 high accuracy) tell us which section in the article in language x
 correspond to which other section in the article in language y?

 While I am not aware of research on Wikipedia section alignment per se,
 there is a good amount of work on sentence alignment and building
 parallel/bilingual corpora that seems relevant to to this [1-4]. I can
 imagine an approach that would look for near matches across two Wikipedia
 articles in different languages and then examine the distribution of  these
  sentences within sections to see if one or more
sections looked to be
 omitted. One challenge is the sub-article problem [5], which of course  you
  are already familiar. I wonder whether computing
the overlap in article
 links a la Omnipedia [6] and then examining the distribution of these
 between sections would work and be much less computationally intensive. I
 fear, however, that this could over identify sections further down an
 article as missing given (I believe) that article links are often
 concentrated towards the beginning of an article. 
 exactly.

 a side note: we are trying to stay away, as much as possible, from
 research/results that rely on NLP techniques as the introduction of
 NLP will usually translate relatively quickly to limitations on what
 languages our methodologies can scale to.

 Thanks, again! :)

 Leila

 [1] Learning Joint Multilingual Sentence Representations with Neural
 Machine Translation. 2017
 https://arxiv.org/abs/1704.04154

 [2] Fast and Accurate Sentence Alignment of Bilingual Corpora. 2002.
 https://www.microsoft.com/en-us/research/publication/fast- 
and-accurate-sentence-alignment-of-bilingual-corpora/

 [3] Large scale parallel document mining for machine translation. 2010.
 http://www.aclweb.org/anthology/C/C10/C10-1124.pdf

 [4] Building Bilingual Parallel Corpora Based on Wikipedia. 2010.
 http://www.academia.edu/download/39073036/building_ 
bilingual_parallel_corpora.pdf

 [5] Problematizing and Addressing the Article-as-Concept Assumption in
 Wikipedia. 2017
 http://www.brenthecht.com/publications/cscw17_subarticles.pdf

 [6] Omnipedia: Bridging the Wikipedia Language Gap. 2012.
 http://www.brenthecht.com/papers/bhecht_CHI2012_omnipedia.pdf

 Best wishes,
 Scott

 --
 Dr Scott Hale
 Senior Data Scientist
 Oxford Internet Institute, University of Oxford
 Turing Fellow, Alan Turing Institute
 http://www.scotthale.net/
 scott.hale(a)oii.ox.ac.uk
 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l 
 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l