Re: [Wiki-research-l] ground truth for section alignment across languages

30 Aug 2017

Hi Leila,

I can point you to two methods: CL-ESA and CL-CNG.

Cross-Language Explicit Semantic Analysis (CL-ESA):
http://www.uni-weimar.de/medien/webis/publications/papers/stein_2008b.pdf

This model allows for language-independent comparison of texts without
relying on parallel corpora or translation dictionaries for training.
Rather, it exploits the cross-language links of Wikipedia articles to embed
documents from two or more languages in a joint vector space, rendering
them directly comparable, e.g., using cosine similarity. The more language
links exist between two Wikipedia languages, the higher the dimensionality
of the joint vector space can be made, and the better a cross-language
ranking will perform. At the document level, near-perfect recall on a
ranking task is achieved at 100,000 dimensions (=linked articles across
languages). See Table 2 of the paper. The model is easy to implement,
but somewhat expensive to compute.
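
To make the idea concrete, here is a minimal sketch in Python. It is not
the implementation from the paper: the function names, the two-article toy
index, and the use of raw term frequencies instead of tf-idf weights are
all my assumptions for illustration. The only thing it relies on is that
index_a[i] and index_b[i] are the texts of two Wikipedia articles joined
by a language link.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term frequencies (a simple stand-in for tf-idf)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def esa_vector(text, index_articles):
    """Concept vector: similarity of `text` to each index article
    written in the same language as `text`."""
    doc = tf_vector(text)
    return {i: cosine(doc, tf_vector(article))
            for i, article in enumerate(index_articles)}


def cl_esa_similarity(text_a, index_a, text_b, index_b):
    """Compare two documents from different languages in the joint
    concept space. index_a[i] and index_b[i] must describe the same
    concept (hypothetical aligned index built from language links)."""
    if len(index_a) != len(index_b):
        raise ValueError("index collections must be aligned")
    return cosine(esa_vector(text_a, index_a),
                  esa_vector(text_b, index_b))


# Toy aligned index with two concepts (made-up texts, not Wikipedia data).
INDEX_EN = ["the dog is a domestic animal that barks",
            "the car is a vehicle with an engine and wheels"]
INDEX_DE = ["der hund ist ein haustier das bellt",
            "das auto ist ein fahrzeug mit motor und reifen"]

same_topic = cl_esa_similarity("my dog barks", INDEX_EN,
                               "der hund bellt laut", INDEX_DE)
other_topic = cl_esa_similarity("my dog barks", INDEX_EN,
                                "das auto hat einen motor", INDEX_DE)
print(same_topic > other_topic)
```

The cost mentioned above shows up in esa_vector: every document has to be
compared against every index article of its language, so in practice one
would precompute an inverted index over the index collection.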

Cross-language Character N-Gram model (CL-CNG):
In subsequent experiments, we compared CL-ESA with two alternatives: one
that is trained on a parallel corpus, and another that simply exploits
lexical overlap of character N-grams between pairs of documents from
different languages:
http://www.uni-weimar.de/medien/webis/publications/papers/stein_2011b.pdf

As it turns out, CL-C3G (i.e., N=3) is extremely effective, too, on
language pairs that share an alphabet and where lexical overlap can be
expected, e.g., due to them having a common ancestor. So, it works very
well for German-Dutch, but less so for English-Russian. In the latter case,
CL-ESA works, though. The CL-CNG model is even easier to implement and
very scalable. Depending on the language pairs you are investigating, this
model may help a great deal.
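
A minimal Python sketch of the character N-gram comparison follows. Again,
this is an illustration under my own assumptions rather than the code used
in the paper: the function names, the normalization (lowercasing, replacing
non-alphanumeric characters with spaces), and the unweighted N-gram counts
are choices I made to keep it short.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and
    replacing every non-alphanumeric character with a single space."""
    kept = "".join(c if c.isalnum() else " " for c in text.lower())
    cleaned = " ".join(kept.split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cl_cng_similarity(text_a, text_b, n=3):
    """Cosine similarity between the character n-gram count vectors."""
    a, b = char_ngrams(text_a, n), char_ngrams(text_b, n)
    dot = sum(w * b.get(k, 0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up example sentences: German vs. Dutch, and German vs. an
# unrelated English sentence.
german = "Die Universität ist international bekannt"
dutch = "De universiteit is internationaal bekend"
unrelated = "A small bird sang loudly"

related_score = cl_cng_similarity(german, dutch)
unrelated_score = cl_cng_similarity(german, unrelated)
print(related_score > unrelated_score)
```

No training data or index collection is needed, which is why this scales
so easily. For pairs with different scripts the N-gram sets barely
intersect, so the scores collapse toward zero.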

Perhaps these models may be of use when building a cross-language alignment
tool.

Best,
Martin

-- 
Dr. Martin Potthast
Bauhaus-Universität Weimar
Digital Bauhaus Lab
Bauhausstr. 9a
99423 Weimar
Germany

+49 3643 58 3567
+49 171 809 1945

www.potthast.net
