I can point you to two methods: CL-ESA and CL-CNG.
Cross-Language Explicit Semantic Analysis (CL-ESA):
This model allows for language-independent comparison of texts without
relying on parallel corpora or translation dictionaries for training.
Rather, it exploits the cross-language links of Wikipedia articles to embed
documents from two or more languages in a joint vector space, rendering
them directly comparable, e.g., using cosine similarity. The more language
links exist between two Wikipedia languages, the higher the dimensionality
of the joint vector space can be made, and the better a cross-language
ranking will perform. At the document level, near-perfect recall on a
ranking task is achieved at 100,000 dimensions (=linked articles across
languages). See Table 2 of the paper. The model is easy to implement, but
somewhat expensive to compute.
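To make the idea concrete, here is a minimal sketch of the CL-ESA projection, assuming a toy concept index in which the i-th article of one language is linked to the i-th article of the other; the overlap-count weighting is a simplified stand-in for the retrieval-style weighting used in practice:

```python
from collections import Counter
from math import sqrt

def esa_vector(doc_tokens, concept_index):
    """Project a document into the concept space: one dimension per
    Wikipedia article, valued by lexical overlap with that article.
    (A toy stand-in for the TF-IDF-based similarity used in practice.)"""
    counts = Counter(doc_tokens)
    return [sum(counts[t] for t in concept_tokens)
            for concept_tokens in concept_index]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical aligned concept index: the i-th English article is linked
# to the i-th German article via Wikipedia language links.
concepts_en = [["music", "song"], ["computer", "software"]]
concepts_de = [["musik", "lied"], ["computer", "software"]]

doc_en = "a song about music".split()
doc_de = "ein lied über musik".split()

v_en = esa_vector(doc_en, concepts_en)  # same joint vector space ...
v_de = esa_vector(doc_de, concepts_de)  # ... for both languages
sim = cosine(v_en, v_de)
```

The key point is that both documents end up in the same vector space, so plain cosine similarity compares them directly, without translating either one.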
Cross-language Character N-Gram model (CL-CNG):
In subsequent experiments, we compared the model with alternatives; one
that is trained on the basis of a parallel corpus, and another that simply
exploits lexical overlap of character N-grams between pairs of documents
from different languages:
As it turns out, CL-C3G (i.e., N=3) is extremely effective, too, on
language pairs that share an alphabet and where lexical overlap can be
expected, e.g., due to them having a common ancestor. So, it works very
well for German-Dutch, but less so for English-Russian. In the latter case,
CL-ESA works, though. The CL-CNG model is even easier to implement and
very scalable. Depending on the language pairs you are investigating, this
model may help a great deal.
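A minimal sketch of CL-C3G, using Jaccard overlap of character trigram sets as one simple instantiation (other weightings are possible); the example strings are mine and only illustrate the alphabet effect described above:

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def cl_cng_similarity(doc_a, doc_b, n=3):
    """Jaccard overlap of character n-gram sets between two documents."""
    a, b = char_ngrams(doc_a, n), char_ngrams(doc_b, n)
    return len(a & b) / len(a | b) if a | b else 0.0

# Related languages with a shared alphabet overlap; unrelated alphabets do not.
sim_de_nl = cl_cng_similarity("das wasser ist kalt", "het water is koud")
sim_en_ru = cl_cng_similarity("the water is cold", "вода холодная")
```

As the example suggests, the score is nonzero for German-Dutch but drops to zero for English-Russian, which is why the choice between CL-CNG and CL-ESA hinges on the language pair.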
Perhaps these models may be of use when building a cross-language alignment.
Dr. Martin Potthast
Digital Bauhaus Lab
+49 3643 58 3567
+49 171 809 1945