Hi all,
A quick note to announce a model for crosslingual document embedding that
we recently developed. It is called Cr5 (for "Crosslingual Document
Embedding as Reduced-Rank Ridge Regression") and essentially lets you take
any text document in any supported language and represent it as a vector in
a language-independent way, such that documents can be compared across
languages. For instance, the Finnish Wikipedia article Olut
<https://fi.wikipedia.org/wiki/Olut> will get a vector representation
similar to that of the English article Beer
<https://en.wikipedia.org/wiki/Beer>, since they are about the same
concept, even though they share almost no surface-level vocabulary.
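To make "similar vector representation" concrete: a common way to compare
two embedding vectors is cosine similarity. The sketch below is only an
illustration, not part of the Cr5 API. The function name and the
4-dimensional vectors are made up and stand in for real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up toy vectors (assumption): NOT actual Cr5 output.
olut_fi = [0.9, 0.1, 0.3, 0.0]   # stands in for the Finnish article "Olut"
beer_en = [0.8, 0.2, 0.4, 0.1]   # stands in for the English article "Beer"
chess_en = [0.0, 0.9, 0.1, 0.7]  # stands in for an unrelated article

print(cosine_similarity(olut_fi, beer_en))   # same concept: close to 1
print(cosine_similarity(olut_fi, chess_en))  # unrelated: much lower
```

With real embeddings, the vectors would come from the pre-trained model
rather than being typed in by hand.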
We are publishing a pre-trained model with a small API that is very easy to
use:
https://github.com/epfl-dlab/Cr5
The model currently supports 28 languages [1], but it can readily be
trained for different sets of languages (code provided on GitHub, see
above).
The provided model was trained on Wikipedia (surprise...) and essentially
learns how words in different languages correspond to one another from the
crosslingual article alignments provided by Wikidata [2].
While the resulting model can be applied to any text (not just Wikipedia
articles), it works particularly well on Wikipedia [3] -- which is the
reason I'm writing this email: I really hope that the community will start
using Cr5 to make better sense of Wikipedia across languages. Ideas abound:
crosslingual section alignment, crosslingual plagiarism detection,
comparison of topical foci across languages, crosslingual keyword search,
etc.
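As a rough illustration of the search idea: once documents in all
languages live in the same vector space, crosslingual search can be
treated as nearest-neighbour ranking. The sketch below is hypothetical and
does not use the Cr5 API. The helper names, document ids and toy vectors
are all assumptions made for the example.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to query_vec.

    doc_vecs maps a document id to its embedding vector.
    """
    scored = [(doc_id, _cosine(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up toy index (assumption): real vectors would come from the model.
index = {
    "en:Beer": [0.8, 0.2, 0.4, 0.1],
    "de:Bier": [0.9, 0.0, 0.2, 0.1],
    "en:Chess": [0.0, 0.9, 0.1, 0.7],
}
query_fi = [0.9, 0.1, 0.3, 0.0]  # stands in for a Finnish query about beer

results = top_k(query_fi, index, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

For a large collection, an approximate nearest-neighbour index would be
the more practical choice than this linear scan.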
If there are any questions or comments, do drop us a line!
Bob
[1] bg, ca, cs, da, de, el, en, es, et, fi, fr, hr, hu, id, it, mk, nl, no,
pl, pt, ro, ru, sk, sl, sv, tr, uk, vi
[2] For more details on the method, please see the paper:
https://dlab.epfl.ch/people/west/pub/Josifoski-Paskov-Paskov-Jaggi-West_WSD…
[3] In fact, it achieves state-of-the-art performance in the context of
Wikipedia, outperforming the previous best method, published by Facebook,
by a wide margin.