[Wikimedia-l] The case for supporting open source machine translation
Nikola Smolenski
smolensk at eunet.rs
Thu Apr 25 09:31:27 UTC 2013
On 24/04/13 12:35, Denny Vrandečić wrote:
> Current machine translation research aims at using massive machine learning
> supported systems. They usually require big parallel corpora. We do not
> have big parallel corpora (Wikipedia articles are not translations of each
> other, in general), especially not for many languages, and there is no
Could you define "big"? Across all language editions Wikipedia has
well over 20 million articles, so if 10% of them are translations of
each other, we already have some 2 million translation pairs.
Assuming ten sentences per average article, that is 20 million
sentence pairs. An average Wikipedia with 100,000 articles would have
10,000 translations and 100,000 sentence pairs; a large Wikipedia
with 1,000,000 articles would have 100,000 translations and 1,000,000
sentence pairs. Is this not enough to kickstart a massive machine
learning supported system? (Consider also that Wikipedia articles are
fairly similar in structure and linguistically less rich than general
text; the future tense, for example, is rarely used.)
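
To make the arithmetic concrete, here is a minimal back-of-envelope
sketch in Python; the 10% translation share and ten sentences per
article are of course assumptions, not measured figures:

    def sentence_pairs(articles, translated_share=0.10,
                       sentences_per_article=10):
        # Estimate how many parallel sentence pairs one Wikipedia
        # edition of a given size could contribute.
        translations = int(articles * translated_share)
        return translations, translations * sentences_per_article

    for size in (100000, 1000000, 20000000):
        translations, pairs = sentence_pairs(size)
        print("%10d articles -> %9d article pairs, %10d sentence pairs"
              % (size, translations, pairs))

Even under these rough assumptions, the larger editions alone would
yield corpora in the millions of sentence pairs.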