[Wikimedia-l] The case for supporting open source machine translation
Nikola Smolenski
smolensk at eunet.rs
Thu Apr 25 09:31:27 UTC 2013
On 24/04/13 12:35, Denny Vrandečić wrote:
> Current machine translation research aims at using massive machine learning
> supported systems. They usually require big parallel corpora. We do not
> have big parallel corpora (Wikipedia articles are not translations of each
> other, in general), especially not for many languages, and there is no
Could you define "big"? Across all language editions Wikipedia has
well over 20 million articles, so if 10% of them are translations of
each other, we already have some 2 million translation pairs.
Assuming ten sentences per average article, that is 20 million
sentence pairs. An average Wikipedia with 100,000 articles would have
10,000 translations and 100,000 sentence pairs; a large Wikipedia
with 1,000,000 articles would have 100,000 translations and 1,000,000
sentence pairs. Is this not enough to kickstart a massive machine
learning supported system? (Consider also that Wikipedia articles are
fairly similar in structure and linguistically less rich than general
text; the future tense, for example, is rarely used.)
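
To make the arithmetic concrete, here is a minimal back-of-envelope
sketch in Python; the 10% translation share and ten sentences per
article are of course assumptions, not measured figures:

    def sentence_pairs(articles, translated_share=0.10,
                       sentences_per_article=10):
        # Estimate how many parallel sentence pairs one Wikipedia
        # edition of a given size could contribute.
        translations = int(articles * translated_share)
        return translations, translations * sentences_per_article

    for size in (100000, 1000000, 20000000):
        translations, pairs = sentence_pairs(size)
        print("%10d articles -> %9d article pairs, %10d sentence pairs"
              % (size, translations, pairs))

Even under these rough assumptions, the larger editions alone would
yield corpora in the millions of sentence pairs.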