On 24/04/13 12:35, Denny Vrandečić wrote:
> Current machine translation research aims at using massive machine
> learning supported systems. They usually require big parallel corpora.
> We do not have big parallel corpora (Wikipedia articles are not
> translations of each other, in general), especially not for many
> languages, and there is no [...]
Could you define "big"? If 10% of Wikipedia articles are translations of
each other, that is 2 million translation pairs (out of the roughly 20
million articles across all languages that this figure implies). Assuming
ten sentences per average article, that gives 20 million sentence pairs.
An average Wikipedia with 100,000 articles would have 10,000 translations
and 100,000 sentence pairs; a large Wikipedia with 1,000,000 articles
would have 100,000 translations and 1,000,000 sentence pairs. Is this not
enough to kickstart a massive machine-learning-supported system?
(Consider also that the articles are somewhat similar in structure and
less rich than general text; the future tense, for example, is rarely
used.)
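
To make the assumptions explicit, here is a minimal sketch of the
arithmetic above. The 10% overlap, the ten-sentences-per-article average,
and the 20-million-article total are all assumptions of this estimate,
not measured figures:

    # Back-of-envelope corpus-size estimate. All constants are
    # assumptions from the discussion above, not measured data.
    TRANSLATED_FRACTION = 0.10   # assumed share of articles with a counterpart
    SENTENCES_PER_ARTICLE = 10   # assumed average article length

    def corpus_estimate(articles: int) -> tuple[int, int]:
        """Return (translation pairs, sentence pairs) for a Wikipedia
        of the given size, under the assumptions above."""
        pairs = int(articles * TRANSLATED_FRACTION)
        return pairs, pairs * SENTENCES_PER_ARTICLE

    # All Wikipedias combined, a large Wikipedia, an average Wikipedia.
    for size in (20_000_000, 1_000_000, 100_000):
        pairs, sentences = corpus_estimate(size)
        print(f"{size:>12,} articles -> {pairs:>10,} translation pairs, "
              f"{sentences:>12,} sentence pairs")

Running this reproduces the figures above: 2 million translation pairs
and 20 million sentence pairs for all Wikipedias combined, down to 10,000
translation pairs and 100,000 sentence pairs for a 100,000-article
Wikipedia.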