On 24/04/13 12:35, Denny Vrandečić wrote:
Current machine translation research aims at using massive machine learning supported systems. They usually require big parallel corpora. We do not have big parallel corpora (Wikipedia articles are not translations of each other, in general), especially not for many languages, and there is no
Could you define "big"? If 10% of Wikipedia's roughly 20 million articles (across all language editions) are translations of each other, we have 2 million translation pairs. Assuming ten sentences per average article, that is 20 million sentence pairs. An average Wikipedia with 100,000 articles would have 10,000 translations and 100,000 sentence pairs; a large Wikipedia with 1,000,000 articles would have 100,000 translations and 1,000,000 sentence pairs - is that not enough to kickstart a "massive machine learning supported system"? (Consider also that Wikipedia articles are fairly uniform in structure and less linguistically rich than general text - the future tense, for example, is rarely used.)
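To make the arithmetic explicit, here is a quick back-of-envelope sketch in Python. The 10% translation rate and ten sentences per article are the assumptions stated above; the ~20 million total articles figure is what the 2 million pairs imply, not a measured count.

# Rough estimate of parallel data obtainable from Wikipedia, under the
# assumptions in the paragraph above (10% of articles have a counterpart
# translation, ~10 sentences per article). Sizes are illustrative.

TRANSLATION_RATE = 0.10      # assumed fraction of articles that are translations
SENTENCES_PER_ARTICLE = 10   # assumed average sentences per article

def parallel_estimate(num_articles):
    """Return (translation pairs, sentence pairs) for a Wikipedia of given size."""
    translations = int(num_articles * TRANSLATION_RATE)
    sentence_pairs = translations * SENTENCES_PER_ARTICLE
    return translations, sentence_pairs

for label, size in [("all Wikipedias (~20M articles)", 20_000_000),
                    ("average Wikipedia (100k articles)", 100_000),
                    ("large Wikipedia (1M articles)", 1_000_000)]:
    t, s = parallel_estimate(size)
    print(f"{label}: {t:,} translation pairs, {s:,} sentence pairs")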