On Fri, Jun 21, 2013 at 2:54 PM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
For literature grade translations, that is, for a true dictionary, I believe
that you need the full range of nuances attached to each word and each word
sense, which is distinct from the platonic concepts described by data items.
Literature-grade translations require knowledge of both cultural domains and of context, and sometimes there is simply no correspondence between concepts. This is also an amazing quote:
"Why does a translator need a whole workday to translate five pages, and
not an hour or two? ..... About 90% of an average text corresponds to
these simple conditions. But unfortunately, there's the other 10%. It's
that part that requires six [more] hours of work. There are ambiguities
one has to resolve. For instance, the author of the source text, an
Australian physician, cited the example of an epidemic which was
declared during World War II in a "Japanese prisoner of war camp". Was
he talking about an American camp with Japanese prisoners or a Japanese
camp with American prisoners? The English has two senses. It's necessary
therefore to do research, maybe to the extent of a phone call to
Australia." -- Claude Piron, Le défi des langues (The Language Challenge)
However, we are in a wonderful situation, because:
1) Wikipedia is not a literary work, so the translation requirements are not that high.
2) It has a lot of users who can manually disambiguate the source text with semantic annotations, and users in the target language who can fill the gaps.
3) There is prior work in the open source rule-based machine translation (RBMT) world, so there is no need to start from scratch.
4) A translation portal for the wiki world already exists and is going to be expanded.
Basically, almost all the building blocks needed to create a powerful MT system for Wikipedia are already there or waiting to be integrated. What I believe is missing is a model for storing the morphological information from Wiktionary templates in a structured way, so that the data becomes machine-readable and usable. It will require some prior work to create a coherent model based on the expression/sense entity types. Doable with some intense full-time dedication :)
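To make the expression/sense idea a bit more concrete, here is a rough Python sketch of what such a model might look like. Everything in it is an assumption for illustration only: the class names `Expression` and `Sense`, the `item_id` field, the feature labels, and the placeholder identifier "Q-EXAMPLE" are all hypothetical and do not correspond to any existing Wiktionary or Wikidata schema.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class Sense:
    """One meaning of an expression (hypothetical entity type)."""
    gloss: str
    # Optional link to a language-neutral data item; None when no
    # concept corresponds to this sense. The ID format is made up.
    item_id: Optional[str] = None


@dataclass
class Expression:
    """A word in one language, with inflected forms and senses."""
    language: str
    lemma: str
    part_of_speech: str
    # Maps a set of grammatical features to the inflected surface form.
    forms: Dict[FrozenSet[str], str] = field(default_factory=dict)
    senses: List[Sense] = field(default_factory=list)

    def add_form(self, form: str, *features: str) -> None:
        """Register the surface form for a combination of features."""
        self.forms[frozenset(features)] = form

    def inflect(self, *features: str) -> Optional[str]:
        """Return the form for the given features, or None if unknown."""
        return self.forms.get(frozenset(features))


# Illustrative data, entered by hand rather than parsed from a template.
house = Expression(language="en", lemma="house", part_of_speech="noun")
house.add_form("house", "singular")
house.add_form("houses", "plural")
house.senses.append(
    Sense(gloss="a building for people to live in", item_id="Q-EXAMPLE")
)
house.senses.append(Sense(gloss="a legislative assembly"))
```

Keeping the senses separate from the data items they may point to is deliberate here: a sense can carry its own nuances, and it can exist even when no matching concept is available on the other side.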
Cheers,
Micru