On Fri, Jun 21, 2013 at 2:54 PM, Daniel Kinzler <daniel.kinzler(a)wikimedia.de>
wrote:
For literature grade translations, that is, for a true dictionary, I believe
that you need the full range of nuances attached to each word and each word
sense, which is distinct from the platonic concepts described by data items.
Literature grade translations require knowledge of both cultural domains
and context, and sometimes there is no correspondence between concepts at
all. This is also an amazing quote:
"Why does a translator need a whole workday to translate five pages, and
not an hour or two? ..... About 90% of an average text corresponds to these
simple conditions. But unfortunately, there's the other 10%. It's that part
that requires six [more] hours of work. There are ambiguities one has to
resolve. For instance, the author of the source text, an Australian
physician, cited the example of an epidemic which was declared during World
War II in a "Japanese prisoner of war camp". Was he talking about an
American camp with Japanese prisoners or a Japanese camp with American
prisoners? The English has two senses. It's necessary therefore to do
research, maybe to the extent of a phone call to Australia." -- Claude
Piron, Le défi des langues (The Language Challenge)
However, we are in a wonderful situation, because:
1) Wikipedia is not a literary work, so the translation requirements are
not that high.
2) It has a lot of users who can manually disambiguate the source text
with semantic annotations, and users in the target language who can fill
the gaps.
3) There is prior work done in the RBMT open source world, so there is no
need to start from scratch.
4) A translation portal for the wiki world already exists and it is going
to be expanded.
Basically, almost all the blocks needed to create a powerful MT system for
WP are already there or waiting to be integrated. What I believe is missing
is a model for storing the morphological information from Wiktionary
templates in a structured way, so that the data becomes machine readable
and usable. It will require some prior work to create a coherent model
based on the expression/sense entity types. Doable with some intense
full-time dedication :)
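
To make the idea a bit more concrete, here is a minimal sketch in Python of
what an expression/sense model could look like. It is only an illustration
under my own assumptions: the class names (Expression, Sense), the field
names, and the feature labels ("singular", "plural") are all hypothetical and
do not correspond to any existing Wikidata or Wiktionary schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Sense:
    """One meaning of an expression (hypothetical entity type)."""
    gloss: str
    # Assumed optional link to a language-independent data item.
    item_id: Optional[str] = None


@dataclass
class Expression:
    """A lemma in one language, with its inflected forms and its senses."""
    lemma: str
    language: str
    part_of_speech: str
    # Maps a sorted tuple of grammatical features to the inflected form.
    forms: Dict[Tuple[str, ...], str] = field(default_factory=dict)
    senses: List[Sense] = field(default_factory=list)

    def add_form(self, form: str, *features: str) -> None:
        """Store an inflected form under its grammatical features."""
        self.forms[tuple(sorted(features))] = form

    def inflect(self, *features: str) -> Optional[str]:
        """Return the stored form for these features, or None if unknown."""
        return self.forms.get(tuple(sorted(features)))


# Example: the morphological data a Wiktionary noun template encodes.
camp = Expression(lemma="camp", language="en", part_of_speech="noun")
camp.add_form("camp", "singular")
camp.add_form("camps", "plural")
camp.senses.append(Sense(gloss="place where prisoners are held"))
```

With the forms stored as data rather than as template output, an RBMT
engine could look up camp.inflect("plural") directly, and each sense could
be linked to a data item separately from the morphology.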
Cheers,
Micru