[Foundation-l] Google Translate now assists with human translations of Wikipedia articles

Mon Jun 15 13:30:34 UTC 2009

It depends on how much a priori knowledge you have about the languages.
For the moment people tend to fall into two camps: those who want to use
statistical engines and those who want to go for rule-based engines.
According to one person there is some activity to include rules in
statistical engines and vice versa, but it still needs a lot of work.

Identifying a language isn't that difficult in itself; most search
engines are quite good at that. Many engines can even be told to
interpret the text according to a specific language, so the problem is
basically nonexistent for us.

Still, because our articles have a lot of text that isn't part of a
single language, and in addition there is also specialized markup,
some kind of parsing should be done before the translation engine
starts processing the text.
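
As a rough illustration of that parsing step, here is a minimal Python
sketch that separates plain text from markup so that only the text is
handed to the engine. The patterns are simplified assumptions (templates
without nesting and <ref> elements only), and split_for_translation and
translate_article are hypothetical names, not part of any existing tool.

```python
import re

# Simplified, assumed patterns: real wikitext has nesting, tables, links, etc.
PROTECTED = re.compile(
    r"\{\{[^{}]*\}\}"            # {{template}} without nested templates
    r"|<ref\b[^>]*>.*?</ref>",   # <ref ...>...</ref> footnotes
    re.DOTALL,
)


def split_for_translation(wikitext):
    """Return (kind, text) pairs where kind is 'text' or 'markup'."""
    segments = []
    pos = 0
    for match in PROTECTED.finditer(wikitext):
        if match.start() > pos:
            segments.append(("text", wikitext[pos:match.start()]))
        segments.append(("markup", match.group(0)))
        pos = match.end()
    if pos < len(wikitext):
        segments.append(("text", wikitext[pos:]))
    return segments


def translate_article(wikitext, translate):
    """Apply translate() to text segments only; markup passes through unchanged."""
    return "".join(
        translate(chunk) if kind == "text" else chunk
        for kind, chunk in split_for_translation(wikitext)
    )
```

Here translate can be any callable that takes and returns a string. Text
in another language inside an article would need one more segment kind,
which this sketch does not attempt.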

After some discussions last winter I am quite sure a rule-based engine
works best for small languages, but that a working solution should use
some kind of self-learning mechanism to refine the translation or at
least identify errors.

Our idea was to use statistics to identify cases where existing rules
failed, and let people define the new rules. Failing rules would be
detected by checking which translated sentences got changed afterwards.
Actually it is a bit more difficult than this... ;)
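
A very rough Python sketch of the statistics part, under loud
assumptions: the engine is assumed to log which rules fired for each
sentence, and a sentence counts as "changed" if the stored text differs
from the machine output. The record layout, the rule ids and the name
rule_failure_rates are all hypothetical. It only ranks suspects; it
cannot tell which of several rules in a sentence actually failed.

```python
from collections import defaultdict


def rule_failure_rates(records, min_uses=5):
    """Rank rules by how often sentences they touched were edited afterwards.

    records: iterable of (rule_ids, machine_sentence, final_sentence).
    Returns a list of (rule_id, changed_fraction), highest fraction first.
    """
    uses = defaultdict(int)
    changed = defaultdict(int)
    for rule_ids, machine, final in records:
        was_changed = machine.strip() != final.strip()
        for rule in set(rule_ids):  # count each rule once per sentence
            uses[rule] += 1
            if was_changed:
                changed[rule] += 1
    # Ignore rarely used rules, since their fractions are mostly noise.
    rates = {
        rule: changed[rule] / uses[rule]
        for rule in uses
        if uses[rule] >= min_uses
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

People would then look at the rules at the top of the list and decide
whether a new or corrected rule is needed.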

And no, I'm not a linguist...

John

>>> One of the most important things that is needed for adding languages to a
>>> technology like this is having a sufficiently sized corpus.
>> Yes, that was basically my main question: What is sufficient? How many
>> pages or MB of text? At least the order of magnitude.
>>
>> Marcus Buck