2017-05-02 21:47 GMT+03:00 John Erling Blad <jeblad@gmail.com>:
> Yandex as a general translation engine, to be able to read some alien
> language, is quite good, but as an engine to produce written text it is
> not very good at all.
... Nor is it supposed to be.
A translator is a person. Machine translation software is not a person,
it's software. It's a tool that is supposed to help a human translator
produce a good written text more quickly. If it doesn't make this work
faster, it can and should be disabled.
> In fact it often creates quite horrible Norwegian, even for closely
> related languages. One quite common problem is reordering of words into
> meaningless constructs; another problem is assigning lexical gender in
> weird ways. The English article "a" is often translated as "en" in a
> noun phrase, and then the wrong gender is carried over to the following
> noun. That turns a translation of "Oppland is a county in…" into
> something like "Oppland er en fylket i…", when it should be "Oppland er
> et fylke i…".
I suggest making a page with a list of such examples, so that the machine
translation developers could read it.
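The specific failure pattern described above (an indefinite article whose gender disagrees with the following noun) could in principle be caught by a very simple rule. A toy sketch, where the gender lexicon is a made-up stub rather than a real Norwegian dictionary:

```python
# Toy check for the article/noun gender mismatch described above.
# NOUN_GENDER is a hypothetical stub; a real implementation would use
# a full Norwegian lexicon.
NOUN_GENDER = {
    "fylke": "neuter",   # "et fylke"
    "fylket": "neuter",  # definite form, same lexical gender
    "kommune": "masculine",
}

ARTICLE_GENDER = {"en": "masculine", "ei": "feminine", "et": "neuter"}

def flag_article_mismatch(sentence: str):
    """Return (article, noun) pairs whose genders disagree."""
    words = sentence.lower().split()
    flags = []
    for article, noun in zip(words, words[1:]):
        if article in ARTICLE_GENDER and noun in NOUN_GENDER:
            if ARTICLE_GENDER[article] != NOUN_GENDER[noun]:
                flags.append((article, noun))
    return flags
```

Running this on the example sentence flags the bad pair ("en", "fylket"), while the corrected "Oppland er et fylke i Norge" passes cleanly. Collecting such patterns on a wiki page, as suggested, would make it easy to turn them into checks like this one.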
> (I just checked, and it seems like Yandex messes up a lot less now than
> previously, but it is still pretty bad.)
I guess that this is something that Yandex developers will be happy to hear
:)
More seriously, it's quite possible that they already used some of the
translations made by the Norwegian Wikipedia community. In addition to
being published as an article, each translated paragraph is saved into
parallel corpora, and machine translation developers read the edited text
and use it to improve their software. This is completely open and usable by
all machine translation developers, not only by Yandex.
> The numerical threshold does not work. The reason is simple: the number
> of fixes depends on which language constructs fail, and that is simply
> not constant for small text fragments. Perhaps we could flag specific
> language constructs that are known to give a high percentage of
> failures, and require the translator to check those sentences. One such
> construct is disagreement between the article and the gender of the
> following noun in a noun phrase.
The question is how we would do it with our software. I simply cannot
imagine doing it with the current MediaWiki platform, unless we develop a
sophisticated NLP engine, although it's possible I'm exaggerating or
forgetting something.
> A language model could be a statistical model of the language itself,
> not of the translation into that language. We don't need a perfect
> language model, just one good enough to mark weird constructs. A very
> simple solution could be to mark trigrams that do not already exist in
> the text base for the target language as possible errors. It is not
> necessary to do a live check, but it should at least be done before the
> page can be saved.
See above—we don't have support for plugging something like that into our
workflow.
Perhaps one day some AI/machine-learning system like ORES would be able to
do it. Maybe it could be an extension to ORES itself.
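For what it's worth, the trigram idea from the quoted message is only a few lines of code; the hard part is the corpus and the workflow, not the check itself. A toy sketch, assuming a plain word-level tokenizer and a tiny stand-in corpus (a real text base would be the whole target-language Wikipedia):

```python
from typing import List, Set, Tuple

Trigram = Tuple[str, str, str]

def build_trigram_set(corpus_sentences: List[str]) -> Set[Trigram]:
    """Collect every word trigram seen in the target-language corpus."""
    trigrams: Set[Trigram] = set()
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        for i in range(len(words) - 2):
            trigrams.add((words[i], words[i + 1], words[i + 2]))
    return trigrams

def flag_unseen_trigrams(text: str, known: Set[Trigram]) -> List[Trigram]:
    """Return trigrams in the translated text never seen in the corpus."""
    words = text.lower().split()
    return [
        (words[i], words[i + 1], words[i + 2])
        for i in range(len(words) - 2)
        if (words[i], words[i + 1], words[i + 2]) not in known
    ]

# Tiny illustrative corpus; in practice this would be the full
# target-language text base.
corpus = [
    "Oppland er et fylke i Norge",
    "Akershus er et fylke i Norge",
]
known = build_trigram_set(corpus)

# The mistranslation "en fylket" produces trigrams absent from the corpus.
flags = flag_unseen_trigrams("Oppland er en fylket i Norge", known)
```

With a real corpus this naive version would drown translators in false positives (proper nouns, rare but valid combinations), which is why something like an ORES-style scoring service, rather than an exact-match check at save time, is probably the more realistic home for it.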
> Note the difference between what Yandex does and what we want to
> achieve: Yandex translates a text between two different languages,
> without any clear purpose in mind. It is not too important if there are
> weird constructs in the text, as long as it is usable in "some" context.
> We translate a text for the purpose of republishing it. The text should
> be usable and easily readable in that language.
This is a well-known problem in machine translation: domain.
Professional industrial translation powerhouses use internally customized
machine translation engines that specialize in particular domains, such as
medicine, law, or news. In theory, it would make a lot of sense to have a
customized machine translation engine for encyclopedic articles, or maybe
even for several different styles of encyclopedic articles (biography,
science, history, etc.). For now what we have is a very general-purpose
consumer-oriented engine. I hope it changes in the future.