2017-05-02 21:47 GMT+03:00 John Erling Blad jeblad@gmail.com:
As a general translation engine for reading a foreign language, Yandex is quite good, but as an engine for producing written text it is not very good at all.
... Nor is it supposed to be.
A translator is a person. Machine translation software is not a person; it's software. It's a tool that is supposed to help a human translator produce a good written text more quickly. If it doesn't make this work faster, it can and should be disabled. If no translator finds it useful, it can be removed entirely.
In fact it often creates quite horrible Norwegian, even between closely related languages. One quite common problem is the reordering of words into meaningless constructs; another is mishandling lexical gender in weird ways. The English article "a" is often translated as "en", and the gender marking then gets attached to the following noun, so that "Oppland is a county in…" becomes something like "Oppland er en fylket i…" when it should be "Oppland er et fylke i…".
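That particular class of error looks mechanical enough to detect automatically. A toy Python sketch, with a made-up two-word gender lexicon (a real check would obviously need a full one):

    # Toy check for article/noun gender disagreement in Norwegian.
    # The lexicon below is an invented fragment for illustration only.
    GENDER = {"fylke": "et", "kommune": "en"}  # correct indefinite article per noun

    def gender_errors(text):
        words = text.lower().split()
        errors = []
        for article, noun in zip(words, words[1:]):
            expected = GENDER.get(noun.rstrip(".,"))
            if article in ("en", "et") and expected and article != expected:
                errors.append((article, noun))
        return errors

    print(gender_errors("Oppland er en fylke i Norge."))  # -> [('en', 'fylke')]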
I suggest making a page with a list of such examples, so that the machine translation developers could read it.
(I just checked and it seems like Yandex messes up a lot less now than previously, but it is still pretty bad.)
I guess that this is something that Yandex developers will be happy to hear :)
More seriously, it's quite possible that they have already used some of the translations made by the Norwegian Wikipedia community. In addition to being published as an article, each translated paragraph is saved into parallel corpora, and machine translation developers can read the edited text and use it to improve their software. This is completely open and usable by all machine translation developers, not only Yandex.
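As a rough illustration of how a developer might consume it (the dump location, file name, and field names below are my assumptions, so check the actual documentation):

    import gzip
    import json
    import urllib.request

    # Assumed location and naming of the published ContentTranslation
    # parallel corpora; the real dump layout may differ.
    URL = ("https://dumps.wikimedia.org/other/contenttranslation/"
           "cx-corpora.en2nb.text.json.gz")

    with urllib.request.urlopen(URL) as response:
        records = json.loads(gzip.decompress(response.read()))

    # Each record is assumed to pair a source paragraph with the
    # human-edited target text.
    for record in records[:3]:
        print(record.get("source"), "->", record.get("target"))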
The numerical threshold does not work. The reason is simple: the number of fixes depends on which language constructs fail, and that is simply not a constant for small text fragments. Perhaps we could instead flag specific language constructs that are known to give a high percentage of failures, and require the translator to check those sentences. One such construct is disagreement between the article and the gender of the following noun in a prepositional phrase.
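To make the idea concrete, a rough Python sketch; the patterns and failure rates are invented for the example, not measured:

    import re

    # Hypothetical constructs known to fail often, each with an invented
    # failure rate; a real list would come from error statistics.
    RISKY_CONSTRUCTS = [
        (re.compile(r"\ben \w+et\b"), 0.8),  # "en" before a neuter definite form
        (re.compile(r"\bet \w+en\b"), 0.8),  # "et" before a masculine definite form
    ]

    def sentences_to_check(text, threshold=0.5):
        # Flag whole sentences containing a risky construct, instead of
        # counting edits against a flat numerical threshold.
        flagged = []
        for sentence in text.split("."):
            for pattern, failure_rate in RISKY_CONSTRUCTS:
                if failure_rate >= threshold and pattern.search(sentence):
                    flagged.append(sentence.strip())
                    break
        return flagged

    print(sentences_to_check("Oppland er en fylket i Norge."))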
The question is how we would do it with our software. I simply cannot imagine doing it with the current MediaWiki platform, unless we develop a sophisticated NLP engine, although it's possible that I'm exaggerating or forgetting something.
A language model could be a statistical model of the language itself, not of the translation into that language. We don't need a perfect language model, just one sufficient to mark weird constructs. A very simple solution could be to mark tri-grams that do not already exist in the text base for the destination language as possible errors. It is not necessary to do a live check, but it should at least run before the page can be saved.
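For illustration, a minimal Python sketch of such a check; the tiny "text base" here is a placeholder for a real corpus of existing articles in the destination language:

    # Build a set of word tri-grams from a text base in the destination
    # language, then mark tri-grams in a new translation that were never
    # seen before as possible errors.

    def trigrams(text):
        words = text.lower().split()
        return set(zip(words, words[1:], words[2:]))

    # Placeholder text base; in practice, a large dump of existing articles.
    text_base = "Oppland er et fylke i Norge . Akershus er et fylke i Norge ."
    known = trigrams(text_base)

    translation = "Oppland er en fylket i Norge ."
    for t in sorted(trigrams(translation) - known):
        print("possible error:", " ".join(t))

Running the check only on save, as suggested, keeps the cost of scanning the text base out of the live editing loop.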
See above—we don't have support for plugging something like that into our workflow.
Perhaps one day some AI/machine-learning system like ORES would be able to do it. Maybe it could be an extension to ORES itself.
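For comparison, this is roughly how a client queries ORES today; the "translationquality" model below is hypothetical, shown next to a real model ("damaging") for contrast:

    import requests

    # ORES scores a revision via a REST call of the form
    #   /v3/scores/{wiki}/{revid}/{model}
    def ores_score(wiki, revid, model):
        url = "https://ores.wikimedia.org/v3/scores/%s/%s/%s" % (wiki, revid, model)
        return requests.get(url).json()

    print(ores_score("enwiki", 12345678, "damaging"))  # a real, existing model
    # A translation-quality model would be a new addition, e.g.:
    # ores_score("nowiki", 12345678, "translationquality")  # hypothetical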
Note the difference between what Yandex does and what we want to achieve: Yandex translates a text between two different languages without any particular purpose in mind. It is not too important if there are weird constructs in the text, as long as it is usable in "some" context. We translate a text for the purpose of republishing it; the text should be usable and easily readable in that language.
This is a well-known problem in machine translation: domain mismatch.
Professional industrial translation powerhouses use internally customized machine translation engines that specialize in particular domains, such as medicine, law, or news. In theory, it would make a lot of sense to have a customized machine translation engine for encyclopedic articles, or maybe even for several different styles of encyclopedic articles (biography, science, history, etc.). For now, what we have is a very general-purpose, consumer-oriented engine. I hope that changes in the future.
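Even without retraining anything, one can imagine a thin routing layer in front of several such engines. A sketch, with the domain keywords and engine identifiers entirely invented:

    # Hypothetical routing of a text to a domain-specialized engine.
    # Keywords and engine names are invented; a real system would use a
    # trained domain classifier and real translation backends.
    DOMAIN_KEYWORDS = {
        "biography": {"born", "died", "career"},
        "science": {"theorem", "molecule", "species"},
    }

    def pick_engine(text, engines):
        words = set(text.lower().split())
        best = max(DOMAIN_KEYWORDS,
                   key=lambda d: len(DOMAIN_KEYWORDS[d] & words))
        if not DOMAIN_KEYWORDS[best] & words:
            best = "general"  # fall back when no domain keywords match
        return engines[best]

    engines = {"biography": "mt-biography",  # placeholder identifiers
               "science": "mt-science",
               "general": "mt-general"}
    print(pick_engine("She was born in Oppland and her career began...", engines))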