Content translation with Yandex is also a problem in Bengali Wikipedia.
Some users have developed a habit of creating meaningless machine-translated
articles with this extension to increase their edit count and article count.
This has increased the workload of admins, who have to find and delete those
articles. Yandex is not ready for many languages and it is better to shut it
off. We don't need it in Bengali.
Regards
On May 3, 2017 12:17 AM, "John Erling Blad" <jeblad(a)gmail.com> wrote:
Actually this _is_ about turning ContentTranslation off; that is what
several users in the community want. They block people using the extension
and delete the translated articles. Use of ContentTranslation has become a
rather contentious case.
Yandex as a general translation engine to be able to read some alien
language is quite good, but as an engine to produce written text it is not
very good at all. In fact it often creates quite horrible Norwegian, even
for closely related languages. One quite common problem is reordering of
words into meaningless constructs, another problem is reordering lexical
gender in weird ways. The English indefinite article "a" is often translated
as "en" in a noun phrase, and then the gender is added to the
following noun. That gives a translation of "Oppland is a county in…"
into something like "Oppland er en fylket i…" This should be "Oppland er
et fylke i…".
(I just checked and it seems like Yandex messes up a lot less now than
previously, but it is still pretty bad.)
Apertium works because the languages are closely related; Yandex does not
work because it is used between very different languages. People try to use
Yandex, get disappointed, and falsely conclude that all language
translations are equally weird. They are not, but Yandex translations are
weird.
The numerical threshold does not work. The reason is simple: the number of
fixes depends on which language constructs fail, and that is simply not a
constant for small text fragments. Perhaps it would work if we could flag
specific language constructs that are known to give a high percentage of
failures, and if the translator had to check those sentences. One such
language construct is a discrepancy between the article and the gender of
the following term in a noun phrase. If they do not agree, then the
sentence must be checked. It is not always wrong to write "en jenta" in
Norwegian, but it is likely to be wrong.
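A minimal sketch of such a flag in Python. The lexicon `NOUN_FORMS` and the table `ARTICLE_GENDERS` are tiny hand-written illustrations, not data from any real Norwegian resource, and the sketch assumes that Bokmål accepts "en" with feminine nouns. It marks an indefinite article followed by a noun whose gender or definiteness does not fit, as something to review rather than as a certain error:

```python
import re

# Hypothetical toy lexicon: noun form -> (gender, definiteness).
# A real check would need a full morphological lexicon.
NOUN_FORMS = {
    "fylke": ("neuter", "indefinite"),
    "fylket": ("neuter", "definite"),
    "jente": ("feminine", "indefinite"),
    "jenta": ("feminine", "definite"),
}

# Indefinite articles and the noun genders they are assumed to accept.
ARTICLE_GENDERS = {
    "en": {"masculine", "feminine"},
    "ei": {"feminine"},
    "et": {"neuter"},
}


def flag_article_mismatches(sentence):
    """Return (article, noun) pairs that a translator should re-check."""
    tokens = re.findall(r"\w+", sentence.lower())
    flagged = []
    for article, noun in zip(tokens, tokens[1:]):
        if article not in ARTICLE_GENDERS or noun not in NOUN_FORMS:
            continue  # unknown words are left alone in this sketch
        gender, definiteness = NOUN_FORMS[noun]
        wrong_gender = gender not in ARTICLE_GENDERS[article]
        # An indefinite article directly before a definite form is suspicious.
        if wrong_gender or definiteness == "definite":
            flagged.append((article, noun))
    return flagged


if __name__ == "__main__":
    print(flag_article_mismatches("Oppland er en fylket i Norge"))
    print(flag_article_mismatches("Oppland er et fylke i Norge"))
```

Words missing from the lexicon are skipped, so the sketch can only under-report; it never blocks a sentence it knows nothing about.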
A language model could be a statistical model for the language itself, not
for the translation into that language. We don't want a perfect language
model, but a sufficient language model to mark weird constructs. A very
simple solution could be to mark tri-grams that do not already exist in
the text base for the destination language as possible errors. It is not
necessary to do a live check, but at least do it before the page can be
saved.
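The tri-gram idea could be sketched roughly like this in Python. The function names and the two-sentence "text base" are made up for illustration; a real text base would be built from a dump of the destination wiki, and tokenization would need more care than a simple word regex:

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def trigrams(tokens):
    """Return all consecutive three-word sequences as tuples."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


def build_trigram_set(corpus_texts):
    """Collect every tri-gram seen in the destination-language text base."""
    seen = set()
    for text in corpus_texts:
        seen.update(trigrams(tokenize(text)))
    return seen


def unseen_trigrams(translation, known):
    """Return the tri-grams of the translation that the text base lacks."""
    return [t for t in trigrams(tokenize(translation)) if t not in known]


if __name__ == "__main__":
    # Hypothetical miniature text base, for illustration only.
    text_base = [
        "Oppland er et fylke i Norge",
        "Hedmark er et fylke i Norge",
    ]
    known = build_trigram_set(text_base)
    for suspect in unseen_trigrams("Oppland er en fylket i Norge", known):
        print("possible error:", " ".join(suspect))
```

With a small text base almost everything is unseen, so in practice the marks would be hints for the translator before saving, not a hard block.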
Note the difference in what Yandex does and what we want to achieve: Yandex
translates a text between two different languages, without any clear reason
why. It is not too important if there are weird constructs in the text, as
long as it is usable in "some" context. We translate a text for the purpose
of republishing it. The text should be usable and easily readable in that
language.
On Tue, May 2, 2017 at 7:07 PM, Amir E. Aharoni <amir.aharoni(a)mail.huji.ac.il> wrote:
2017-05-02 18:20 GMT+03:00 John Erling Blad <jeblad(a)gmail.com>:
> Brute force solution; turn the ContentTranslation off. Really stupid
> solution.
... Then I guess you don't mind that I'm changing the thread name :)
> The next solution; turn the Yandex engine off. That would solve a
> part of the problem. Kind of lousy solution though.
> What about adding a language model that warns when the language
> constructs get too weird? It is like a "test" for the translation. The CT
> is used for creating a translation, but the language model is used for
> verifying if the translation is good enough. If it does not validate
> against the language model it should simply not be published to the main
> name space. It will still be possible to create a draft, but then the user
> is completely aware that the translation isn't good enough.
> Such a language model should be available as a test for any article, as
> it can be used as a quality measure for the article. It is really a
> quantity measure for the well-spokenness of the article, but that isn't
> quite so intuitive.
So, I'll allow myself to guess that you are talking about one particular
language, probably Norwegian.
Several technical facts:
1. In the past there were several cases in which translators into different
languages reported common translation mistakes to me. I passed them on to
Yandex developers, with whom I communicate quite regularly. They
acknowledged receiving all of them. I am aware of at least one such common
mistake that was fixed; possibly there were more. If you can give me a list
of such mistakes for Norwegian, I'll be very happy to pass them on. I
absolutely cannot promise that they will be fixed upstream, but it's
possible.
2. In Norwegian, Apertium is used for translating between the two varieties
of Norwegian itself (Bokmål and Nynorsk), and from other Scandinavian
languages. That's probably why it works so well: they are similar in
grammar, vocabulary, and narrative style (I'll pass it on to Apertium
developers; I'm sure they'll be happy to hear it). Unfortunately, machine
translation from English is not available in Apertium. Apertium works best
with very similar languages, and English has two characteristics, which are
unfortunate when combined: it is both the most popular source for
translation into almost all other languages (including Norwegian), and it
is not _very_ similar to any other languages (except maybe Scots). Machine
translation from English into Norwegian is only possible with Yandex at the
moment. More engines may be added in the future, but at the moment that's
all we have. That's why disabling Yandex completely would indeed be a lousy
solution: A lot of people say that without machine translation integration
Content Translation is useless. Not all users think like that, but many
do.
3. We can define a numerical threshold of acceptable percentage of machine
translation post-editing. Currently it's 75%. It's a tad embarrassing that
it's hard-coded at the moment, but it can very easily be made into a
variable per language. If the translator tries to publish a page in which
less than that is modified, a warning will be shown.
4. I'm not sure what you mean by "language model". If it's any kind of a
linguistic engine, then it's definitely not within the resources that the
Language team itself can currently dedicate. However, if somebody who knows
Norwegian and some programming will write a script that analyzes common bad
constructs in a Wikipedia dump, this will be very useful. This would
basically be an upgraded version of suggestion #1 above. (In my spare time
as a volunteer I'm doing something comparable for Hebrew, although not for
translation, but for improving how MediaWiki link trails work.)
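To illustrate the per-language threshold from point 3, here is a rough Python sketch. How Content Translation actually measures the modified share is not described in this thread, so the token-level `difflib.SequenceMatcher` comparison is only a stand-in, and `THRESHOLDS`, `DEFAULT_THRESHOLD` and the function names are hypothetical:

```python
from difflib import SequenceMatcher

# The 75% figure comes from the thread; the per-language table is hypothetical.
DEFAULT_THRESHOLD = 0.75
THRESHOLDS = {"nb": 0.75}


def modified_fraction(machine_text, edited_text):
    """Estimate the share of the machine translation that was changed.

    Assumption: similarity is measured on whitespace-separated tokens.
    """
    machine_tokens = machine_text.split()
    edited_tokens = edited_text.split()
    if not machine_tokens and not edited_tokens:
        return 0.0
    matcher = SequenceMatcher(None, machine_tokens, edited_tokens, autojunk=False)
    return 1.0 - matcher.ratio()


def needs_warning(machine_text, edited_text, language):
    """Warn when less than the language's threshold has been modified."""
    threshold = THRESHOLDS.get(language, DEFAULT_THRESHOLD)
    return modified_fraction(machine_text, edited_text) < threshold


if __name__ == "__main__":
    raw = "Oppland er en fylket i Norge"
    print(needs_warning(raw, raw, "nb"))  # unedited machine output
```

Keeping the threshold in a lookup table rather than a constant is what would let each language community tune it separately.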
_______________________________________________
Wikimedia-l mailing list, guidelines at:
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
https://meta.wikimedia.org/wiki/Wikimedia-l
New messages to: Wikimedia-l(a)lists.wikimedia.org
Unsubscribe:
https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
<mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>