Re: [Wikimedia-l] machine translation

2 May 2017

I think it all depends on the level of engagement of the human translator.

When the tool is used in the right way, it is a fantastic tool.

Maybe we can find better methods to nudge people toward taking their time
and really doing work on their translations.

Thanks,
Pharos

On Tue, May 2, 2017 at 4:09 PM, Bodhisattwa Mandal <bodhisattwa.rgkmc(a)gmail.com> wrote:

...
 Content translation with Yandex is also a problem in Bengali Wikipedia.
 Some users have developed a tendency to create meaningless machine-translated
 articles with this extension to increase their edit count and article count.
 This has increased the workload of admins, who have to find and delete those
 articles.

 Yandex is not ready for many languages and it is better to shut it off. We
 don't need it in Bengali.

 Regards
 On May 3, 2017 12:17 AM, "John Erling Blad" <jeblad(a)gmail.com> wrote:

 Actually this _is_ about turning ContentTranslation off, that is what
 several users in the community want. They block people using the extension
 and delete the translated articles. Use of ContentTranslation has become a
 rather contentious case.

 Yandex as a general translation engine to be able to read some alien
 language is quite good, but as an engine to produce written text it is not
 very good at all. In fact it often creates quite horrible Norwegian, even
 for closely related languages. One quite common problem is reordering of
 words into meaningless constructs, another is getting lexical gender wrong
 in weird ways. The English indefinite article "a" is often translated as
 "en" in a noun phrase, and then the gender is added to the following noun.
 That gives a translation of "Oppland is a county in…" into something like
 "Oppland er en fylket i…" This should be "Oppland er et fylke i…".

 (I just checked and it seems like Yandex messes up a lot less now than
 previously, but it is still pretty bad.)

 Apertium works because the languages are closely related, Yandex does not
 work because it is used between very different languages. People try to use
 Yandex and get disappointed, and falsely conclude that all language
 translations are equally weird. They are not, but Yandex translations are
 weird.

 The numerical threshold does not work. The reason is simple: the number of
 fixes depends on which language constructs fail, and that is simply not a
 constant for small text fragments. Perhaps we could flag specific language
 constructs that are known to give a high percentage of failures, and
 require the translator to check those sentences. One such language
 construct is a discrepancy between the article and the gender of the
 following term in a noun phrase. If they do not agree, then the sentence
 must be checked. It is not always wrong to write "en jenta" in Norwegian,
 but it is likely to be wrong.
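
 A minimal sketch of such a flag, in Python, might look like the following.
 Everything in it is an assumption made for illustration: the tiny LEXICON,
 the ARTICLE_GENDERS table (which assumes Bokmål, where "en" may also precede
 feminine nouns) and the function name suspicious_pairs are hypothetical and
 not part of ContentTranslation. A real check would need a full morphological
 dictionary.

```python
import re

# Hypothetical mini-lexicon: surface form -> (gender, is_definite).
# Illustration only; a real check needs a proper morphological dictionary.
LEXICON = {
    "fylke": ("neuter", False),
    "fylket": ("neuter", True),
    "jente": ("feminine", False),
    "jenta": ("feminine", True),
    "gutt": ("masculine", False),
    "gutten": ("masculine", True),
}

# Genders each indefinite article may precede (assumption: Bokmål usage).
ARTICLE_GENDERS = {
    "en": {"masculine", "feminine"},
    "ei": {"feminine"},
    "et": {"neuter"},
}


def suspicious_pairs(sentence):
    """Return (article, noun) pairs that the translator should re-check."""
    tokens = re.findall(r"\w+", sentence.lower())
    flagged = []
    for article, noun in zip(tokens, tokens[1:]):
        if article not in ARTICLE_GENDERS or noun not in LEXICON:
            continue  # unknown words are ignored rather than guessed at
        gender, is_definite = LEXICON[noun]
        # Flag a gender mismatch, or an indefinite article directly
        # followed by a definite noun form.
        if gender not in ARTICLE_GENDERS[article] or is_definite:
            flagged.append((article, noun))
    return flagged


print(suspicious_pairs("Oppland er en fylket i Norge"))  # [('en', 'fylket')]
print(suspicious_pairs("Oppland er et fylke i Norge"))   # []
```

 The result is only a list of places to look at, not a verdict, which matches
 the point that "en jenta" is likely, but not certain, to be wrong.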

 A language model could be a statistical model for the language itself, not
 for the translation into that language. We don't want a perfect language
 model, but a sufficient language model to mark weird constructs. A very
 simple solution could simply be to mark tri-grams that do not already
 exist in the text base for the destination as possible errors. It is not
 necessary to do a live check, but at least do it before the page can be
 saved.
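
 The tri-gram idea can be sketched in a few lines of Python. This is only an
 illustration under stated assumptions: the two-sentence corpus stands in for
 the destination wiki's text base, tokenization is a naive word split, and
 the function names (trigrams, build_known_trigrams, unseen_trigrams) are
 made up for this example.

```python
import re


def trigrams(text):
    """Split text into lowercase word tokens and return its word tri-grams."""
    tokens = re.findall(r"\w+", text.lower())
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


def build_known_trigrams(corpus_sentences):
    """Collect every tri-gram seen in the destination-language text base."""
    known = set()
    for sentence in corpus_sentences:
        known.update(trigrams(sentence))
    return known


def unseen_trigrams(translation, known):
    """Return tri-grams of the translation that never occur in the text base."""
    return [t for t in trigrams(translation) if t not in known]


# Stand-in for the destination text base (assumption: two toy sentences).
corpus = [
    "Oppland er et fylke i Norge",
    "Hedmark er et fylke i Norge",
]
known = build_known_trigrams(corpus)

print(unseen_trigrams("Hedmark er et fylke i Norge", known))   # []
print(unseen_trigrams("Oppland er en fylket i Norge", known))
```

 With a text base this small almost any new sentence is flagged, so in
 practice the set of known tri-grams would have to come from a large dump,
 and unseen tri-grams would be shown as "possible errors" rather than
 blocking the save outright.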

 Note the difference in what Yandex does and what we want to achieve; Yandex
 translates a text between two different languages, without any clear reason
 why. It is not too important if there are weird constructs in the text, as
 long as it is usable in "some" context. We translate a text for the purpose
 of republishing it. The text should be usable and easily readable in that
 language.

 On Tue, May 2, 2017 at 7:07 PM, Amir E. Aharoni <amir.aharoni(a)mail.huji.ac.il> wrote:

 > 2017-05-02 18:20 GMT+03:00 John Erling Blad <jeblad(a)gmail.com>:
 >
 > > Brute force solution; turn the ContentTranslation off. Really stupid
 > > solution.
 >
 > ... Then I guess you don't mind that I'm changing the thread name :)
 >
 > > The next solution; turn the Yandex engine off. That would solve a
 > > part of the problem. Kind of lousy solution though.
 > >
 > > What about adding a language model that warns when the language
 > > constructs get too weird? It is like a "test" for the translation. The
 > > CT is used for creating a translation, but the language model is used
 > > for verifying if the translation is good enough. If it does not validate
 > > against the language model it should simply not be published to the main
 > > name space. It will still be possible to create a draft, but then the
 > > user is completely aware that the translation isn't good enough.
 > >
 > > Such a language model should be available as a test for any article, as
 > > it can be used as a quality measure for the article. It is really a
 > > quantity measure for the well-spokenness of the article, but that isn't
 > > quite so intuitive.
 > >
 >
 > So, I'll allow myself to guess that you are talking about one particular
 > language, probably Norwegian.
 >
 > Several technical facts:
 >
 > 1. In the past there were several cases in which translators to different
 > languages reported common translation mistakes to me. I passed them on
 > to Yandex developers, with whom I communicate quite regularly. They
 > acknowledged receiving all of them. I am aware of at least one such common
 > mistake that was fixed; possibly there were more. If you can give me a
 > list of such mistakes for Norwegian, I'll be very happy to pass them on. I
 > absolutely cannot promise that they will be fixed upstream, but it's
 > possible.

 > 2. In Norwegian, Apertium is used for translating between the two varieties
 > of Norwegian itself (Bokmål and Nynorsk), and from other Scandinavian
 > languages. That's probably why it works so well: they are similar in
 > grammar, vocabulary, and narrative style (I'll pass it on to Apertium
 > developers; I'm sure they'll be happy to hear it). Unfortunately, machine
 > translation from English is not available in Apertium. Apertium works best
 > with very similar languages, and English has two characteristics that are
 > unfortunate when combined: it is both the most popular source for
 > translation into almost all other languages (including Norwegian), and it
 > is not _very_ similar to any other languages (except maybe Scots). Machine
 > translation from English into Norwegian is only possible with Yandex at the
 > moment. More engines may be added in the future, but at the moment that's
 > all we have. That's why disabling Yandex completely would indeed be a lousy
 > solution: A lot of people say that without machine translation integration
 > Content Translation is useless. Not all users think like that, but many do.

 > 3. We can define a numerical threshold of acceptable percentage of machine
 > translation post-editing. Currently it's 75%. It's a tad embarrassing that
 > it's hard-coded at the moment, but it can very easily be made into a
 > variable per language. If the translator tries to publish a page in which
 > less than that is modified, a warning will be shown.

 > 4. I'm not sure what you mean by "language model". If it's any kind of a
 > linguistic engine, then it's definitely not within the resources that the
 > Language team itself can currently dedicate. However, if somebody who knows
 > Norwegian and some programming will write a script that analyzes common bad
 > constructs in a Wikipedia dump, this will be very useful. This would
 > basically be an upgraded version of suggestion #1 above. (In my spare time
 > as a volunteer I'm doing something comparable for Hebrew, although not for
 > translation, but for improving how MediaWiki link trails work.)
 _______________________________________________
 Wikimedia-l mailing list, guidelines at:
 https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
 https://meta.wikimedia.org/wiki/Wikimedia-l
 New messages to: Wikimedia-l(a)lists.wikimedia.org
 Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
 <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>
