Re: [Wikimedia-l] machine translation

3 May 2017

Hello,
This seems to me like a social problem, rather than a technical one.
Shutting down the tool would be a disadvantage for those people who benefit
from the tool and do good things with it.
What is the general opinion among the Norwegians about this issue? Is there
consensus about how to deal with this kind of "articles"? If most people
agree they should be speedy-deleted, wouldn't that be a useful deterrent for
those who are not careful enough when using the tool?
Kind regards
Ziko

2017-05-03 13:22 GMT+02:00 John Erling Blad <jeblad(a)gmail.com>:

> Agree! I also wonder if translators adapt to specific errors if they are
> repeated too often. I wonder if it works like priming the brain to a
> specific pattern.

> On Wed, May 3, 2017 at 1:15 PM, Lodewijk <lodewijk(a)effeietsanders.org>
> wrote:
>
>> Reading this, I get a strong impression the problem may very well be in
>> setting expectations for the users of this translation tool. If they
>> expect the automated translation to be rather good, they may get fed up
>> more easily than when they consider it primarily a glorified dictionary.
>>
>> Lodewijk

>> On Wed, May 3, 2017 at 1:06 PM, David Cuenca Tudela <dacuetu(a)gmail.com>
>> wrote:
>>
>>> Perhaps it would be a good idea to compare the translated text to the
>>> text that the user wants to save.
>>>
>>> If they are more than 95% the same, that means that the user didn't make
>>> the effort to correct the text.
>>>
>>> Cheers,
>>> Micru
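
A minimal sketch of the comparison suggested above, assuming the raw machine output and the text about to be saved are both available as plain strings. It uses Python's standard difflib module to compute a word-level similarity ratio. The function names and the word-level tokenisation are illustrative assumptions, not part of ContentTranslation.

```python
from difflib import SequenceMatcher

# The 95% figure comes from the suggestion in the message above.
UNCHANGED_THRESHOLD = 0.95


def similarity(machine_text: str, saved_text: str) -> float:
    """Return a ratio between 0 and 1 describing how alike the two texts are.

    The texts are compared word by word, so fixing a single word in a
    long paragraph lowers the ratio only slightly.
    """
    matcher = SequenceMatcher(
        None, machine_text.split(), saved_text.split(), autojunk=False
    )
    return matcher.ratio()


def looks_unedited(machine_text: str, saved_text: str,
                   threshold: float = UNCHANGED_THRESHOLD) -> bool:
    """True when the saved text is more than `threshold` identical to the MT output."""
    return similarity(machine_text, saved_text) > threshold
```

For example, `looks_unedited("Oppland er en fylket i", "Oppland er en fylket i")` is true, while correcting the phrase to "Oppland er et fylke i" changes two of five words and drops the ratio well below the threshold.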

>>> On Wed, May 3, 2017 at 10:31 AM, Wojciech Pędzich <wpedzich(a)gmail.com>
>>> wrote:
>>>
>>>> It does depend a lot on the engagement level of the human behind the
>>>> keyboard. When I deal with machine-translated text, I simply wonder
>>>> whether the person behind the keyboard made the effort to actually read
>>>> the piece.
>>>>
>>>> Now, whether this would work if limited to namespaces outside "main": I
>>>> do not want to demonise the issue, but if the person submitting the text
>>>> for machine translation does not read it, what will stop them from a
>>>> quick ctrl+c / ctrl+v? Just asking.
>>>>
>>>> Wojciech
>>>>
>>>> On 2017-05-03 at 09:33, Yaroslav Blanter wrote:
>>>>
>>>>> Creating machine translations only in the draft space (or in the user
>>>>> space in the projects which do not have draft) could help.
>>>>>
>>>>> Cheers
>>>>> Yaroslav
>>>>>
>>>>> On Tue, May 2, 2017 at 10:16 PM, Pharos <pharosofalexandria(a)gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I think it all depends on the level of engagement of the human
>>>>>> translator.
>>>>>>
>>>>>> When the tool is used in the right way, it is a fantastic tool.
>>>>>>
>>>>>> Maybe we can find better methods to nudge people toward taking their
>>>>>> time and really doing work on their translations.
>>>>>>
>>>>>> Thanks,
>>>>>> Pharos
>>>>>>
>>>>>> On Tue, May 2, 2017 at 4:09 PM, Bodhisattwa Mandal
>>>>>> <bodhisattwa.rgkmc(a)gmail.com> wrote:
>>>>>>
>>>>>>> Content translation with Yandex is also a problem in Bengali Wikipedia.
>>>>>>> Some users have developed a tendency to create meaningless
>>>>>>> machine-translated articles with this extension to increase their edit
>>>>>>> count and article count. This has increased the workload of admins, who
>>>>>>> have to find and delete those articles.
>>>>>>>
>>>>>>> Yandex is not ready for many languages and it is better to shut it
>>>>>>> down. We don't need it in Bengali.
>>>>>>>
>>>>>>> Regards
>>>>>>> On May 3, 2017 12:17 AM, "John Erling Blad" <jeblad(a)gmail.com> wrote:
>>>>>>>
>>>>>>>> Actually this _is_ about turning ContentTranslation off; that is what
>>>>>>>> several users in the community want. They block people using the
>>>>>>>> extension and delete the translated articles. Use of
>>>>>>>> ContentTranslation has become a rather contentious case.
>>>>>>>>
>>>>>>>> Yandex as a general translation engine for reading some foreign
>>>>>>>> language is quite good, but as an engine for producing written text it
>>>>>>>> is not very good at all. In fact it often creates quite horrible
>>>>>>>> Norwegian, even for closely related languages. One quite common problem
>>>>>>>> is reordering of words into meaningless constructs; another problem is
>>>>>>>> reordering lexical gender in weird ways. The English article "a" is
>>>>>>>> often translated as "en" in a prepositional phrase, and then the gender
>>>>>>>> is added to the following phrase. That turns "Oppland is a county in…"
>>>>>>>> into something like "Oppland er en fylket i…". This should be "Oppland
>>>>>>>> er et fylke i…".
>>>>>>>>
>>>>>>>> (I just checked and it seems like Yandex messes up a lot less now than
>>>>>>>> previously, but it is still pretty bad.)
>>>>>>>>
>>>>>>>> Apertium works because the languages are closely related; Yandex does
>>>>>>>> not work because it is used between very different languages. People
>>>>>>>> try to use Yandex, get disappointed, and falsely conclude that all
>>>>>>>> language translations are equally weird. They are not, but Yandex
>>>>>>>> translations are weird.
>>>>>>>>
>>>>>>>> The numerical threshold does not work. The reason is simple: the number
>>>>>>>> of fixes depends on which language constructs fail, and that is simply
>>>>>>>> not a constant for small text fragments. Perhaps we could flag specific
>>>>>>>> language constructs that are known to give a high percentage of
>>>>>>>> failures, and require the translator to check those sentences. One such
>>>>>>>> language construct is disagreement between the article and the gender
>>>>>>>> of the following term in a prepositional phrase. If they do not agree,
>>>>>>>> then the sentence must be checked. It is not always wrong to write "en
>>>>>>>> jenta" in Norwegian, but it is likely to be wrong.
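
The agreement check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the four-word lexicon is a hypothetical stand-in for a real Norwegian morphological dictionary, the function name is invented, and only an article immediately followed by a known noun is examined. A flagged sentence is a candidate for review, not a confirmed error.

```python
import re

# Hypothetical toy lexicon: word form -> (grammatical gender, is_definite).
# A real check would need a full morphological dictionary for Norwegian.
LEXICON = {
    "fylke": ("neuter", False),
    "fylket": ("neuter", True),
    "jente": ("feminine", False),
    "jenta": ("feminine", True),
}

# Genders each indefinite article can precede. Bokmål also accepts "en"
# before feminine nouns, so "en jente" is not flagged.
ARTICLES = {
    "en": {"masculine", "feminine"},
    "ei": {"feminine"},
    "et": {"neuter"},
}


def needs_review(sentence: str) -> bool:
    """Flag a sentence where an indefinite article is directly followed by a
    known noun that is in definite form or of a non-matching gender."""
    words = re.findall(r"\w+", sentence.lower())
    for article, noun in zip(words, words[1:]):
        if article in ARTICLES and noun in LEXICON:
            gender, definite = LEXICON[noun]
            if definite or gender not in ARTICLES[article]:
                return True
    return False
```

With this toy lexicon, "Oppland er en fylket i…" and "en jenta" are flagged, while "Oppland er et fylke i…" passes. Constructs with an adjective between article and noun are not covered by this sketch.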
>>>>>>>>
>>>>>>>> A language model could be a statistical model for the language itself,
>>>>>>>> not for the translation into that language. We don't want a perfect
>>>>>>>> language model, but a language model sufficient to mark weird
>>>>>>>> constructs. A very simple solution could be to mark trigrams that do
>>>>>>>> not already exist in the text base for the destination as possible
>>>>>>>> errors. It is not necessary to do a live check, but at least do it
>>>>>>>> before the page can be saved.
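
The trigram idea above can be illustrated with a short sketch. It assumes the "text base for the destination" is available as an iterable of plain-text strings (for instance, article texts extracted from a dump) and treats a trigram as three consecutive lower-cased words. All function names are illustrative assumptions.

```python
import re
from typing import Iterable, List, Set, Tuple

Trigram = Tuple[str, ...]


def trigrams(text: str) -> List[Trigram]:
    """Split text into lower-cased words and return consecutive word triples."""
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def build_known_trigrams(corpus_texts: Iterable[str]) -> Set[Trigram]:
    """Collect every trigram that occurs in the reference text base."""
    known: Set[Trigram] = set()
    for text in corpus_texts:
        known.update(trigrams(text))
    return known


def unseen_trigrams(candidate: str, known: Set[Trigram]) -> List[Trigram]:
    """Return the candidate's trigrams that never occur in the text base.

    These are only *possible* errors: rare but correct phrases will be
    reported too, so the result is a prompt for a human to re-read.
    """
    return [t for t in trigrams(candidate) if t not in known]
```

Run once before saving, `unseen_trigrams(translation, known)` gives the spans a translator would be asked to look at again. A set lookup like this knows nothing about frequency, which matches the "sufficient, not perfect" model described in the message.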
>>>>>>>>
>>>>>>>> Note the difference between what Yandex does and what we want to
>>>>>>>> achieve: Yandex translates a text between two different languages,
>>>>>>>> without any clear reason why. It is not too important if there are
>>>>>>>> weird constructs in the text, as long as it is usable in "some"
>>>>>>>> context. We translate a text for the purpose of republishing it. The
>>>>>>>> text should be usable and easily readable in that language.
>>>>>>>>
>>>>>>>> On Tue, May 2, 2017 at 7:07 PM, Amir E. Aharoni
>>>>>>>> <amir.aharoni(a)mail.huji.ac.il> wrote:
>>>>>>>>
>>>>>>>>> 2017-05-02 18:20 GMT+03:00 John Erling Blad <jeblad(a)gmail.com>:
>>>>>>>>>
>>>>>>>>>> Brute force solution; turn the ContentTranslation off. Really stupid
>>>>>>>>>> solution.
>>>>>>>>>
>>>>>>>>> ... Then I guess you don't mind that I'm changing the thread name :)
>>>>>>>>>
>>>>>>>>>> The next solution; turn the Yandex engine off. That would solve a
>>>>>>>>>> part of the problem. Kind of lousy solution though.
>>>>>>>>>>
>>>>>>>>>> What about adding a language model that warns when the language
>>>>>>>>>> constructs get too weird? It is like a "test" for the translation.
>>>>>>>>>> The CT is used for creating a translation, but the language model is
>>>>>>>>>> used for verifying if the translation is good enough. If it does not
>>>>>>>>>> validate against the language model it should simply not be published
>>>>>>>>>> to the main name space. It will still be possible to create a draft,
>>>>>>>>>> but then the user is completely aware that the translation isn't good
>>>>>>>>>> enough.
>>>>>>>>>>
>>>>>>>>>> Such a language model should be available as a test for any article,
>>>>>>>>>> as it can be used as a quality measure for the article. It is really
>>>>>>>>>> a quantity measure for the well-spokenness of the article, but that
>>>>>>>>>> isn't quite so intuitive.
>>>>>>>>>
>>>>>>>>> So, I'll allow myself to guess that you are talking about one
>>>>>>>>> particular language, probably Norwegian.
>>>>>>>>>
>>>>>>>>> Several technical facts:
>>>>>>>>>
>>>>>>>>> 1. In the past there were several cases in which translators to
>>>>>>>>> different languages reported common translation mistakes to me. I
>>>>>>>>> passed them on to Yandex developers, with whom I communicate quite
>>>>>>>>> regularly. They acknowledged receiving all of them. I am aware of at
>>>>>>>>> least one such common mistake that was fixed; possibly there were
>>>>>>>>> more. If you can give me a list of such mistakes for Norwegian, I'll
>>>>>>>>> be very happy to pass them on. I absolutely cannot promise that they
>>>>>>>>> will be fixed upstream, but it's possible.
>>>>>>>>>
>>>>>>>>> 2. In Norwegian, Apertium is used for translating between the two
>>>>>>>>> varieties of Norwegian itself (Bokmål and Nynorsk), and from other
>>>>>>>>> Scandinavian languages. That's probably why it works so well: they are
>>>>>>>>> similar in grammar, vocabulary, and narrative style (I'll pass it on
>>>>>>>>> to Apertium developers; I'm sure they'll be happy to hear it).
>>>>>>>>> Unfortunately, machine translation from English is not available in
>>>>>>>>> Apertium. Apertium works best with very similar languages, and English
>>>>>>>>> has two characteristics which are unfortunate when combined: it is
>>>>>>>>> both the most popular source for translation into almost all other
>>>>>>>>> languages (including Norwegian), and it is not _very_ similar to any
>>>>>>>>> other languages (except maybe Scots). Machine translation from English
>>>>>>>>> into Norwegian is only possible with Yandex at the moment. More
>>>>>>>>> engines may be added in the future, but at the moment that's all we
>>>>>>>>> have. That's why disabling Yandex completely would indeed be a lousy
>>>>>>>>> solution: a lot of people say that without machine translation
>>>>>>>>> integration Content Translation is useless. Not all users think like
>>>>>>>>> that, but many do.
>>>>>>>>>
>>>>>>>>> 3. We can define a numerical threshold of acceptable percentage of
>>>>>>>>> machine translation post-editing. Currently it's 75%. It's a tad
>>>>>>>>> embarrassing, but it's hard-coded at the moment; it can very easily be
>>>>>>>>> made into a variable per language. If the translator tries to publish
>>>>>>>>> a page in which less than that is modified, a warning will be shown.
>>>>>>>>>
>>>>>>>>> 4. I'm not sure what you mean by "language model". If it's any kind of
>>>>>>>>> a linguistic engine, then it's definitely not within the resources
>>>>>>>>> that the Language team itself can currently dedicate. However, if
>>>>>>>>> somebody who knows Norwegian and some programming will write a script
>>>>>>>>> that analyzes common bad constructs in a Wikipedia dump, this will be
>>>>>>>>> very useful. This would basically be an upgraded version of suggestion
>>>>>>>>> #1 above. (In my spare time as a volunteer I'm doing something
>>>>>>>>> comparable for Hebrew, although not for translation, but for improving
>>>>>>>>> how MediaWiki link trails work.)
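
The per-language threshold mentioned in point 3 could look roughly like the sketch below. It reads the message literally: a warning is shown when the modified share of the machine translation falls below the threshold. The 0.75 default is the hard-coded value quoted in the message; the override table, its "nb" entry, and the function names are hypothetical and do not describe the actual ContentTranslation configuration.

```python
# Default taken from the message above ("Currently it's 75%").
DEFAULT_THRESHOLD = 0.75

# Hypothetical per-language overrides, keyed by language code.
# The value for "nb" is made up purely for illustration.
THRESHOLD_BY_LANGUAGE = {
    "nb": 0.85,
}


def threshold_for(language_code: str) -> float:
    """Return the threshold for a target language, falling back to the default."""
    return THRESHOLD_BY_LANGUAGE.get(language_code, DEFAULT_THRESHOLD)


def should_warn(modified_fraction: float, language_code: str) -> bool:
    """True when less of the machine translation was modified than the threshold."""
    return modified_fraction < threshold_for(language_code)
```

Keeping the values in a lookup table rather than in code is what makes the threshold adjustable per wiki without a software change.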
>>>>>>>>> _______________________________________________
>>>>>>>>> Wikimedia-l mailing list, guidelines at:
>>>>>>>>> https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
>>>>>>>>> https://meta.wikimedia.org/wiki/Wikimedia-l
>>>>>>>>> New messages to: Wikimedia-l(a)lists.wikimedia.org
>>>>>>>>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
>>>>>>>>> <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>

 --
 Etiamsi omnes, ego non
