Re: [Wikimedia-l] machine translation

3 May 2017

> More seriously, it's quite possible that they already used some of the
> translations made by the Norwegian Wikipedia community. In addition to
> being published as an article, each translated paragraph is saved into
> parallel corpora, and machine translation developers read the edited text
> and use it to improve their software. This is completely open and usable by
> all machine translation developers, not only by Yandex.

It is quite possible that the Yandex people have done something, as the
translations are a lot better now than previously. It also implies that it
is really important to correct the text inside CT.

> The question is how would we do it with our software. I simply cannot
> imagine doing it with the current MediaWiki platform, unless we develop a
> sophisticated NLP engine, although it's possible I'm exaggerating or
> forgetting something.

There are several places this can be inserted, both in VE and in MW. What I
want is a kind of rather simple language model, but Aharoni proposed
LanguageTool in private communication. That lib is very interesting.

> Perhaps one day some AI/machine-learning system like ORES would be able to
> do it. Maybe it could be an extension to ORES itself.

I've seen language models implemented as neural nets, but it is not
necessary to do it like that. Actually it is more common to do it with
plain statistics.
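
A minimal sketch of the plain-statistics approach, assuming a toy reference
corpus and a naive regex tokenizer. Both are illustrative assumptions, as are
all function names here; none of this is existing MediaWiki, CT, or ORES code.
It marks every word trigram in a candidate text that does not occur in the
reference text, which is the simple check suggested in the quoted message
below:

```python
import re
from typing import Iterable, List, Set, Tuple

Trigram = Tuple[str, str, str]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (naive, assumption)."""
    return re.findall(r"\w+", text.lower())


def trigrams(tokens: List[str]) -> List[Trigram]:
    """Return all consecutive word triples, in order."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


def build_known_trigrams(corpus: Iterable[str]) -> Set[Trigram]:
    """Collect every trigram seen in the reference text base."""
    known: Set[Trigram] = set()
    for sentence in corpus:
        known.update(trigrams(tokenize(sentence)))
    return known


def flag_unseen_trigrams(text: str, known: Set[Trigram]) -> List[Trigram]:
    """Return the trigrams of `text` that never occur in the reference base."""
    return [t for t in trigrams(tokenize(text)) if t not in known]


# Toy reference corpus (made up for illustration; a real text base would be
# the existing articles in the destination language).
corpus = [
    "Oppland er et fylke i Norge.",
    "Hedmark er et fylke i Norge.",
]
known = build_known_trigrams(corpus)

# The faulty machine translation from the thread, completed with a toy ending.
flagged = flag_unseen_trigrams("Oppland er en fylket i Norge.", known)
for t in flagged:
    print("possible error:", " ".join(t))
```

With a text base this small nearly everything would be flagged, so an unseen
trigram is only a hint that a human should look at the sentence. A real check
would likely need a large text base and probably some frequency threshold or
smoothing.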

On Tue, May 2, 2017 at 9:25 PM, Amir E. Aharoni <
amir.aharoni(a)mail.huji.ac.il> wrote:

> 2017-05-02 21:47 GMT+03:00 John Erling Blad <jeblad(a)gmail.com>:
>
> > Yandex as a general translation engine to be able to read some alien
> > language is quite good, but as an engine to produce written text it is
> > not very good at all.
>
>
> ... Nor is it supposed to be.
>
> A translator is a person. Machine translation software is not a person,
> it's software. It's a tool that is supposed to help a human translator
> produce a good written text more quickly. If it doesn't make this work
> faster, it can and should be disabled. If no translator
>
>
> > In fact it often creates quite horrible Norwegian, even
> > for closely related languages. One quite common problem is reordering of
> > words into meaningless constructs; another problem is reordering lexical
> > gender in weird ways. The English article "a" is often translated as
> > "en" in a prepositional phrase, and then the gender is added to the
> > following phrase. That gives a translation of "Oppland is a county in…"
> > into something like "Oppland er en fylket i…" This should be "Oppland er
> > et fylke i…".
> >
>
> I suggest making a page with a list of such examples, so that the machine
> translation developers could read it.
>
>
> > (I just checked and it seems like Yandex messes up a lot less now than
> > previously, but it is still pretty bad.)
> >
>
> I guess that this is something that Yandex developers will be happy to hear
> :)
>
> More seriously, it's quite possible that they already used some of the
> translations made by the Norwegian Wikipedia community. In addition to
> being published as an article, each translated paragraph is saved into
> parallel corpora, and machine translation developers read the edited text
> and use it to improve their software. This is completely open and usable by
> all machine translation developers, not only by Yandex.
>
>
> > The numerical threshold does not work. The reason is simple: the number
> > of fixes depends on which language constructs fail, and that is simply
> > not a constant for small text fragments. Perhaps we could flag specific
> > language constructs that are known to give a high percentage of failures,
> > and require the translator to check those sentences. One such language
> > construct is a discrepancy between the article and the gender of the
> > following term in a prepositional phrase.
> >
>
> The question is how would we do it with our software. I simply cannot
> imagine doing it with the current MediaWiki platform, unless we develop a
> sophisticated NLP engine, although it's possible I'm exaggerating or
> forgetting something.
>
>
> > A language model could be a statistical model for the language itself,
> > not for the translation into that language. We don't want a perfect
> > language model, but a sufficient language model to mark weird constructs.
> > A very simple solution could simply be to mark tri-grams that do not
> > already exist in the text base for the destination as possible errors. It
> > is not necessary to do a live check, but at least do it before the page
> > can be saved.
> >
>
> See above—we don't have support for plugging something like that into our
> workflow.
>
> Perhaps one day some AI/machine-learning system like ORES would be able to
> do it. Maybe it could be an extension to ORES itself.
>
>
> > Note the difference in what Yandex does and what we want to achieve;
> > Yandex translates a text between two different languages, without any
> > clear reason why. It is not too important if there are weird constructs
> > in the text, as long as it is usable in "some" context. We translate a
> > text for the purpose of republishing it. The text should be usable and
> > easily readable in that language.
> >
>
> This is a well-known problem in machine translation: domain.
>
> Professional industrial translation powerhouses use internally-customized
> machine translation engines that specialize in particular domains, such as
> medicine, law, or news. In theory, it would make a lot of sense to have a
> customized machine translation engine for encyclopedic articles, or maybe
> even for several different styles of encyclopedic articles (biography,
> science, history, etc.). For now what we have is a very general-purpose
> consumer-oriented engine. I hope it changes in the future.
