[Wikimedia-l] machine translation

2 May 2017


      2017-05-02 18:20 GMT+03:00 John Erling Blad jeblad@gmail.com:
...
Brute force solution; turn the ContentTranslation off. Really stupid
solution.
... Then I guess you don't mind that I'm changing the thread name :)
...
The next solution; turn the Yandex engine off. That would solve a
part of the problem. Kind of lousy solution though.
...
What about adding a language model that warns when the language constructs
get too weird? It is like a "test" for the translation. The CT is used for
creating a translation, but the language model is used for verifying if the
translation is good enough. If it does not validate against the language
model it should simply not be published to the main name space. It will
still be possible to create a draft, but then the user is completely aware
that the translation isn't good enough.
Such a language model should be available as a test for any article, as it
can be used as a quality measure for the article. It is really a quantity
measure for the well-spokenness of the article, but that isn't quite so
intuitive.
So, I'll allow myself to guess that you are talking about one particular
language, probably Norwegian.
Several technical facts:
1. In the past there were several cases in which translators into different
languages reported common translation mistakes to me. I passed them on
to Yandex developers, with whom I communicate quite regularly. They
acknowledged receiving all of them. I am aware of at least one such common
mistake that was fixed; possibly there were more. If you can give me a list
of such mistakes for Norwegian, I'll be very happy to pass them on. I
absolutely cannot promise that they will be fixed upstream, but it's
possible.
2. In Norwegian, Apertium is used for translating between the two varieties
of Norwegian itself (Bokmål and Nynorsk), and from other Scandinavian
languages. That's probably why it works so well—they are similar in
grammar, vocabulary, and narrative style (I'll pass it on to Apertium
developers—I'm sure they'll be happy to hear it). Unfortunately, machine
translation from English is not available in Apertium. Apertium works best
with very similar languages, and English has two characteristics, which are
unfortunate when combined: it is both the most popular source for
translation into almost all other languages (including Norwegian), and it
is not _very_ similar to any other languages (except maybe Scots). Machine
translation from English into Norwegian is only possible with Yandex at the
moment. More engines may be added in the future, but at the moment that's
all we have. That's why disabling Yandex completely would indeed be a lousy
solution: A lot of people say that without machine translation integration
Content Translation is useless. Not all users think like that, but many do.
3. We can define a numerical threshold for the acceptable percentage of
machine translation post-editing. Currently it's 75%. It's a tad embarrassing
that it's hard-coded at the moment, but it can very easily be made into a
variable per language. If the translator tries to publish a page in which
less than that is modified, a warning will be shown.
4. I'm not sure what you mean by "language model". If it's any kind of a
linguistic engine, then it's definitely not within the resources that the
Language team itself can currently dedicate. However, if somebody who knows
Norwegian and some programming will write a script that analyzes common bad
constructs in a Wikipedia dump, this will be very useful. This would
basically be an upgraded version of suggestion #1 above. (In my spare time
as a volunteer I'm doing something comparable for Hebrew, although not for
translation, but for improving how MediaWiki link trails work.)
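
To make points 3 and 4 a bit more concrete, here is a rough sketch in
Python. It is not how Content Translation actually works. Everything in it
is an assumption made for illustration, except the 75% default: the
per-language table, the "nb" override, the word-level diff used to estimate
how much of the machine translation was modified, and the single
"doubled word" pattern, which only stands in for a real list of bad
Norwegian constructs that somebody who knows the language would have to
supply.

```python
import difflib
import re

# Point 3: a per-language threshold instead of a hard-coded one.
# Only the 75% default comes from the message; the "nb" value is invented.
DEFAULT_THRESHOLD = 0.75
THRESHOLDS = {"nb": 0.80}  # hypothetical per-language override


def modified_share(machine_text, published_text):
    """Estimate the share of the machine translation that was changed.

    Assumption: a word-level diff is a good enough measure. The message
    does not say how Content Translation measures post-editing.
    """
    mt_words = machine_text.split()
    pub_words = published_text.split()
    if not mt_words and not pub_words:
        return 0.0
    matcher = difflib.SequenceMatcher(None, mt_words, pub_words, autojunk=False)
    return 1.0 - matcher.ratio()


def needs_warning(machine_text, published_text, language):
    """Return True when less than the threshold share was modified."""
    threshold = THRESHOLDS.get(language, DEFAULT_THRESHOLD)
    return modified_share(machine_text, published_text) < threshold


# Point 4: count known bad constructs in a piece of text.
# The pattern below is a language-neutral placeholder, not a real list.
BAD_CONSTRUCTS = {
    "doubled word": re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE),
}


def count_bad_constructs(text, patterns=BAD_CONSTRUCTS):
    """Return how many times each named pattern occurs in the text."""
    return {name: len(pattern.findall(text)) for name, pattern in patterns.items()}
```

Running something like count_bad_constructs over the article texts from a
Wikipedia dump (the dump parsing is left out here) would give a list of the
most frequent bad constructs, which is the kind of list I could pass on
upstream.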
