[Wikimedia-l] [Wikitech-l] Collaborative machine translation for Wikipedia -- proposed strategy

Samuel Klein meta.sj at gmail.com
Sat Jul 27 18:00:21 UTC 2013


David - thanks for this proposal; it is something that deserves
attention, and our projects are already used as one of the raw sources
for machine translation efforts.

On Sat, Jul 27, 2013 at 10:18 AM, David Cuenca <dacuetu at gmail.com> wrote:
> On Fri, Jul 26, 2013 at 11:30 PM, C. Scott Ananian
> <cananian at wikimedia.org>wrote:
>
>> Step one of a machine
>> translation effort should be to provide tools to annotate parallel texts in
>> the various wikis, and to edit and maintain their parallelism.

I agree with most of Scott's input here.

> Scott, "edit and maintain" parallelism sounds wonderful on paper, until you
> want to implement it and then you realize that you have to freeze changes
> both in the source text and in the target language for it to happen, which
> is, IMHO, against the very nature of wikis.

You don't need to freeze changes - you need permalinks to revisions,
the ability to track linkages between [sentences] in rev A.n in
language A and those in rev B.m in language B, and three-way diffs.
All are tractable problems.
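
To make that concrete, here is a minimal sketch in Python of the
linkage record such a tool could keep (every name here is
hypothetical, not an existing MediaWiki API):

from dataclasses import dataclass

@dataclass(frozen=True)
class SentenceLink:
    """Links a sentence in rev A.n (language A) to one in rev B.m
    (language B), by permanent revision id rather than page title."""
    src_page: str   # e.g. "en:Sino-Soviet_split"
    src_rev: int    # permalink: revision id in language A
    src_index: int  # sentence position within that revision
    dst_page: str
    dst_rev: int    # permalink: revision id in language B
    dst_index: int

def stale_links(links, latest_rev_of):
    """Links whose source page has moved past the linked revision;
    a three-way diff (linked rev, current source, current target)
    would then show translators exactly what has drifted."""
    return [l for l in links if latest_rev_of(l.src_page) != l.src_rev]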

> Translate:Extension already does that in a way. I see it useful only for
> texts acting as a central hub for translations, like official
> communication. If that were to happen for all kinds of content you would
> have to sacrifice the plurality of letting each wiki do its own version.

Allowing for a plurality of versions is useful.  There's no special
reason to break this out by language (if anything, there should be one
version per major cultural group - groups with different definitions
of reliable sources, for instance - not per language).  We should
separate "plurality of branches of a document" from "synchronizing
translations of a given branch" where a single branch of a document
should be available in any language.

For instance, I may want to read a French translation of the "Russian
WP" version of articles related to the Sino-Soviet war, in English --
in addition to the "Japanese WP" version, and the native "French WP"
version.  We can reduce the difficulty of translating each branch by
noting their shared similarities -- especially if we track the
revision at which each branched from, or rebased to, a shared trunk.
Allowing translators to automatically capture the source-revision when
carrying out an update via translation, per-page or per-section, would
make this easier.
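
A rough sketch of the bookkeeping this implies (field and function
names are invented for illustration):

from dataclasses import dataclass

@dataclass
class TranslationRecord:
    branch: str      # which branch this translation follows, e.g. "ru-wp"
    section_id: str  # per-page or per-section granularity
    source_rev: int  # source revision, captured automatically at save time
    target_rev: int  # resulting revision in the target language

def revisions_to_review(record, source_history):
    """Source revisions newer than the one last translated from --
    exactly what a translator needs to look at on the next update."""
    return [rev for rev in source_history if rev > record.source_rev]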

> The most popular statistics-based machine translation system has built
> its engine using texts extracted from *the whole internet*; it requires
> huge processing power, and that is without mentioning the amount of resources [...]

One can do better, with less computing power, using parallel corpora.
WP and Wikisource provide some of the closest things we have to a
collection of parallel corpora -- anything we can do to further
clarify how parallel these documents are, and to improve their
parallelism, will greatly improve [free] machine translation tools.
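
As a toy illustration (the data below is invented): once editors have
marked which sentences correspond, aligned training pairs for a
statistical system fall out directly, with no noisy automatic
alignment step:

def aligned_pairs(src_sentences, dst_sentences, links):
    """links: (src_index, dst_index) pairs marked by editors."""
    for i, j in links:
        yield src_sentences[i], dst_sentences[j]

en = ["The split began in the late 1950s.",
      "Border clashes followed in 1969."]
fr = ["La rupture a commencé à la fin des années 1950.",
      "Des affrontements frontaliers ont suivi en 1969."]
for pair in aligned_pairs(en, fr, [(0, 0), (1, 1)]):
    print(pair)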

> Of course statistical approaches should be used as well (point 8
> of the proposed workflow), though more as a supporting technology than
> the main one.

+1

> A single researcher can create working transfer rules for a language pair
> in 3 months or less if there is previous work (see these GSoC projects [1], [2],
> [3]). Whichever problem the translation has, it can be understood and
> corrected...  [and] lower the entry barrier for linguists and translators alike,

Right.  It's much easier to get a rules-based system close enough to
be useful to human translators -- speeding up their work and lowering
the entry barrier for someone to start translating -- than to do a
complete job with rules alone.
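
A toy example of what such a shallow transfer rule looks like (the
lexicon, tags, and rule here are invented, not taken from any existing
system or the GSoC work cited above): reorder adjective-noun pairs for
English-to-French, translate word by word, and flag gaps for a human
post-editor:

LEXICON = {"the": "la", "white": "blanche", "house": "maison"}
POS = {"the": "det", "white": "adj", "house": "noun"}

def transfer_en_fr(words):
    out = list(words)
    i = 0
    while i < len(out) - 1:
        # transfer rule: English "ADJ NOUN" -> French "NOUN ADJ"
        if POS.get(out[i]) == "adj" and POS.get(out[i + 1]) == "noun":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    # word-for-word lookup; *word* marks untranslated gaps
    return " ".join(LEXICON.get(w, f"*{w}*") for w in out)

print(transfer_en_fr(["the", "white", "house"]))  # -> la maison blanche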

> that there is no need to "marry" a technology; several can be developed in
> parallel and brought to a point of convergence where they work together

+10

Warmly,
SJ


