[Wikimedia-l] [Wikitech-l] Collaborative machine translation for Wikipedia -- proposed strategy

David Cuenca dacuetu at gmail.com
Sat Jul 27 14:18:08 UTC 2013


On Fri, Jul 26, 2013 at 11:30 PM, C. Scott Ananian
<cananian at wikimedia.org>wrote:

> This statement seems rather defeatist to me.  Step one of a machine
> translation effort should be to provide tools to annotate parallel texts in
> the various wikis, and to edit and maintain their parallelism.


Scott, "edit and maintain" parallelism sounds wonderful on paper, until you
want to implement it and then you realize that you have to freeze changes
both in the source text and in the target language for it to happen, which
is, IMHO against the very nature of wikis.
Translate:Extension already does that in a way. I see it useful only for
texts acting as a central hub for translations, like official
communication. If that were to happen for all kind of content you would
have to sacrifice the plurality of letting each wiki to do their own
version.


> Once this
> is done, you have a substantial parallel corpora, which is then suitable to
> grow the set of translated articles.  That is, minority languages ought to
> be accounted for by progressively expanding the number of translated
> articles in their encyclopedia, as we do now.  As this is done, machine
> translation incrementally improves.


The most popular statistics-based machine translation system built its engine
from texts extracted from *the whole internet*; it requires huge processing
power, and that is without counting the resources that went into research and
development. Even with all those resources, they managed to create a system
that only sort of works.
Wikipedia has neither enough text nor enough resources to follow that route,
and its target number of languages is even higher.
Of course, statistics-based approaches should also be used (point 8 of the
proposed workflow), but more as a supporting technology than as the main one.
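For the sake of illustration, here is a toy co-occurrence count over a made-up
three-sentence "parallel corpus" (not a real SMT system; real ones use the IBM
models with EM and orders of magnitude more data). It shows why the approach
lives or dies by corpus size: the probabilities are nothing but relative
counts, and with little text the evidence is far too sparse.

from collections import Counter
from itertools import product

parallel_corpus = [
    ("the house", "la casa"),
    ("the red house", "la casa roja"),
    ("the book", "el libro"),
]

cooc = Counter()
src_counts = Counter()
for src, tgt in parallel_corpus:
    for s, t in product(src.split(), tgt.split()):
        cooc[(s, t)] += 1
        src_counts[s] += 1

def translation_prob(s: str, t: str) -> float:
    """Crude co-occurrence estimate of P(t | s)."""
    return cooc[(s, t)] / src_counts[s] if src_counts[s] else 0.0

print(translation_prob("house", "casa"))  # 0.4  -- weak evidence even for a correct pair
print(translation_prob("red", "roja"))    # 0.33 -- one sentence is all the support it has

Only at web scale do these counts become sharp enough to be useful, which is
precisely the scale we do not have.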


> If there is not enough of an editor
> community to translate articles, I don't see how you will succeed in the
> much more technically-demanding tasks of creating rules for a rule-based
> translation system.  The beauty of the statistical approach is that little
> special ability is needed.


A single researcher can create working transfer rules for a language pair in
three months or less when there is previous work to build on (see these GSoC
projects [1], [2], [3]). Whatever problem the translation has, it can be
understood and corrected. With statistics, you rely on bulk numbers and on the
hope that you have enough coverage, which makes fixing its defects even harder.
It is true that writing transfer rules is technically demanding, but so is
writing MediaWiki software, which keeps being developed anyway. After seeing
how their system works, I think there is room for simplifying transfer rules
(first storing them as MediaWiki templates, then as linked data, then adding a
user interface). That could lower the entry barrier for linguists and
translators alike, while enabling the triangulation of rules between pairs
that share a common one.
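To give an idea of what I mean, here is a rough sketch (purely illustrative;
this is not Apertium's actual rule format nor an existing extension) of
transfer rules stored as structured data, plus a draft rule for a new pair
triangulated through a common pivot language.

from dataclasses import dataclass

@dataclass(frozen=True)
class TransferRule:
    pair: tuple          # (source_lang, target_lang)
    pattern: tuple       # sequence of part-of-speech tags to match
    output_order: tuple  # indices into the matched pattern, in target order

# English adjective-noun becomes Spanish noun-adjective; Spanish keeps
# noun-adjective when going to Catalan (identity reordering).
rules = {
    ("en", "es"): [TransferRule(("en", "es"), ("adj", "n"), (1, 0))],
    ("es", "ca"): [TransferRule(("es", "ca"), ("n", "adj"), (0, 1))],
}

def triangulate(rules, src, pivot, tgt):
    """Seed draft src->tgt rules by chaining src->pivot and pivot->tgt reorderings."""
    drafts = []
    for r1 in rules.get((src, pivot), []):
        for r2 in rules.get((pivot, tgt), []):
            # Only chain when the pivot-side pattern produced by r1
            # is what r2 expects to match.
            pivot_pattern = tuple(r1.pattern[i] for i in r1.output_order)
            if pivot_pattern == r2.pattern:
                combined = tuple(r1.output_order[i] for i in r2.output_order)
                drafts.append(TransferRule((src, tgt), r1.pattern, combined))
    return drafts

# Drafts an en->ca rule from en->es and es->ca, to be reviewed by a human.
print(triangulate(rules, "en", "es", "ca"))

If the rules live as data like this (in templates or as linked data), a
linguist could add or review them through a form instead of writing code, and
the triangulation step gives new pairs a starting point rather than a blank page.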

As said before, there is no single tool that can do everything; it is the
combination of them that will bring the best results. The good thing is that
there is no need to "marry" one technology: several can be developed in
parallel and brought to a point of convergence where they work together for
optimal results.

I appreciate that you took time to read the proposal :)

Thanks,
David

[1]
http://www.google-melange.com/gsoc/project/google/gsoc2013/akindalki/3001
[2]
http://www.google-melange.com/gsoc/project/google/gsoc2013/jcentelles/20001
[3]
http://www.google-melange.com/gsoc/project/google/gsoc2013/jonasfromseier/5001

