On Fri, Jul 26, 2013 at 11:30 PM, C. Scott Ananian
<cananian(a)wikimedia.org> wrote:
> This statement seems rather defeatist to me. Step one of a machine
> translation effort should be to provide tools to annotate parallel texts
> in the various wikis, and to edit and maintain their parallelism.
Scott, "edit and maintain" parallelism sounds wonderful on paper, until you
want to implement it and realize that you have to freeze changes in both
the source text and the target language for it to happen, which is, IMHO,
against the very nature of wikis.
The Translate extension already does that, in a way. I see it as useful
only for texts that act as a central hub for translations, like official
communications. If that were to happen for all kinds of content, you would
have to sacrifice the plurality of letting each wiki do its own version.
> Once this is done, you have a substantial parallel corpora, which is then
> suitable to grow the set of translated articles. That is, minority
> languages ought to be accounted for by progressively expanding the number
> of translated articles in their encyclopedia, as we do now. As this is
> done, machine translation incrementally improves.
The most popular statistical machine translation system built its engine
using texts extracted from *the whole internet*; it requires huge
processing power, and that is without mentioning the resources that went
into research and development. Even with all those resources, they managed
to create a system that only sort of works.
Wikipedia has neither enough text nor the resources to follow that route,
and its target number of languages is even higher.
Of course, statistical approaches should also be used (point 8 of the
proposed workflow), though more as a supporting technology than as the
main one.
> If there is not enough of an editor community to translate articles, I
> don't see how you will succeed in the much more technically-demanding
> tasks of creating rules for a rule-based translation system. The beauty
> of the statistical approach is that little special ability is needed.
A single researcher can create working transfer rules for a language pair
in three months or less if there is previous work to build on (see these
GSoC projects: [1], [2], [3]). Whatever problems the translation has can be
understood and corrected. With statistics, you rely on bulk numbers and on
the hope that you have enough coverage, which makes fixing its defects even
harder.
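To make "transfer rules" concrete, here is a minimal sketch in Python (not
Apertium's actual rule formalism; the function and data are illustrative)
of a shallow transfer rule that reorders an adjective-noun pair. The point
is that such a rule is a small, inspectable artifact: when it misfires, a
person can read it and fix it.

```python
def transfer(tagged_words):
    """Apply one hand-written reordering rule: ADJ NOUN -> NOUN ADJ.

    Words are (lemma, part-of-speech) pairs, e.g. ("red", "adj").
    This mimics translating English word order into a Romance-language
    order, where adjectives typically follow the noun.
    """
    out = []
    i = 0
    while i < len(tagged_words):
        # If an adjective is immediately followed by a noun, swap them.
        if (i + 1 < len(tagged_words)
                and tagged_words[i][1] == "adj"
                and tagged_words[i + 1][1] == "n"):
            out.append(tagged_words[i + 1])
            out.append(tagged_words[i])
            i += 2
        else:
            out.append(tagged_words[i])
            i += 1
    return out

print(transfer([("red", "adj"), ("car", "n")]))
# -> [('car', 'n'), ('red', 'adj')]
```

A statistical system encodes the same preference implicitly, spread across
millions of parameters, which is exactly why a specific error is harder to
locate and correct there.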
It is true that writing transfer rules is technically demanding, but so is
writing MediaWiki software, which keeps being developed anyway. After
seeing how their system works, I think there is room for simplifying
transfer rules (first storing them as MediaWiki templates, then as linked
data, then adding a user interface). That could lower the entry barrier for
linguists and translators alike, while enabling the triangulation of rules
between pairs that share a common language.
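The triangulation idea can be sketched in a few lines. Assuming (purely for
illustration, with made-up data) that bilingual correspondences exist for
the pairs A->B and B->C, a draft A->C mapping can be composed through the
shared language B:

```python
def triangulate(a_to_b, b_to_c):
    """Compose two bilingual dictionaries through their common language.

    Produces a draft A->C dictionary for every A entry whose B-side
    translation also appears in the B->C dictionary.
    """
    return {a: b_to_c[b] for a, b in a_to_b.items() if b in b_to_c}

# Illustrative toy data: Spanish->English and English->French entries.
es_en = {"coche": "car", "rojo": "red"}
en_fr = {"car": "voiture", "red": "rouge"}

print(triangulate(es_en, en_fr))
# -> {'coche': 'voiture', 'rojo': 'rouge'}
```

The composed pair would of course only be a starting draft for human
review, but it means a new language does not have to start every pairing
from zero.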
As said before, no single tool can do everything; it is the combination of
them that will bring the best results. The good thing is that there is no
need to "marry" one technology: several can be developed in parallel and
brought to a point of convergence where they work together for optimal
results.
I appreciate that you took time to read the proposal :)
Thanks,
David
[1]
http://www.google-melange.com/gsoc/project/google/gsoc2013/akindalki/3001
[2]
http://www.google-melange.com/gsoc/project/google/gsoc2013/jcentelles/20001
[3]
http://www.google-melange.com/gsoc/project/google/gsoc2013/jonasfromseier/5…