On Fri, Jul 26, 2013 at 11:30 PM, C. Scott Ananian <cananian@wikimedia.org> wrote:
> This statement seems rather defeatist to me. Step one of a machine translation effort should be to provide tools to annotate parallel texts in the various wikis, and to edit and maintain their parallelism.
Scott, "edit and maintain" parallelism sounds wonderful on paper, until you try to implement it and realize that you have to freeze changes in both the source text and the target language for it to happen, which is, IMHO, against the very nature of wikis. The Translate extension already does that in a way. I see it as useful only for texts acting as a central hub for translations, like official communication. If that were to happen for all kinds of content, you would have to sacrifice the plurality of letting each wiki do its own version.
> Once this is done, you have a substantial parallel corpus, which is then suitable to grow the set of translated articles. That is, minority languages ought to be accounted for by progressively expanding the number of translated articles in their encyclopedia, as we do now. As this is done, machine translation incrementally improves.
The most popular statistical machine translation system built its engine from texts extracted from *the whole internet*, requires huge processing power, and that is without mentioning the amount of resources that went into research and development. With all those resources, they managed to create a system that sort of works. Wikipedia has neither enough text nor enough resources to follow that route, and its target number of languages is even higher. Of course, statistical approaches should be used as well (point 8 of the proposed workflow), but more as a supporting technology than as the main one.
> If there is not enough of an editor community to translate articles, I don't see how you will succeed in the much more technically-demanding task of creating rules for a rule-based translation system. The beauty of the statistical approach is that little special ability is needed.
A single researcher can create working transfer rules for a language pair in 3 months or less when there is previous work to build on (see these GSoC projects [1], [2], [3]). Whatever problems the translation has, they can be understood and corrected. With statistics, you rely on bulk numbers and on the hope that you have enough coverage, which makes fixing its defects even harder. It is true that writing transfer rules is technically demanding, but so is writing MediaWiki software, which keeps being developed anyway. After seeing how their system works, I think there is room for simplifying transfer rules (first storing them as MediaWiki templates, then as linked data, then behind a user interface). That could lower the entry barrier for linguists and translators alike, while enabling the triangulation of rules between language pairs that share a common language.
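To make "transfer rule" concrete: a minimal sketch of shallow transfer in the spirit of rule-based systems like Apertium. The toy lexicon, tags, and function names here are illustrative assumptions, not any project's actual API; real systems use far richer morphological tags and rule formalisms.

```python
# Sketch of shallow-transfer MT: lexical transfer (word-for-word
# replacement) followed by one structural transfer rule that reorders
# noun + adjective (Spanish order) into adjective + noun (English order).
# The lexicon and tag set are toy examples, not a real system's data.

# Each token is (surface form, part-of-speech tag).
BILINGUAL_LEXICON = {
    ("la", "det"): "the",
    ("casa", "n"): "house",
    ("blanca", "adj"): "white",
}

def lexical_transfer(tokens):
    """Replace each tagged source word with its target-language equivalent."""
    return [(BILINGUAL_LEXICON[(word, tag)], tag) for (word, tag) in tokens]

def structural_transfer(tokens):
    """One transfer rule: swap a noun followed by an adjective."""
    out = []
    i = 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and tokens[i][1] == "n"
                and tokens[i + 1][1] == "adj"):
            # Rule matched: emit adjective before noun.
            out.append(tokens[i + 1])
            out.append(tokens[i])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def translate(tokens):
    transferred = structural_transfer(lexical_transfer(tokens))
    return " ".join(word for (word, _tag) in transferred)

print(translate([("la", "det"), ("casa", "n"), ("blanca", "adj")]))
# -> the white house
```

The point of the proposal is that a rule like `structural_transfer` is small, inspectable, and fixable by a person who spots a bad translation, which is exactly what storing rules as templates or linked data would expose to editors.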
As said before, there is no single tool that can do everything; it is their combination that will bring the best results. The good thing is that there is no need to "marry" one technology: several can be developed in parallel and brought to a point of convergence where they work together for optimal results.
I appreciate that you took time to read the proposal :)
Thanks, David
[1] http://www.google-melange.com/gsoc/project/google/gsoc2013/akindalki/3001
[2] http://www.google-melange.com/gsoc/project/google/gsoc2013/jcentelles/20001
[3] http://www.google-melange.com/gsoc/project/google/gsoc2013/jonasfromseier/50...