On Sat, Jul 27, 2013 at 10:18 AM, David Cuenca dacuetu@gmail.com wrote:
Scott, "edit and maintain" parallelism sounds wonderful on paper, until you want to implement it and then you realize that you have to freeze changes both in the source text and in the target language for it to happen, which is, IMHO against the very nature of wikis.
Certainly not. As you yourself linked, there are 'fuzzy annotation' tools and other techniques. Changing one word in one language shouldn't invalidate the entire parallelism. And the beauty of the statistical approach is that, if the changes are minor, you can still treat the changed copy as a 'roughly parallel' text. After all, if I just replaced 'white' with 'pale', it doesn't necessarily mean that the translation 'blanco' is invalid. In fact, it adds more data points.
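To make that concrete, here is a minimal sketch in Python of the "roughly parallel" idea; the similarity threshold and the example sentences are purely illustrative, not drawn from any real tool or corpus:

# Decide whether an edited source sentence is still close enough to the
# original that its stored translation can be kept as (fuzzy) parallel data.
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # arbitrary cut-off for what counts as a "minor" edit

def still_roughly_parallel(old_source, new_source):
    ratio = SequenceMatcher(None, old_source, new_source).ratio()
    return ratio >= FUZZY_THRESHOLD

old = "The white horse crossed the river."
new = "The pale horse crossed the river."
translation = "El caballo blanco cruzó el río."

if still_roughly_parallel(old, new):
    # Keep (new, translation) as a roughly-parallel pair; it still contributes
    # statistical evidence instead of being thrown away.
    print("keep as roughly parallel:", (new, translation))
else:
    print("flag for re-translation")

The point is simply that a small edit degrades the alignment gracefully instead of invalidating it.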
My main point was just that there is a chicken-and-egg problem here. You assume that machine translation can't work because we don't have enough parallel texts. But, to the extent that machine-aided translation of WP is successful, it creates a large amount of parallel text. I agree that there are challenges. I simply disagree, as a matter of logic, with the blanket dismissal of the chickens because there aren't yet any eggs.
The Translate extension already does that, in a way. I see it as useful only for texts that act as a central hub for translations, like official communications. If that were to happen for all kinds of content, you would have to sacrifice the plurality of letting each wiki do its own version.
I think you're attributing the faults of a single implementation/UX to the technique as a whole. (Which is why I felt that "step 1" should be to create better tools for maintaining information about parallel structures in the wikidata.)
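For instance, here's a sketch of the kind of record such a tool might maintain (all field names are hypothetical):

# A minimal sketch of a parallel-structure annotation record, so that
# alignments survive edits on either side instead of having to be frozen
# or discarded.
from dataclasses import dataclass

@dataclass
class SentenceAlignment:
    source_page: str        # e.g. "en:Horse"
    source_revision: int    # revision id the alignment was made against
    source_sentence: str
    target_page: str        # e.g. "es:Caballo"
    target_revision: int
    target_sentence: str
    fuzzy: bool = False     # set once either side has drifted slightly

# When a new revision arrives, the alignment is re-scored (e.g. with the
# similarity check sketched above) and marked fuzzy rather than deleted.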
The most popular statistics-based machine translation system built its engine using texts extracted from *the whole internet*; it requires
...but its genesis used the UN corpora *only*. And in fact, the last paper I read (and please correct me if I'm wrong; I'm a dilettante, not an expert) still claimed that the UN parallel text corpora were orders of magnitude more useful than the "whole internet" data, because they had more reliable parallelism and were produced by careful translators.
This is what WP has the potential to be.
huge processing power, and that is without mentioning the resources that went into research and development. Even with all those resources, they managed to create a system that only sort of works. Wikipedia doesn't have enough text or resources to follow that route, and the target number of languages is even higher.
In a world with an active Moore's law, WP *does* have the computing power to approximate this effort. Again, the beauty of the statistical approach is that it scales.
Of course, statistics-based approaches should be used as well (point 8 of the proposed workflow), though more as a supporting technology than as the main one.
I'm sure we can agree to disagree here. Probably our main difference is in how we answer the question, "Where should we start work?" I think annotating parallel texts is the most interesting research question ("research" because I agree that wiki editing by volunteers makes the UX problem nontrivial). I think your suggestion is to start work on the "semantic multilingual dictionary"?
I appreciate that you took time to read the proposal :)
And I certainly appreciate your effort to write the proposal and to work on the topic! --scott
ps. note that the inter-language links in the sidebar of wikipedia articles already constitute a very interesting corpus of noun translations. I don't think this dataset is currently fully exploited.
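As a rough sketch (in Python, assuming the standard MediaWiki API's prop=langlinks query; the function name, User-Agent string, and example title are only placeholders), harvesting those links as title-translation pairs could look like this:

# Fetch the inter-language links for one English Wikipedia article and return
# them as {language code: article title in that language}.
import requests

API = "https://en.wikipedia.org/w/api.php"

def title_translations(title):
    params = {
        "action": "query",
        "format": "json",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
    }
    headers = {"User-Agent": "parallel-corpus-sketch/0.1"}  # placeholder UA
    data = requests.get(API, params=params, headers=headers).json()
    pairs = {}
    for page in data["query"]["pages"].values():
        for link in page.get("langlinks", []):
            pairs[link["lang"]] = link["*"]
    return pairs

print(title_translations("Horse"))

Run over every article, that alone would already yield a sizeable multilingual dictionary of titles.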