On Sat, Jul 27, 2013 at 10:18 AM, David Cuenca <dacuetu(a)gmail.com> wrote:
> Scott, "edit and maintain" parallelism sounds wonderful on paper, until
> you want to implement it and then you realize that you have to freeze
> changes both in the source text and in the target language for it to
> happen, which is, IMHO, against the very nature of wikis.
Certainly not. As you yourself linked, there are 'fuzzy annotation' tools
and other techniques. Just because one word in one language is changed,
that shouldn't invalidate the entire parallelism. And the beauty of the
statistical approach is that, if the changes are minor, you can still view
the changed copy as a 'roughly parallel' text. After all, if I just
replaced 'white' with 'pale', it doesn't necessarily mean that the
translation 'blanco' is invalid. In fact, it adds more data points.
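To make the 'roughly parallel' idea concrete, here's a minimal sketch (my own illustration, not any existing tool) of how a fuzzy-annotation tool might score whether an edit invalidates an existing sentence alignment; the threshold and the use of difflib are illustrative assumptions:

```python
# Sketch: decide whether an edit to the source sentence invalidates its
# existing alignment with a translation. Threshold value is an assumption.
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.7  # assumed cutoff for "roughly parallel"

def alignment_status(old_source: str, new_source: str) -> str:
    """Compare the edited source sentence against the version that was
    originally aligned with the translation."""
    score = SequenceMatcher(None, old_source, new_source).ratio()
    if score == 1.0:
        return "exact"   # alignment still fully valid
    if score >= FUZZY_THRESHOLD:
        return "fuzzy"   # keep the pair as roughly parallel, flag for review
    return "stale"       # too much changed; needs re-translation

print(alignment_status("the white house", "the white house"))  # exact
print(alignment_status("the white house", "the pale house"))   # fuzzy
print(alignment_status("the white house", "something else"))   # stale
```

The point is that the 'white'→'pale' edit only demotes the pair from "exact" to "fuzzy"; the pair still contributes data rather than being thrown away.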
My main point was just that there is a chicken-and-egg problem here. You
assume that machine translation can't work because we don't have enough
parallel texts. But, to the extent that machine-aided translation of WP is
successful, it creates a large amount of parallel text. I agree that
there are challenges. I simply disagree, as a matter of logic, with the
blanket dismissal of the chickens because there aren't yet any eggs.
> Translate:Extension already does that in a way. I see it useful only for
> texts acting as a central hub for translations, like official
> communication. If that were to happen for all kinds of content you would
> have to sacrifice the plurality of letting each wiki do its own version.
I think you're attributing the faults of a single implementation/UX to the
technique as a whole. (Which is why I felt that "step 1" should be to
create better tools for maintaining information about parallel structures
in the wikidata.)
> The most popular statistical-based machine translation system has
> created its engine using texts extracted from *the whole internet*, it
> requires
...but its genesis used the UN corpora *only*. And in fact, the last paper
I read (and please correct me if I'm wrong; I'm a dilettante, not an
expert) still claimed that the UN parallel text corpora were orders of
magnitude more useful than the "whole internet data", because they had
more reliable parallelism and were produced by careful translators.

This is what WP has the potential to be.
> huge processing power, and that without mentioning the amount of
> resources that went into research and development. And having all those
> resources they managed to create a system that sort of works.
> Wikipedia doesn't have enough text or resources to follow that route,
> and the target number of languages is even higher.
In a world with an active Moore's law, WP *does* have the computing power
to approximate this effort. Again, the beauty of the statistical approach
is that it scales.
> Of course statistical-based approaches should also be used (point 8 of
> the proposed workflow), however more as a supporting technology rather
> than the main one.
I'm sure we can agree to disagree here. Probably our main differences are
in our answers to the question, "Where should we start work?" I think
annotating parallel texts is the most interesting research question
("research" because I agree that wiki editing by volunteers makes the UX
problem nontrivial). I think your suggestion is to start work on the
"semantic multilingual dictionary"?
> I appreciate that you took time to read the proposal :)
And I certainly appreciate your effort to write the proposal and to work on
the topic!
--scott
ps. note that the inter-language links in the sidebar of wikipedia articles
already comprise a very interesting corpus of noun translations. I don't
think this dataset is currently exploited fully.
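As a rough sketch of what I mean (my own illustration; the sample triples below are hand-made, though real ones could come from the MediaWiki API's langlinks property or the langlinks SQL dumps):

```python
# Sketch: collapsing inter-language link triples into a simple
# noun-translation lexicon. Sample data is invented for illustration.

# (source_title, language_code, target_title), roughly as a langlinks
# table provides them for English Wikipedia articles:
LANGLINKS = [
    ("House", "es", "Casa"),
    ("House", "de", "Haus"),
    ("Dog",   "es", "Perro"),
    ("Dog",   "de", "Hund"),
]

def build_lexicon(links, target_lang):
    """Collapse langlink triples into an en -> target_lang dictionary."""
    return {src: dst for src, lang, dst in links if lang == target_lang}

es = build_lexicon(LANGLINKS, "es")
print(es["House"])  # Casa
print(es["Dog"])    # Perro
```

Even this naive pairing of article titles already yields a sizeable bilingual noun dictionary, which is the sense in which the sidebar links are an underexploited corpus.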
--
(http://cscott.net)