On Sat, Jul 27, 2013 at 10:18 AM, David Cuenca dacuetu@gmail.com wrote:
Scott, "edit and maintain" parallelism sounds wonderful on paper, until you want to implement it and then you realize that you have to freeze changes both in the source text and in the target language for it to happen, which is, IMHO against the very nature of wikis.
Certainly not. As you yourself linked, there are 'fuzzy annotation' tools and other techniques. Changing one word in one language shouldn't invalidate the entire parallelism. And the beauty of the statistical approach is that, if the changes are minor, you can still treat the changed copy as a 'roughly parallel' text. After all, if I just replaced 'white' with 'pale', it doesn't necessarily mean that the translation 'blanco' is invalid. In fact, it adds more data points.
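To make that concrete, here is a minimal sketch in Python of the "roughly parallel" idea; the similarity threshold and the example sentences are purely illustrative, not drawn from any real tool or corpus:

# Decide whether an edited source sentence is still close enough to the
# original that its stored translation can be kept as (fuzzy) parallel data.
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # arbitrary cut-off for what counts as a "minor" edit

def still_roughly_parallel(old_source, new_source):
    ratio = SequenceMatcher(None, old_source, new_source).ratio()
    return ratio >= FUZZY_THRESHOLD

old = "The white horse crossed the river."
new = "The pale horse crossed the river."
translation = "El caballo blanco cruzó el río."

if still_roughly_parallel(old, new):
    # Keep (new, translation) as a roughly-parallel pair; it still contributes
    # statistical evidence instead of being thrown away.
    print("keep as roughly parallel:", (new, translation))
else:
    print("flag for re-translation")

The point is simply that a small edit degrades the alignment gracefully instead of invalidating it.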
My main point was just that there is a chicken-and-egg problem here. You assume that machine translation can't work because we don't have enough parallel texts. But, to the extent that machine-aided translation of WP is successful, it creates a large amount of parallel text. I agree that there are challenges. I simply disagree, as a matter of logic, with the blanket dismissal of the chickens because there aren't yet any eggs.
The Translate extension already does that, in a way. I see it as useful only for texts that act as a central hub for translations, like official communications. If that were to happen for all kinds of content, you would have to sacrifice the plurality of letting each wiki do its own version.
I think you're attributing the faults of a single implementation/UX to the technique as a whole. (Which is why I felt that "step 1" should be to create better tools for maintaining information about parallel structures in the wikidata.)
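For instance, here's a sketch of the kind of record such a tool might maintain (all field names are hypothetical):

# A minimal sketch of a parallel-structure annotation record, so that
# alignments survive edits on either side instead of having to be frozen
# or discarded.
from dataclasses import dataclass

@dataclass
class SentenceAlignment:
    source_page: str        # e.g. "en:Horse"
    source_revision: int    # revision id the alignment was made against
    source_sentence: str
    target_page: str        # e.g. "es:Caballo"
    target_revision: int
    target_sentence: str
    fuzzy: bool = False     # set once either side has drifted slightly

# When a new revision arrives, the alignment is re-scored (e.g. with the
# similarity check sketched above) and marked fuzzy rather than deleted.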
The most popular statistics-based machine translation system built its engine using texts extracted from *the whole internet*; it requires
...but its genesis used the UN corpora *only*. And in fact, the last paper I read (and please correct me if I'm wrong; I'm a dilettante, not an expert) still claimed that the UN parallel text corpora were orders of magnitude more useful than the "whole internet" data, because they had more reliable parallelism and were produced by careful translators.
This is what WP has the potential to be.
huge processing power, and that is without mentioning the resources that went into research and development. Even with all those resources, they managed to create a system that only sort of works. Wikipedia doesn't have enough text or resources to follow that route, and the target number of languages is even higher.
In a world with an active Moore's law, WP *does* have the computing power to approximate this effort. Again, the beauty of the statistical approach is that it scales.
Of course, statistics-based approaches should be used as well (point 8 of the proposed workflow), though more as a supporting technology than as the main one.
I'm sure we can agree to disagree here. Probably our main difference is in how we answer the question, "Where should we start work?" I think annotating parallel texts is the most interesting research question ("research" because I agree that wiki editing by volunteers makes the UX problem nontrivial). I think your suggestion is to start work on the "semantic multilingual dictionary"?
I appreciate that you took time to read the proposal :)
And I certainly appreciate your effort to write the proposal and to work on the topic! --scott
ps. note that the inter-language links in the sidebar of wikipedia articles already constitute a very interesting corpus of noun translations. I don't think this dataset is currently fully exploited.
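As a rough sketch (in Python, assuming the standard MediaWiki API's prop=langlinks query; the function name, User-Agent string, and example title are only placeholders), harvesting those links as title-translation pairs could look like this:

# Fetch the inter-language links for one English Wikipedia article and return
# them as {language code: article title in that language}.
import requests

API = "https://en.wikipedia.org/w/api.php"

def title_translations(title):
    params = {
        "action": "query",
        "format": "json",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
    }
    headers = {"User-Agent": "parallel-corpus-sketch/0.1"}  # placeholder UA
    data = requests.get(API, params=params, headers=headers).json()
    pairs = {}
    for page in data["query"]["pages"].values():
        for link in page.get("langlinks", []):
            pairs[link["lang"]] = link["*"]
    return pairs

print(title_translations("Horse"))

Run over every article, that alone would already yield a sizeable multilingual dictionary of titles.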