[Wikimedia-l] Human-assisted machine translation (it was: "The case for supporting open source machine translation")

WereSpielChequers werespielchequers at gmail.com
Wed May 1 09:42:34 UTC 2013


Re David's point "One of the biggest problems in MT is word
disambiguation": I'm sure this is correct, though probably more acutely in
some languages than in others. English, for example, has many words with
meanings as diverse as bonnet (headwear, a type of chili and, in some
countries, part of a car) and faggot (meatballs, a bundle of firewood and
a pejorative term).

I'd add that a further problem is that trivial, almost cosmetic errors in
one language can become magnified by machine translation. For example, one
of my contributions to the English Wikipedia is to hunt for the sort of
typos that a spellchecker won't pick up. I've abolished the entire Olympic
sport of synchronised ventriloquism (the throwing of discusses), reduced
the choice of Olympic medals from four to three by eliminating sliver
medals and dramatically reduced the number of actors who have been staring
in various films. My assumption is that after machine translation, pairs
such as staring and starring, cavalry and calvary, or posses and possess
will in most languages other than English come out as very different words.
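
To give a flavour of how mechanical this hunting can be, here is a minimal
sketch in Python of the sort of check involved (the pair list is invented
for illustration; the real bot's lists and heuristics are more elaborate):

    import re

    # Pairs of valid English words that a spellchecker cannot separate:
    # both spellings exist, so only context can say which was intended.
    # Illustrative examples only.
    CONFUSABLE_PAIRS = [
        ("staring", "starring"),
        ("sliver", "silver"),
        ("cavalry", "calvary"),
        ("posses", "possess"),
        ("discusses", "discuses"),
    ]

    def flag_confusables(text):
        """Return (word, offset) hits that need a human decision."""
        hits = []
        for pair in CONFUSABLE_PAIRS:
            for word in pair:
                for m in re.finditer(r"\b%s\b" % re.escape(word), text):
                    hits.append((word, m.start()))
        return sorted(hits, key=lambda hit: hit[1])

    print(flag_confusables("He has been staring in several films."))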

I think that gives this movement three interesting and relatively easy
routes to improve the quality of any machine translation of our projects.
The first would be to roll out the bot I've been using to find easily
confused words to languages other than English. The second would be to
collaborate with any of the existing machine translation providers and ask
them to give us lists of phrases in our work that they find hard to
translate; my hunch is that this would be a good source of anomalies that
need fixing. The third would be word disambiguation: for languages like
English and Portuguese there is significant demand amongst our readers for
a choice as to which language variant they are reading. We could quite
easily offer a US/UK choice for registered editors, and default IP readers
in some parts of the world to the most likely variant. But to do this to a
reasonable standard of quality we would need to disambiguate words such as
bonnet, tap, flat and bear, as sketched below.
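
As a sketch of what that third route might look like mechanically (the
word lists below are invented examples, not a proposal for the real
mapping):

    # Words that can be swapped between variants mechanically.
    SAFE_SWAPS_UK_TO_US = {"colour": "color", "organise": "organize"}

    # Words whose meaning differs with the variant, so a swap is only
    # safe once a human (or a sense annotation) has disambiguated them.
    AMBIGUOUS = {"bonnet", "tap", "flat", "bear"}

    def render_us(tokens):
        """Swap the safe words; collect the ambiguous ones for review."""
        out, needs_review = [], []
        for tok in tokens:
            low = tok.lower()
            if low in SAFE_SWAPS_UK_TO_US:
                out.append(SAFE_SWAPS_UK_TO_US[low])
            else:
                if low in AMBIGUOUS:
                    needs_review.append(tok)
                out.append(tok)
        return " ".join(out), needs_review

    print(render_us("The bonnet colour faded".split()))
    # -> ('The bonnet color faded', ['bonnet'])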

Of course there is a huge amount of work in doing the above, even if we had
some user-friendly apps that showed people a paragraph of text and asked
them to pick the correct meaning of an ambiguous word. But these are
precisely the sort of easy, entry-level tasks that we need to give a gentle
learning path for new editors. Our edit filters and vandal-fighting bots
have taken away vandalism as a way to recruit new editors, and AWB and
other quality-improvement programs mean that we aren't recruiting as many
new editors from readers who just spot a typo and fix it. Word
disambiguation, possibly incorrect words and ambiguous sentences would be a
great way to recruit our readers to become editors. We just need a little
advert, "can you spare a few minutes to improve Wikipedia", and we could
crowdsource this to our readers.
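
The machinery for such a micro-task is genuinely small. Something along
these lines (all names invented) would already capture a usable judgement:

    def ask_reader(paragraph, word, senses):
        """Show a paragraph and ask which sense of `word` is meant.

        `senses` is a list of short glosses. Returns the chosen gloss, or
        None if the reader skips. A real deployment would record the
        answer against the article revision rather than just returning it.
        """
        print(paragraph)
        print("In the text above, '%s' most likely means:" % word)
        for i, gloss in enumerate(senses, 1):
            print("  %d. %s" % (i, gloss))
        choice = input("Pick a number, or press Enter to skip: ").strip()
        if choice.isdigit() and 1 <= int(choice) <= len(senses):
            return senses[int(choice) - 1]
        return None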

Jonathan

> Date: Tue, 30 Apr 2013 16:26:44 -0400
> From: David Cuenca <dacuetu at gmail.com>
> To: Wikimedia Mailing List <wikimedia-l at lists.wikimedia.org>
> Subject: [Wikimedia-l] Human-assisted machine translation (it was:
>         "The case for supporting open source machine translation")
> Message-ID: <CAJBSGSoRh_TPr++NOHokW7BCLDY7r1p1JhZW42LkcRLuYfQg6A at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> I have been giving some thought to Erik's proposal and, while it is
> already fascinating, I would like to put it in different terms.
> Instead of asking "Could open source MT be such a strategic investment?",
> I would ask "Is there a way to have Wikimedia's technology and people
> collaborate with MT systems?" The first can be seen as entering areas
> quite out of our reach; the second would be more about paving the way for
> other actors that are already in the field. Our strength has always been
> based around human collaboration empowered by technology, and if MT is
> wanted, then we should consider approaching it from our areas of expertise.
>
> One of the biggest problems in MT is word disambiguation. Wikidata's item
> properties could be a way of setting the general context for article
> translation, and if that turns out not to be reliable enough, users should
> have the opportunity to specify on the source text the intended meaning of
> a certain word. While that could be less than ideal for literary works,
> where double meanings and other subtleties must be taken into account, it
> might be quite useful for Wikipedia, giving MT software fertile soil in
> which to grow. The standards for specifying word meanings for MT software
> are unknown to me, but it might be worth exploring.
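>
> As a purely illustrative sketch (field names and the item ID are made
> up), one such human judgement might be nothing more than a small record
> pinning a word occurrence to a Wikidata item:
>
>     # One reader's decision: this occurrence of "bonnet" means the car
>     # part. The item ID below is a placeholder, not a real Wikidata QID.
>     annotation = {
>         "article": "Morris Minor",
>         "sentence": 12,
>         "token": 7,
>         "surface": "bonnet",
>         "meaning": "Q0000000",
>     }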
>
> Another interesting hurdle for MT is dictionary building. OmegaWiki seems
> like a system that could be used for bridging the gap between pairs of
> languages, in such a way that, if we know the exact sense of a word in the
> source language, a user could seamlessly fill in the missing word and
> definition in the target language. That could be a unique form of
> collaboration between source-language speakers providing precision about
> the meaning being used and target-language speakers filling the gaps.
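>
> To illustrate (the field names are invented, not OmegaWiki's actual
> schema), the gap-filling could operate on records like this one, where a
> German speaker only has to supply the missing expression:
>
>     defined_meaning = {
>         "definition": "the hinged cover over the engine of a motor car",
>         "expressions": {
>             "en": "bonnet",
>             "fr": "capot",
>             "de": None,  # the gap a German speaker would fill in
>         },
>     }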
>
> Dictionaries alone are not enough; grammar rules would need to be wikified
> too. All in all, OmegaWiki/Wiktionary could become the front-end and
> repository for external MT systems, whether used for Wikipedia or for
> other pages.
>
> There would be no need to create a new MT system, because rule-based MT
> programs that could make use of such an infrastructure already exist.
> Some of them are open source, too. If you are interested, I could ask for
> opinions about the feasibility on the Apertium lists. In my opinion, they
> also fit into the "smartest, well-intentioned group of people" category
> that Erik was asking about.
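>
> (For a sense of scale: once a language pair is installed, running
> Apertium is roughly a one-liner of the form
>     echo "some text" | apertium en-es
> so the hard part really is the dictionaries and rules, not the plumbing.)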
>
> Cheers,
> David --Micru
>
>
> On Tue, Apr 30, 2013 at 4:15 AM, Chris Tophe <kipmaster at gmail.com> wrote:
>
> > 2013/4/29 Mathieu Stumpf <psychoslave at culture-libre.org>
> >
> > > On 2013-04-26 at 20:27, Milos Rancic wrote:
> > >
> > >> OmegaWiki is a masterpiece from the perspective of a [computational]
> > >> linguist. Erik made the structure so well that it's the best starting
> > >> point for creating a contemporary multilingual dictionary. I haven't
> > >> seen anything better in concept. (And, yes, whenever I thought about
> > >> creating such software on my own, I always ended up at the dead end
> > >> of "but OmegaWiki is already that".)
> > >>
> > >
> > > Where can I find documentation about this structure, please?
> >
> > Here (planned structure):
> > http://meta.wikimedia.org/wiki/OmegaWiki_data_design
> >
> > and also there (current structure):
> > http://www.omegawiki.org/Help:OmegaWiki_database_layout
> >
> > And a gentle reminder that comments are requested ;-)
> > http://meta.wikimedia.org/wiki/Requests_for_comment/Adopt_OmegaWiki
>
> --
> Etiamsi omnes, ego non