Re: [Wikitech-l] machine translation of the articles...

20 Nov 2005

NOTE: I am replying to an older article because it was the most recent
thread I could find in my archives on the topic. I think Jimmy's comments
accurately reflect most people's (justified) low opinion of raw (unaided)
machine translation output.

On 8/9/04, Jimmy (Jimbo) Wales <jwales(a)wikia.com> wrote:
...

 First, it is important to understand that for the most part, the individual
 wikipedia languages are not mere translations.

Perhaps they should be. Or, more precisely, perhaps there should be a way to
get the English article translated into Urdu, alongside the independent Urdu
version of the article (with different, Urdu-centric content, as we have
now). For example, I'd be interested in knowing how the French article on
Sartre differs from the English one, but I don't read French.

...
 Second, machine language translation is typically quite poor.

There are ways to get much, much better machine translation with a little
extra effort from native speakers of the source language. If the words in an
article are part-of-speech (POS) tagged (noun, verb, adjective, preposition,
etc.) then the quality of machine translation of that text improves
dramatically.
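
As a rough illustration of why the tags help, here is a toy sketch in Python.
It is not how any particular MT system works; the tiny English-to-French
lexicon, the tag names, and the function names are all made up for the
example. The point is only that a bare word can have several candidate
translations, while a (word, tag) pair usually has one.

```python
# Toy English-to-French lexicon keyed by (word, part-of-speech tag).
# Entries and tag names are illustrative assumptions, not real MT data.
LEXICON = {
    ("book", "NOUN"): "livre",
    ("book", "VERB"): "réserver",
    ("light", "NOUN"): "lumière",
    ("light", "ADJ"): "léger",
}


def candidates(word):
    """Return every translation of an untagged word (the ambiguous case)."""
    return sorted(t for (w, _), t in LEXICON.items() if w == word.lower())


def translate_tagged(tagged_tokens):
    """Translate (word, tag) pairs; unknown pairs pass through unchanged."""
    return [LEXICON.get((w.lower(), pos), w) for w, pos in tagged_tokens]


if __name__ == "__main__":
    # Without a tag, "book" is ambiguous between two translations.
    print(candidates("book"))
    # With a tag supplied by a native speaker, the choice is determined.
    print(translate_tagged([("book", "VERB"), ("light", "ADJ")]))
```

A real system would still have to handle word order, agreement, and
multi-word expressions; the tags only remove one source of ambiguity.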

I work for the Linguistic Data Consortium at the University of Pennsylvania,
where I provide IT support to a group of linguists who create and distribute
the corpora (datasets) used by the researchers (both public and private) who
develop machine translation systems, automatic content-extraction systems,
and a variety of other computational linguistic systems.

If people are interested, I'll look into getting a few articles POS-tagged
(bribe a linguistics grad student with free lunch or something) and run them
through some public (grant-funded, open-source) MT systems to demo the
output. If the output is reasonable enough to offer up on the site as-is, or
with minimal corrections (maybe a few sentences), then I think it would be
worth considering.

As a huge (and rapidly growing) collection of GFDL-ed text, the Wikipedia is
a valuable public linguistic resource. If it could also provide a set of
parallel texts in several different languages (the human-corrected versions
of machine-translated articles), then it would become even more valuable: a
virtual Rosetta Stone for the modern age.

-Bill Clark
