NOTE: I am replying to an older article because it was the most recent
thread I could find in my archives on the topic. I think Jimmy's comments
accurately reflect most people's (justified) low opinion of raw
(unaided) machine translation output.
On 8/9/04, Jimmy (Jimbo) Wales <jwales(a)wikia.com> wrote:
> First, it is important to understand that for the most part, the
> wikipedia languages are not mere translations.
Perhaps they should be, or more precisely, perhaps there should be a way to
get the English article translated into Urdu alongside the existing Urdu
article (which has different, Urdu-centric content, as we have now). For
example, I'd be interested in knowing how the French article on Sartre
differs from the English one, but I don't read French.
> Second, machine language translation is typically
There are ways to get much, much better machine translation with a little
extra effort from native speakers of the source language. If the words in an
article are part-of-speech (POS) tagged (noun, verb, adjective, preposition,
etc.) then the quality of machine translation of that text improves
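
To make the tagging idea concrete, here is a minimal sketch of why a POS tag
helps. It assumes a toy English-to-French lexicon keyed on (word, tag) pairs.
The lexicon entries, the tag names, and the translate_tagged function are all
hypothetical and not taken from any real MT system. It only does word-by-word
lookup, which is far simpler than what an actual system does.

```python
# Toy lexicon keyed on (word, part-of-speech tag).
# Hypothetical data for illustration only, not from any real MT system.
LEXICON = {
    ("watch", "NOUN"): "montre",    # a wristwatch
    ("watch", "VERB"): "regarder",  # to look at
}


def translate_tagged(tagged_tokens):
    """Look up each (word, tag) pair word by word.

    Pairs missing from the lexicon are passed through unchanged.
    """
    return [
        LEXICON.get((word.lower(), tag), word)
        for word, tag in tagged_tokens
    ]


# The same English word gets a different translation depending on its tag.
print(translate_tagged([("watch", "NOUN")]))
print(translate_tagged([("watch", "VERB")]))
```

Without the tag, a lookup on "watch" alone would have to guess between the
two senses. With it, the choice is made by the native speaker who tagged the
text.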
I work for the Linguistic Data Consortium at the University of Pennsylvania,
where I provide IT support to a group of linguists who create and distribute
the corpora (datasets) used by the researchers (both public and private) who
develop machine translation systems, automatic content-extraction systems,
and a variety of other computational linguistic systems.
If people are interested, I'll look into getting a few articles POS-tagged
(bribe a linguistics grad student with free lunch or something) and run them
through some public (grant-funded, open-source) MT systems to demo the
output. If the output is reasonable enough to offer up on the site as-is, or
with minimal corrections (maybe a few sentences) then I'd think it might be
As a huge (and rapidly growing) collection of GFDL-ed text, the Wikipedia is
a valuable public linguistic resource. If it could also provide a set of
parallel texts in several different languages (the human-corrected versions
of machine translated articles) then it would become even more valuable, a
virtual Rosetta Stone for the modern age.