NOTE: I am replying to an older article because it was the most recent thread I could find in my archives on the topic. I think Jimmy's comments accurately reflect most people's (justified) low opinion of raw (unaided) machine translation output.
On 8/9/04, Jimmy (Jimbo) Wales jwales@wikia.com wrote:
> First, it is important to understand that for the most part, the individual wikipedia languages are not mere translations.
Perhaps they should be, or more precisely, perhaps there should be a way to get the English article translated into Urdu alongside the native Urdu article (with different, Urdu-centric content, as we have now). For example, I'd be interested in knowing how the French article on Sartre differs from the English one, but I don't read French.
> Second, machine language translation is typically quite poor.
There are ways to get much, much better machine translation with a little extra effort from native speakers of the source language. If the words in an article are part-of-speech (POS) tagged (noun, verb, adjective, preposition, etc.), then the quality of machine translation of that text improves dramatically, because the tag tells the system which sense of an ambiguous word is meant.
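To give a rough feel for why the tags help, here is a toy sketch in Python. Everything in it is made up for illustration: the tiny English-to-French lexicon, the tag names (NOUN, VERB, DET), and the `translate` and `first_sense` helpers are my own assumptions and do not come from any real MT system. It only produces word-for-word glosses, which is far cruder than what an actual system does, but it shows how a POS tag can pick the right sense of an ambiguous word like "book".

```python
# Toy illustration only: the lexicon entries and tag names are invented
# for this sketch and are not taken from any real MT system.
LEXICON = {
    ("book", "NOUN"): "livre",
    ("book", "VERB"): "réserver",
    ("a", "DET"): "un",
    ("flight", "NOUN"): "vol",
}


def first_sense(word):
    """Return the first lexicon entry for a word, ignoring its POS tag."""
    for (entry_word, _tag), target in LEXICON.items():
        if entry_word == word:
            return target
    return word  # unknown words pass through unchanged


def translate(tagged_tokens, use_tags=True):
    """Gloss each (word, tag) pair word-for-word into the target language."""
    output = []
    for word, tag in tagged_tokens:
        key = word.lower()
        if use_tags:
            # The tag selects the intended sense of an ambiguous word.
            output.append(LEXICON.get((key, tag), word))
        else:
            # Without tags, the system can only guess a sense.
            output.append(first_sense(key))
    return output


sentence = [("Book", "VERB"), ("a", "DET"), ("flight", "NOUN")]
print(translate(sentence, use_tags=False))
print(translate(sentence, use_tags=True))
```

Without the tags, the sketch picks the noun sense of "book" ("livre"); with them, it picks the verb sense ("réserver"). Real systems handle word order, agreement, and much more, but the disambiguation step is the part that human tagging feeds directly.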
I work for the Linguistic Data Consortium at the University of Pennsylvania, where I provide IT support to a group of linguists who create and distribute the corpora (datasets) used by the researchers (both public and private) who develop machine translation systems, automatic content-extraction systems, and a variety of other computational linguistic systems.
If people are interested, I'll look into getting a few articles POS-tagged (bribe a linguistics grad student with free lunch or something) and run them through some public (grant-funded, open-source) MT systems to demo the output. If the output is reasonable enough to offer up on the site as-is, or with minimal corrections (maybe a few sentences), then I think it would be worth considering.
As a huge (and rapidly growing) collection of GFDL-ed text, the Wikipedia is a valuable public linguistic resource. If it could also provide a set of parallel texts in several different languages (the human-corrected versions of machine-translated articles), then it would become even more valuable: a virtual Rosetta Stone for the modern age.
-Bill Clark