Hi,
I have just joined. I am from Mumbai, India. I would like to get the articles translated into Marathi, my mother tongue. Looking at the effort involved and the number of volunteers, this will not be usable in any reasonable amount of time.
That has made me think of alternatives - machine translation. A state-funded institute has software available, but I don't have access to it yet.
Please comment on this approach. Has this been tried for any other language before?
Thanks & regards, Prasad Gadgil
prasad gadgil wrote:
That has made me think of alternatives - machine translation. [..] Please comment on this approach. Has this been tried for any other language before?
I'm not clear on what you're asking about. Machine translation has certainly been available for a variety of languages for quite some time, e.g. http://www.google.com/language_tools or AltaVista Babelfish (I believe they use the same SysTran backend). The technology keeps improving, but even after working with state-of-the-art, backpropagating neural-network-based translators, I remain unimpressed - don't get your hopes up too high.
If you're proposing adding a new Wikipedia language purely on the basis of translated articles, I do not believe that's been done, and it probably wouldn't be a good idea. Getting people that speak your language involved, and then having them individually use machine translation on articles they want to work on, fix those up, and put them into Wikipedia seems like a much better option.
Cheers, Ivan.
prasad gadgil wrote:
I have just joined. I am from Mumbai, India. I would like to get the articles translated into Marathi, my mother tongue. Looking at the effort involved and the number of volunteers, this will not be usable in any reasonable amount of time.
I understand. According to Wikipedia, there are 68 million native speakers of Marathi, and 3 million second language speakers. This is comparable, for example, to Italian, with 70 million total speakers.
Getting Wikipedia content in languages such as Marathi is absolutely central to our mission, and so I want you to know that I support you fully in your desire to find ways to make this happen in a timely and effective manner.
That has made me think of alternatives - machine translation. A state-funded institute has software available, but I don't have access to it yet.
Please comment on this approach. Has this been tried for any other language before?
First, it is important to understand that for the most part, the individual wikipedia languages are not mere translations. In some cases, of course, contributors find it easier to just translate an article -- after all, for many, many topics, en.wikipedia.org is the most convenient and best reference, and the content is available for free. But it is more generally the case in fr, de, ja, it, etc. that the articles are independent constructions from scratch.
Second, machine language translation is typically quite poor. It is not likely that the tool you are talking about can do a good enough job.
At the same time, I absolutely think that if you can find a way to get access to this state-funded agency's tool for wikipedia contributors (is there some way that I can help, for example by writing a letter to some authorities, giving you a signature on an application, or even paying money, if it is not too much?), and if it helps a small number of contributors to work faster, then I am all for it.
What I would envision is that you would have to do a machine translation and then immediately edit it to correct grammar and meaning mistakes. Whether that would be faster or slower than either writing the article from scratch or doing a human translation yourself, I do not know.
--Jimbo
NOTE: I am replying to an older article because it was the most recent thread I could find in my archives on the topic. I think Jimmy's comments accurately reflect most people's (justified) low opinion of raw (unaided) machine translation output.
On 8/9/04, Jimmy (Jimbo) Wales jwales@wikia.com wrote:
First, it is important to understand that for the most part, the individual wikipedia languages are not mere translations.
Perhaps they should be, or more precisely, perhaps there should be a way to get the English article translated into Urdu, as well as an Urdu version of the article (with different, Urdu-centric content, as we have now). I'd be interested in knowing how the French article on Sartre differed from the English one (for example) but I don't read French.
Second, machine language translation is typically quite poor.
There are ways to get much, much better machine translation with a little extra effort from native speakers of the source language. If the words in an article are part-of-speech (POS) tagged (noun, verb, adjective, preposition, etc.) then the quality of machine translation of that text improves dramatically.
I work for the Linguistic Data Consortium at the University of Pennsylvania, where I provide IT support to a group of linguists who create and distribute the corpora (datasets) used by the researchers (both public and private) who develop machine translation systems, automatic content-extraction systems, and a variety of other computational linguistic systems.
If people are interested, I'll look into getting a few articles POS-tagged (bribe a linguistics grad student with free lunch or something) and run them through some public (grant-funded, open-source) MT systems to demo the output. If the output is reasonable enough to offer up on the site as-is, or with minimal corrections (maybe a few sentences), then I'd think it might be worth considering.
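To make the POS-tagging step concrete, here is a minimal sketch of what tagging an excerpt might look like. It assumes Python and NLTK, which are my choices for illustration only; the grant-funded MT systems mentioned above each expect their own input format, and the automatic tags would still want a native speaker's corrections.

# Minimal sketch, assuming Python + NLTK (illustration only, not any particular MT system's format).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = ("Jean-Paul Charles Aymard Sartre was a French existentialist "
        "philosopher, dramatist, novelist and critic.")

tokens = nltk.word_tokenize(text)   # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)       # list of (word, Penn Treebank tag) pairs

print(" ".join(word + "/" + tag for word, tag in tagged))
# e.g. ... Sartre/NNP was/VBD a/DT French/JJ existentialist/JJ philosopher/NN ...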
As a huge (and rapidly growing) collection of GFDL-ed text, the Wikipedia is a valuable public linguistic resource. If it could also provide a set of parallel text in several different languages (the human-corrected versions of machine translated articles) then it would become even more valuable, a virtual Rosetta Stone for the modern age.
-Bill Clark
Bill Clark wrote:
NOTE: I am replying to an older article because it was the most recent thread I could find in my archives on the topic. I think Jimmy's comments accurately reflect most people's (justified) low opinion of raw (unaided) machine translation output.
On 8/9/04, Jimmy (Jimbo) Wales jwales@wikia.com wrote:
First, it is important to understand that for the most part, the individual wikipedia languages are not mere translations.
Perhaps they should be, or more precisely, perhaps there should be a way to get the English article translated into Urdu, as well as an Urdu version of the article (with different, Urdu-centric content, as we have now). I'd be interested in knowing how the French article on Sartre differed from the English one (for example) but I don't read French.
What is notable is that the Sartre and other articles in French and English are independently written rather than one being a translation of the other. At first glance the Spanish version appears to be a translation from English, and it is conceivable that one or more of the 37 other current versions of the Sartre article are translations, but I'm not in a position to verify that because of my limited knowledge of these languages. The Urdu version has not yet been written.
Looking at the brief introductory paragraph and the first biography paragraph, we have in English
Jean-Paul Charles Aymard Sartre (June 21, 1905 – April 15, 1980) was a French existentialist philosopher, dramatist, novelist and critic.
Early life and thought
Sartre was born in Paris to parents Jean-Baptiste Sartre, an officer of the French Navy, and Anne-Marie Schweitzer, cousin of Albert Schweitzer. When he was 15 months old, his father died of a fever and Anne-Marie raised him with help from her father, Charles Schweitzer, who taught Sartre mathematics and introduced him to classical literature at an early age.
In French we have
Jean-Paul Sartre (Paris 21 juin 1905 - Paris 15 avril 1980) est un philosophe et écrivain français.
Biographie
Né à Paris le 21 juin 1905, Sartre est orphelin de père à deux ans et grandit à Paris, dans un milieu bourgeois et intellectuel. Il fait ses études secondaires au lycée Henri IV, où il fait la connaissance de Paul Nizan.
which translates as
Jean-Paul Sartre (Paris, June 21, 1905 - Paris, April 15, 1980) is a French philosopher and writer.
Biography
Born in Paris on June 21, 1905, Sartre was paternally orphaned at two years old and grew up in Paris, in a bourgeois and intellectual environment. His secondary studies were done at the Lycée Henri IV, where he became acquainted with Paul Nizan.
It is interesting to note that reference to the Schweitzer family appears nowhere in the French article, and Paul Nizan appears nowhere in the English article!
Second, machine language translation is typically quite poor.
There are ways to get much, much better machine translation with a little extra effort from native speakers of the source language. If the words in an article are part-of-speech (POS) tagged (noun, verb, adjective, preposition, etc.) then the quality of machine translation of that text improves dramatically.
I agree that there are ways to improve machine translations, but it strikes me as impossible for machines to reconcile the cultural gaps which may exist between language versions. That requires the intervention of thinking humans.
Ec
On 11/20/05, Bill Clark wclarkxoom@gmail.com wrote: [snip]
There are ways to get much, much better machine translation with a little extra effort from native speakers of the source language. If the words in an article are part-of-speech (POS) tagged (noun, verb, adjective, preposition, etc.) then the quality of machine translation of that text improves dramatically.
[snip]
I've been making a little effort, on and off, to improve the parsability of Wikipedia articles by Link Grammar (http://bobo.link.cs.cmu.edu/link/). Generally, the formal style used in most articles provides easy material for Link Grammar to parse correctly, and most of the statements that are unparsable contain clear grammatical or spelling mistakes.
Generally, the two biggest sources of parse errors I've run into using Link Grammar that cannot be attributed to an obvious mistake are the omission of the serial comma, and subject-area verbs which are not in my dictionary. I'm not sure why the serial comma isn't required in the Manual of Style, as its omission sometimes causes human readers to group objects incorrectly.
I think that machine readability should be a long-term goal for Wikipedia, even if we do not intend to use it to facilitate translation. Generally, text which is machine-parsable without markup also tends to be more easily readable by human readers with widely varying levels of skill. Once we factor in the improvements in searching, translation, and machine intelligence, the desirability of machine parsability becomes clearer.
For example, I've toyed with making my content-filtering bot (output available on freenode IRC in #wikipedia-suspectedits) use Link Grammar to parse sentences and detect when someone has negated/inverted the meaning of a sentence. Unfortunately I can't put this into production on my bot, because the machine parsability of Wikipedia is currently too low, and Link Grammar's performance on difficult-to-parse text is also too low.
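For what it's worth, a much cruder check than full parsing can still hint at a possible meaning inversion: diff the two revisions and flag any inserted or removed negation word. The sketch below is only an illustration in Python (not the Link Grammar approach described above), and the negation word list is an arbitrary assumption.

# Crude sketch: flag edits that insert or remove a negation word.
# Not the Link Grammar approach; the word list is an arbitrary assumption.
import difflib
import re

NEGATIONS = {"not", "no", "never", "nor", "without"}

def negation_flip(old_text, new_text):
    """Return True if the edit inserted or removed a negation word."""
    old_words = re.findall(r"[\w']+", old_text.lower())
    new_words = re.findall(r"[\w']+", new_text.lower())
    changed = set()
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("insert", "replace"):
            changed.update(new_words[j1:j2])
        if op in ("delete", "replace"):
            changed.update(old_words[i1:i2])
    return bool(changed & NEGATIONS)

print(negation_flip("Sartre was born in Paris.",
                    "Sartre was not born in Paris."))      # True
print(negation_flip("Sartre was born in Paris.",
                    "Sartre was born in Paris in 1905."))  # False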
On Mon, 9 Aug 2004 15:05:27 +0100 (BST) prasad gadgil gadgil_p@yahoo.co.in wrote:
Hi,
I have just joined. I am from Mumbai, India. I would like to get the articles translated into Marathi, my mother tongue. Looking at the effort involved and the number of volunteers, this will not be usable in any reasonable amount of time.
That has made me think of alternatives - machine translation. A state-funded institute has software available, but I don't have access to it yet.
Please comment on this approach. Has this been tried for any other language before?
Machine translations are in general of poor quality. They vary from 'containing a few strange errors' to 'not understandable at all'. If you intend to use machine translations, I think it would be better to take the translation as the basis for your own text rather than using it directly, or using it with only the grammar errors removed.
Andre Engels
On Monday 09 August 2004 16:05, prasad gadgil wrote:
Please comment on this approach. Has this been tried for any other language before?
The entire English Wikipedia has been translated into Czech at http://wikipedia.infostar.cz/ . But the quality of the translation is not too good, I am told.
Nikola Smolenski wrote:
The entire English Wikipedia has been translated into Czech at http://wikipedia.infostar.cz/ . But the quality of the translation is not too good, I am told.
Hi,
Some linguists and I think that the quality of the infostar.cz translator is really poor. But Google indexes these translated pages, so Czech users can find the original (English) version of a page using keywords in Czech.
I have started working on a Perl Wikipedia ToolKit ( http://wiki.kn.vutbr.cz/mj/index.cgi?Perl%20Wikipedia%20ToolKit ). I would like to add modules for "computer-aided Wikipedia translation". I think a program can translate article titles, category names, and some sentences (using a translation memory) without big mistakes, at least. The script could then watch the originals of translated pages and say "translation update needed for these sentences: ....". For some pages from the Template namespace or from Category:Lists (e.g. Template:Wikipediatoc, List of dog breeds, ...), machine translation can really help.
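As a rough idea of what the translation-memory part could look like, here is a tiny sketch. It is in Python rather than Perl, and the phrase table entries are invented for illustration, not real toolkit data.

# Sketch of a translation-memory lookup for titles and repeated phrases.
# Python rather than Perl; the entries below are invented for illustration.
translation_memory = {
    "List of dog breeds": "Seznam psich plemen",        # hypothetical translation
    "Template:Wikipediatoc": "Sablona:Wikipediatoc",    # hypothetical translation
}

def translate_title(title):
    """Return a remembered translation, or None so a human can translate it."""
    return translation_memory.get(title)

for title in ("List of dog breeds", "Jean-Paul Sartre"):
    hit = translate_title(title)
    print(title, "->", hit if hit else "(needs human translation)")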