[Foundation-l] WikiTrans Support for All Languages

Jeffrey V. Merkey jmerkey at wolfmountaingroup.com
Fri Aug 18 03:57:43 UTC 2006


Sabine Cretella wrote:

  

>Yes, this is a wikimedia project, but has less than 10,000 articles. 
>Therefore your entry does not go into the first part of the list but 
>under "other languages". And that is where I now moved your entry.
>
>I hope you understand that we understand well how relevant Cherokee is 
>among the regional languages (please don't misunderstand the term 
>"regional"). Nonetheless all of us have to follow these rules, and many 
>languages are still below the 10,000 hurdle. I hope to see the number 
>1,000 soon - let me know when you reach it ... it is hard work for a 
>small community. Once you have 1,000, the step to reach 5,000 is much 
>faster, and from there to 10,000 is faster again.
>
>I know you are doing loads of work with machine translation on the 
>non-wikimedia wiki - unfortunately, so far I have not managed to 
>install a proper font to be able to read that wiki - I tried more than 
>once, but for some reason my computer refuses to install the font :-(
>
>Best wishes from Italy,
>
>Sabine
>Chat with your friends in real time! 
> http://it.yahoo.com/mail_it/foot/*http://it.messenger.yahoo.com 
>  
>

Sabine,

I think it's time to approach this subject based on your explanation of 
language evolution. Let me make you an offer to speed up Italian 
translation (and any other language) for your wiki.

At present, I have WikiTrans adapted to read XML dumps from the English 
Wikipedia, dissect the English using the CMU link grammar parser, 
perform a word-by-word lexicon translation, then run a conjugator and 
verb constructor over the translated sentences, reordering noun-verb 
pairs and morphemes and outputting into a target language with proper 
person, tense, and plurality. WikiTrans is set up to use hierarchical 
lexicons and hierarchical thesauruses for English. I can convert over 
98% of English at present. The lexicon, thesaurus, conjugators, and 
inference engine are language-neutral by design -- I can use them on 
any language.
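
Roughly, the per-sentence flow looks like the sketch below. This is a 
simplified illustration in Python, not the actual WikiTrans code; the 
two-column tab-separated lexicon layout and the function names are 
assumptions made just for the example.

    import re

    def load_lexicon(path):
        # Assumed flat "english<TAB>target" file; the real lexicons are
        # hierarchical, but the lookup idea is the same.
        lexicon = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 2:
                    lexicon[fields[0].lower()] = fields[1]
        return lexicon

    def translate_sentence(sentence, lexicon):
        # 1. Tokenize. The real pipeline uses the CMU link grammar
        #    parser here to get linkage and part-of-speech information.
        tokens = re.findall(r"[A-Za-z']+|[.,;:!?]", sentence)
        # 2. Word-by-word lexicon substitution; unknown words pass
        #    through unchanged.
        stems = [lexicon.get(t.lower(), t) for t in tokens]
        # 3. The conjugator/verb constructor and the noun-verb and
        #    morpheme reordering are language specific and omitted here.
        return " ".join(stems)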

I also have an AI inference engine which can tune the output to specific 
dialects. It works as follows:

After I output into the target language, in order to "teach" the 
inference engine a set of rules I take a dozen or so articles, go 
through them by hand and correct them, then run the inference engine, 
comparing the two versions and recording the tense and other minor 
corrections -- in other words, it "learns" how to construct sentences 
properly for a unique dialect. I can currently tune it to output into 
Giduwa or Otali in Cherokee by altering the lexicon hierarchy for a 
target dialect.
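
In spirit, the learning step works like the following sketch: diff the 
machine output against the hand-corrected article and keep each 
word-level replacement as a candidate rewrite rule. Again, this is only 
an illustration of the idea, not the actual inference engine.

    import difflib

    def learn_rules(machine_text, corrected_text):
        # Record each word-level replacement the human made as a rule.
        rules = {}
        a, b = machine_text.split(), corrected_text.split()
        matcher = difflib.SequenceMatcher(None, a, b)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "replace":
                rules[" ".join(a[i1:i2])] = " ".join(b[j1:j2])
        return rules

    def apply_rules(text, rules):
        # Replay the learned corrections (tense fixes, dialect word
        # choices, and so on) on fresh machine output.
        for wrong, right in rules.items():
            text = text.replace(wrong, right)
        return text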

I can parse and translate the entire 7 GB Wikipedia XML dump in about 
15 minutes. What can be done here is for me to output into, say, 
Italian, and speed up the process of editing and inputting articles for 
a target language by a proofreader.
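
Reading the dump is a single streaming pass over the <page> elements. 
A minimal Python sketch of that pass (matching tags by local name is my 
shortcut here so the export-schema version does not matter):

    import bz2
    import xml.etree.ElementTree as ET

    def _local(tag):
        # Strip the "{namespace}" prefix from an element tag.
        return tag.rsplit("}", 1)[-1]

    def iter_articles(dump_path):
        # Stream <page> elements without loading the whole dump into
        # memory; works on the raw .xml or the .bz2 download.
        opener = bz2.open if dump_path.endswith(".bz2") else open
        with opener(dump_path, "rb") as f:
            for _, elem in ET.iterparse(f):
                if _local(elem.tag) == "page":
                    title, text = None, ""
                    for child in elem.iter():
                        if _local(child.tag) == "title":
                            title = child.text
                        elif _local(child.tag) == "text":
                            text = child.text or ""
                    yield title, text
                    elem.clear()  # release memory as we go

Each (title, wikitext) pair is then fed through the translation 
pipeline sketched above.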

In the current version, every fifth article or so in Cherokee requires 
me to proofread it and correct very subtle errors in tense or noun 
disambiguation, so it's not perfect, but it's better than writing the 
articles by hand, and most of the output is over 95% correct.

If you are interested, download the lexicons and Roget thesaurus from 
the FTP server at ftp.wikigadugi.org/wiki, replace the Cherokee words 
with Italian (or whatever language you like), and let me know where to 
get them. I will then post runs for any target language; you can 
correct a dozen or so long articles and send the files back, and I will 
run the inference engine and create table rules that should give you 
close to 98% accuracy when compared against the English Wikipedia. It 
will rapidly speed translations into other languages and give the 
non-English wikis a better chance of catching up. After we build the 
syntax rule databases for each language, I'll give them back to you 
with an additional extension that will let you not only translate 
Wikipedia, but also put a front end on a proxy server so web browsers 
can access websites and have pages and web content translated in real 
time, similar to what Google is doing now.
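
Repopulating the lexicons for a new language is mostly mechanical. 
Assuming a two-column tab-separated "english<TAB>cherokee" layout (the 
actual files on the FTP server may differ), something like this 
produces a blank copy for your editors to fill in:

    import csv

    def make_blank_lexicon(src_path, dst_path):
        # Keep the English headword column, empty the target column so
        # Italian (or any other language) entries can be filled in.
        with open(src_path, encoding="utf-8", newline="") as src, \
             open(dst_path, "w", encoding="utf-8", newline="") as dst:
            reader = csv.reader(src, delimiter="\t")
            writer = csv.writer(dst, delimiter="\t")
            for row in reader:
                if row:
                    writer.writerow([row[0], ""])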

If you and the other non-English editors are interested, download and 
populate the lexicons (I am at 230,000 words and phrases at present), 
and every time Wikipedia posts an XML dump, I'll machine-translate it; 
you can download the output and use it to populate the other wikis. We 
will need a system to exclude articles that have already been reviewed 
and translated.
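
The exclusion system could be as simple as a shared list of reviewed 
titles that each dump run skips -- a sketch, with hypothetical file and 
function names:

    import os

    REVIEWED = "reviewed_titles.txt"   # one article title per line

    def load_reviewed(path=REVIEWED):
        if not os.path.exists(path):
            return set()
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}

    def mark_reviewed(title, path=REVIEWED):
        with open(path, "a", encoding="utf-8") as f:
            f.write(title + "\n")

    # On each new dump run, skip anything already reviewed:
    #   reviewed = load_reviewed()
    #   for title, text in iter_articles(dump_path):
    #       if title in reviewed:
    #           continue
    #       ... translate and post ...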

My wife and son-in-law are going to help me create the German lexicons 
and rule sets; the other languages are wide open.

Jeff


