[Foundation-l] WikiTrans Support for All Languages
Jeffrey V. Merkey
jmerkey at wolfmountaingroup.com
Fri Aug 18 04:06:27 UTC 2006
Jeffrey V. Merkey wrote:
>Sabine Cretella wrote:
>
>>Yes, this is a Wikimedia project, but it has fewer than 10,000
>>articles. Your entry therefore does not go into the first part of the
>>list but under "other languages", and that is where I have now moved
>>it.
>>
>>I hope you understand that we know very well how relevant Cherokee is
>>among the regional languages (please don't misunderstand the term
>>"regional"). Nonetheless, all of us have to follow these rules, and
>>many languages are still below the 10,000 hurdle. I hope to see you
>>reach 1,000 soon - let me know when you do ... it is hard work for a
>>small community. Once you have 1,000, the step to 5,000 is much
>>faster, and from there to 10,000 faster again.
>>
>>I know you are doing loads of work with machine translation on the
>>non-Wikimedia wiki - unfortunately, I have not yet managed to install
>>a proper font to read that wiki. I have tried more than once, but for
>>some reason my computer refuses to install it :-(
>>
>>Best wishes from Italy,
>>
>>Sabine
>
>Sabine,
>
>I think it's time to approach this subject, based on your explanation
>of language evolution. I would like to make you an offer to speed up
>Italian translation (and any other language) for your wiki.
>
>At present, I have WikiTrans adapted to read XML dumps from the English
>Wikipedia. It dissects English using the CMU link grammar parser,
>performs a word-by-word lexicon translation, then runs a conjugator and
>verb constructor over the translated sentences, reordering noun-verb
>pairs and morphemes and outputting into a target language with proper
>person, tense, and plurality. WikiTrans is set up to use hierarchical
>lexicons and hierarchical thesauruses for English. I can convert over
>98% of English at present. The lexicon, thesaurus, conjugators, and
>inference engine are language neutral by design -- I can use them on
>any language.
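>
>(For illustration, a minimal, self-contained Python sketch of the
>word-by-word lexicon pass follows. The data and names are hypothetical,
>not the actual WikiTrans code, and the parsing, conjugation, and
>reordering stages are omitted.)
>
>    # Illustrative sketch only -- not the actual WikiTrans code.
>    LEXICON = {"the": "il", "cat": "gatto", "sleeps": "dorme"}  # sample
>
>    def translate_sentence(sentence, lexicon):
>        # Word-by-word lexicon lookup; unknown words pass through
>        # unchanged. The real pipeline parses first, then conjugates
>        # and reorders.
>        return " ".join(lexicon.get(w.lower(), w)
>                        for w in sentence.split())
>
>    print(translate_sentence("The cat sleeps", LEXICON))
>    # -> il gatto dorme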
>
>I also have an AI inference engine which can tune the output to
>specific dialects. It works as follows: after I output into the target
>language, in order to "teach" the inference engine a set of rules, I
>take a dozen or so articles, go through them by hand and correct them,
>then run the inference engine, comparing and recording tense and minor
>corrections -- in other words, it "learns" how to construct properly
>for a unique dialect. I can currently tune it to output into Giduwa or
>Otali Cherokee by altering the lexicon hierarchy for a target dialect.
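>
>(A rough Python sketch of that learning step, assuming the simplest
>possible rule format -- a count of word-level substitutions harvested
>from a hand-corrected article. The real inference engine is more
>elaborate.)
>
>    import difflib
>    from collections import Counter
>
>    def learn_rules(machine_words, corrected_words):
>        # Record each (machine word -> corrected word) substitution
>        # the human editor made; frequent pairs become rules.
>        rules = Counter()
>        sm = difflib.SequenceMatcher(None, machine_words,
>                                     corrected_words)
>        for op, i1, i2, j1, j2 in sm.get_opcodes():
>            if op == "replace" and (i2 - i1) == (j2 - j1):
>                for a, b in zip(machine_words[i1:i2],
>                                corrected_words[j1:j2]):
>                    rules[(a, b)] += 1
>        return rules
>
>    print(learn_rules("il gatto dormi".split(),
>                      "il gatto dorme".split()))
>    # Counter({('dormi', 'dorme'): 1})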
>
>I can parse and translate the entire 7GB Wikipedia XML dump in about 15
>minutes. What can be done here is for me to output into, say, Italian,
>speeding up the editing for a target language and the inputting of
>articles by a proofreader.
>
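>(Throughput like that is possible because the dump can be streamed one
>page at a time; here is a standard-library sketch of such a streaming
>read in Python 3.8+ -- again illustrative, not the actual WikiTrans
>reader.)
>
>    import xml.etree.ElementTree as ET
>
>    def iter_pages(dump_path):
>        # Stream <page> elements so the 7GB file never sits in
>        # memory all at once.
>        for event, elem in ET.iterparse(dump_path, events=("end",)):
>            if elem.tag.endswith("page"):
>                title = elem.findtext("{*}title")
>                text = elem.findtext("{*}revision/{*}text") or ""
>                yield title, text
>                elem.clear()  # free the element once processed
>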
>In the current version, every fifth article or so in Cherokee requires
>me to proofread it and correct very subtle errors in tense or noun
>disambiguation, so it's not perfect, but it's better than writing
>articles by hand, and most of the output is over 95% correct.
>
>If you are interested, download the lexicons and Roget thesaurus from
>the FTP server at ftp.wikigadugi.org/wiki, replace the Cherokee words
>with Italian (or whatever language), and let me know where to get the
>result. I will then post runs for any target language; you can correct
>a dozen or so long articles and send the files back, and I will run the
>inference engine and create table rules that should give you close to
>98% accuracy when compared against the English Wikipedia. This will
>rapidly speed translation into other languages and give the non-English
>wikis a better chance of catching up. After we build the syntax rule
>databases for each language, I'll give them back to you with an
>additional extension that will let you not only translate Wikipedia but
>also put a front end on a proxy server, so web browsers can access
>websites and translate pages and web content in real time, similar to
>what Google is doing now.
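>
>(As a sketch of that swap -- assuming, purely for illustration, that
>each lexicon line is "english<TAB>cherokee"; the actual file format on
>the FTP server may differ:)
>
>    def swap_target_column(lexicon_path, replacements, out_path):
>        # Rewrite "english<TAB>cherokee" lines as
>        # "english<TAB>italian", keeping the English keys; entries
>        # without a replacement are left as they are.
>        with open(lexicon_path, encoding="utf-8") as src, \
>             open(out_path, "w", encoding="utf-8") as dst:
>            for line in src:
>                english, _, old = line.rstrip("\n").partition("\t")
>                new = replacements.get(english, old)
>                dst.write(f"{english}\t{new}\n")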
>
>
I want to use this with MediaWiki for external links from articles, to
allow real-time translation of linked content from other sites
Wikipedia refers to when they are in English. I have not addressed
going from non-English into other languages, but these extensions could
be developed, and I am not opposed to opening up the translator. I want
to get further along with the language-neutral abstraction layers in
WikiTrans before we open it up.
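
(As a sketch of that front end -- hypothetical, Python standard library
only, with the translation step reduced to the toy lexicon pass from
the earlier sketch; real HTML would need tag-aware handling:)

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.request import urlopen

    LEXICON = {"the": "il", "cat": "gatto", "sleeps": "dorme"}  # sample

    def translate(text):
        return " ".join(LEXICON.get(w.lower(), w) for w in text.split())

    class TranslatingProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            # The request path carries the target URL,
            # e.g. /http://example.com/
            page = urlopen(self.path.lstrip("/")).read()
            body = translate(page.decode("utf-8", "replace"))
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.end_headers()
            self.wfile.write(body.encode("utf-8"))

    if __name__ == "__main__":
        HTTPServer(("", 8080), TranslatingProxy).serve_forever()
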
Jeff
>If you and the other non-English editors are interested, download and
>populate the lexicons (I am at 230,000 words and phrases at present),
>and every time Wikipedia posts an XML dump, I'll machine translate it,
>and you can download the results and use them to populate the other
>wikis. We will need a system to exclude articles already reviewed and
>translated.
>
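>(That exclusion system could be as simple as a file of already-reviewed
>titles that each run skips -- a hypothetical sketch:)
>
>    def load_reviewed(path="reviewed_titles.txt"):
>        # One reviewed article title per line; a missing file means
>        # nothing has been reviewed yet.
>        try:
>            with open(path, encoding="utf-8") as f:
>                return {line.strip() for line in f}
>        except FileNotFoundError:
>            return set()
>
>    def needs_translation(title, reviewed):
>        # Skip anything a human has already reviewed and translated.
>        return title not in reviewed
>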
>My wife and son-in-law are going to help me create the German lexicons
>and rule sets; the other languages are wide open.
>
>Jeff