[Foundation-l] WikiTrans Support for All Languages
Jeffrey V. Merkey
jmerkey at wolfmountaingroup.com
Fri Aug 18 03:57:43 UTC 2006
Sabine Cretella wrote:
>Yes, this is a wikimedia project, but has less than 10,000 articles.
>Therefore your entry does not go into the first part of the list but
>under "other languages". And that is where I now moved your entry.
>
>I hope you understand that we know well how relevant Cherokee is among
>the regional languages (please don't take the term "regional" the wrong way).
>Nonetheless all of us have to follow these rules and many languages are
>still below the 10,000 hurdle. I hope to see the number 1000 soon - let
>me know when you reach it ... it is hard work for a small community.
>Once you have 1000 the step to reach 5000 is much faster and from there
>to 10,000 is faster again.
>
>I know you are doing loads of work with machine translation on the
>non-wikimedia wiki - unfortunately, up to now I did not manage to
>install a proper font to be able to read that wiki - I tried more than
>once, but for some reason my computer refuses to install the font :-(
>
>Best wishes from Italy,
>
>Sabine
>
>
Sabine,
I think it's time to approach this subject based on your explanation of
language evolution. I would like to make you an offer to speed up Italian
translation (and any other language) for your Wiki.
At present, I have WikiTrans adapted to read XML dumps from the English
Wikipedia, dissect the English using the CMU link grammar parser, perform
a word-by-word lexicon translation, then run a conjugator and verb
constructor over the translated sentences, reordering noun-verb pairs and
morphemes, and output into a target language with proper person, tense,
and plurality. WikiTrans is set up to use hierarchical lexicons and
hierarchical thesauruses for English.
I can convert over 98% of English at present. The lexicon, thesaurus,
conjugators, and inference engine are language
neutral by design -- I can use it on any language.
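Roughly, the stages look like this -- a toy sketch in Python with made-up
names and a three-entry lexicon, just to illustrate the flow; the real
engine uses the link grammar parser and the full hierarchical lexicons
rather than anything this naive:

# Toy sketch of the WikiTrans stages (illustrative only).
# A real run uses the CMU link grammar parser and hierarchical lexicons;
# a trivial tokenizer and a three-entry lexicon stand in for them here.

TOY_LEXICON = {        # English headword -> target-language stem (made up)
    "the": "il",
    "wolf": "lupo",
    "runs": "corr-",
}

def parse(sentence):
    """Stand-in for the link grammar parse: tag each token crudely."""
    tokens = sentence.lower().strip(".").split()
    return [(tok, "VERB" if tok in ("run", "runs") else "OTHER")
            for tok in tokens]

def lexicon_lookup(tagged):
    """Word-by-word lexicon translation; unknown words pass through."""
    return [(TOY_LEXICON.get(tok, tok), tag) for tok, tag in tagged]

def conjugate(stem, person="3sg", tense="present"):
    """Toy conjugator: attach an ending chosen by person and tense."""
    endings = {("3sg", "present"): "e"}
    return stem.rstrip("-") + endings.get((person, tense), "")

def translate(sentence):
    # Noun-verb reordering and morpheme handling are omitted here.
    out = []
    for word, tag in lexicon_lookup(parse(sentence)):
        out.append(conjugate(word) if tag == "VERB" else word)
    return " ".join(out)

print(translate("The wolf runs."))   # -> "il lupo corre"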
I also have an AI inference engine which can tune the output to specific
dialects. It works as follows: after I output into the target language, in
order to "teach" the inference engine a set of rules, I take a dozen or so
articles, go through them by hand and correct them, then run the inference
engine, comparing the two and recording tense and other minor corrections
-- in other words, it "learns" how to construct properly for a unique
dialect. I can currently tune it to output into Giduwa or Otali in
Cherokee by altering the lexicon hierarchy for a target dialect.
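The compare-and-record step is conceptually simple; here is an
illustrative sketch using Python's difflib (the real inference engine
builds hierarchical rule tables, not a flat counter like this):

# Sketch of the "learning" pass: diff the machine output against the
# hand-corrected article and count each replacement as a candidate rule.
import difflib
from collections import Counter

def record_corrections(machine_text, corrected_text, rules):
    machine = machine_text.split()
    corrected = corrected_text.split()
    matcher = difflib.SequenceMatcher(None, machine, corrected)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            # e.g. a wrong tense fixed by the proofreader
            rules[(" ".join(machine[i1:i2]), " ".join(corrected[j1:j2]))] += 1

rules = Counter()
record_corrections("il lupo correva veloce", "il lupo corre veloce", rules)
record_corrections("il cane correva qui", "il cane corre qui", rules)
print(rules.most_common())  # [(('correva', 'corre'), 2)] -- a tense rule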
I can parse and translate the entire Wikipedia 7 GB XML dump in about 15
minutes. What can be done here is for me to output into, say, Italian and
speed up the process of editing for a target language and the inputting of
articles by a proofreader. In the current version, every fifth article or
so in Cherokee requires me to proofread it and correct very subtle errors
in tense or noun disambiguation, so it's not perfect, but it's better than
writing them by hand, and most of it is over 95% correct.
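For anyone wondering how a dump that size can be handled without reading
it all into memory, a streaming parse along these lines is enough -- a
sketch only, with a placeholder where the real pipeline would run, and the
namespace string should be whatever the dump actually declares:

# Sketch: stream a MediaWiki XML dump page by page rather than loading
# the whole 7 GB at once.  translate_article() is a placeholder.
import xml.etree.ElementTree as ET

# Take the namespace from the <mediawiki> element of the dump you have.
NS = "{http://www.mediawiki.org/xml/export-0.3/}"

def stream_pages(dump_path):
    for event, elem in ET.iterparse(dump_path, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            yield title, text
            elem.clear()        # release the children we just processed

def translate_article(wikitext):
    return wikitext             # placeholder for the actual pipeline

for title, wikitext in stream_pages("enwiki-pages-articles.xml"):
    translated = translate_article(wikitext)
    # write 'translated' out somewhere for proofreading ...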
If you are interested, download the lexicons and Roget thesaurus from the
FTP server at ftp.wikigadugi.org/wiki, replace the Cherokee words with
Italian or whatever, and let me know where to get them. I will then post
runs for any target language, and you can correct a dozen or so long
articles and send me the files back; I will run the inference engine and
create table rules that would give you close to 98% accuracy when compared
against the English Wikipedia. It will rapidly speed translations to other
languages and give the non-English wikis a better chance of catching up.
After we build the syntax rule databases for each language, I'll give them
back to you with an additional extension that will allow you not only to
translate Wikipedia, but also to put a front end on a proxy server and let
web browsers access websites and translate pages and web content in real
time, similar to what Google is doing now.
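The lexicon swap itself is mechanical. Assuming a simple tab-separated
"english word <TAB> target word" layout (the actual files on the FTP
server may be laid out differently), it amounts to something like:

# Sketch of repopulating a lexicon for a new target language.  The
# tab-separated layout is an assumption for illustration only; check the
# actual files on ftp.wikigadugi.org/wiki.
def repopulate_lexicon(in_path, out_path, new_translations):
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if "\t" not in line:
                continue        # skip comments or blank lines
            english, _old_target = line.rstrip("\n").split("\t", 1)
            new_target = new_translations.get(english, "")  # blank = to do
            dst.write(english + "\t" + new_target + "\n")

# e.g. repopulate_lexicon("lexicon-chr.txt", "lexicon-it.txt", {"wolf": "lupo"})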
If you and the other non-English editors are interested, download and
populate the lexicons (I am at 230,000 words and phrases at present), and
every time Wikipedia posts an XML dump, I'll machine-translate them and
you can download them and use them to populate the other wikis. We will
need a system to exclude articles already reviewed and translated.
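That exclusion list could start out as nothing more than a shared file of
article titles that have already been reviewed, checked before each run --
the file name and format below are only a suggestion:

# Sketch: skip articles that have already been reviewed and translated.
# A plain text file with one title per line would be enough to start with.
def load_reviewed(path="reviewed-titles.txt"):
    try:
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()

reviewed = load_reviewed()
for title in ("Cherokee", "Italy", "Wolf"):
    if title not in reviewed:
        print("translate:", title)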
My wife and son-in-law are going to help me get the German lexicons and
rule sets created; the other languages are wide open.
Jeff