New subject: [Foundation-l] WkiTrans Update/Swahili Machine Translation 20060817 (Corrected FTP URL)

30 Aug 2006

I have made another machine translation run. For this run I removed
particle insertion, removed the erroneous Swahili lexicons identified by
Martin Benjamin, and recompiled the Swahili thesaurus based solely upon
the Kamusi Swahili lexicons, which Martin states are only partially
completed and possibly contain some ambiguities. Future runs of this
project will be posted and announced after application of the grammar
rules and the full conjugation and sentence decomposition and
reconstruction rule sets based upon Dr. Benjamin's parsing rules. That
may be a month or two from now, after more work is done on the grammar
parser for this language. One other challenge is language drift toward
Arabic: it was explained to me that Swahili and many other African
languages have drifted to incorporate Arabic language derivatives, which
may require overlapping rule sets to machine translate properly.

I have activated the English link grammar parser for this second run
and have begun using word pairing against the Kamusi lexicons, which are
not yet set up to fully handle these cases (but are well on their way to
this goal). Cherokee (like most native languages) produces words which
are complete, self-contained morphemes, and word meanings are typically
not split across word pairs as appears to be the case in Swahili. The
Cherokee parsers and lexicons are also a lot further along, having been
in development by our linguists for several years for this precise
application (in Cherokee, each complex verb is in fact an entire
self-contained sentence of sorts - and some nouns as well). As Martin
points out, Swahili has a lot more work to go to get to the same point
the machine translator for Native American languages has already
reached, with comprehensive lexicons and grammar rule sets for machine
translation. Nonetheless, the tremendous potential Wikipedia machine
translation holds for African languages is compelling enough for the
Wolf Mountain Group to approve funding to move this effort forward,
along with any other interested African languages, in support of the
Wikimedia Foundation's projects and goals for African communities.
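
As a rough illustration of what word pairing against a lexicon can look
like, here is a minimal sketch in Python. It is not the WikiTrans code
and does not use the Kamusi data: the function name pair_lookup and the
three-entry toy lexicon are assumptions made up for this example. It
simply tries each adjacent word pair before falling back to single
words.

```python
def pair_lookup(tokens, lexicon):
    """Translate tokens, trying each adjacent word pair before single words.

    Hypothetical helper for illustration only; a real translator would
    also apply grammar and conjugation rules.
    """
    out = []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if i + 1 < len(tokens) and pair in lexicon:
            out.append(lexicon[pair])   # two source words, one entry
            i += 2
        elif tokens[i] in lexicon:
            out.append(lexicon[tokens[i]])
            i += 1
        else:
            out.append(tokens[i])       # leave unknown words untranslated
            i += 1
    return out


# Toy English-to-Swahili entries, chosen only to show the mechanism.
TOY_LEXICON = {"thank you": "asante", "water": "maji", "book": "kitabu"}

print(pair_lookup(["thank", "you", "water"], TOY_LEXICON))
```

A lexicon that lacks pair entries would fall through to the single-word
branch every time, which is the situation described above for lexicons
not yet set up to handle these cases.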

I still anticipate we can get to 90% by the end of autumn. This project
will remain under development, and regular updates will be posted to the
machine translations page set up by Sabine on Meta for African
languages. These first runs were examples to illustrate the power of
WikiTrans to rapidly create the whole of Wikipedia almost overnight in
another language (provided the lexicons and rule sets the translator
relies upon are complete and accurate). The African languages project
is very useful because it allows further abstractions to be built into
WikiTrans to deal with a multitude of languages for all of Wikimedia's
projects, which is the ultimate goal.

The real value here lies in the grammar and parsing rule sets and the
word pairing logic for each language and dialect. Over time, WikiTrans
will develop a large body of these rule sets and lexicons for all
interested languages we target. Rule sets may or may not be published,
depending on the project and the interests of the contributors. Rule
sets for French, Spanish, German, Dine, Italian, and other popular and
pervasive languages will certainly be published sometime this fall, so
folks interested in porting a language to WikiTrans can do so by writing
rule sets and lexicons and submitting them to the project for test runs.

The latest run for Swahili is at:

http://sw.wikigadugi.org

The latest lexicons, thesaurus, and XML dumps are at:

ftp://ftp.wikigaudgi.org/africa

Jeff
