[Foundation-l] WikiTrans Update/Swahili Machine Translation 20060817 (Corrections)
Jeffrey V. Merkey
jmerkey at wolfmountaingroup.com
Wed Aug 30 06:43:14 UTC 2006
I have made another machine translation run. For this run I removed
particle insertion and the erroneous Swahili lexicons identified by
Martin Benjamin, and recompiled the Swahili thesaurus based solely upon
the Kamusi Swahili lexicons, which Martin states are only partially
completed and possibly contain some ambiguities. Future runs of this
project will be posted and announced after application of the grammar
rules and the full conjugation and sentence decomposition and
reconstruction rule sets based upon Dr. Benjamin's parsing rules. That
may be a month or two from now, after more work is done on the grammar
parser for this language. One other challenge is language drift toward
Arabic: it was explained to me that Swahili and many other African
languages have drifted to incorporate Arabic language derivatives,
which may require overlapping rule sets to machine translate properly.
I have activated the English link grammar parser for this second run
and have begun using word pairing against the Kamusi lexicons, which are
not yet set up to fully handle these cases (but are well on their way to
this goal). The Cherokee language (and most native languages) produces
words which are complete, self-contained morphemes, and word meanings
are typically not split across word pairs as appears to be the case in
Swahili. The Cherokee parsers and lexicons are also a lot further
along, having been in development by our linguists for several years
for this precise application (in Cherokee, each complex verb is in fact
an entire self-contained sentence of sorts - and some nouns as well).
As Martin points out, this language has a lot more work to go to get to
the same point the machine translator for Native American languages has
already reached, with comprehensive lexicons and grammar rule sets for
machine translation. Nonetheless, the tremendous potential Wikipedia
machine translation holds for African languages is compelling enough
for the Wolf Mountain Group to approve funding for this effort, to move
it forward along with any other interested African languages in support
of the Wikimedia Foundation's projects and goals for African
communities.
I still anticipate we can get to 90% by the end of autumn. This project
will remain under development, and regular updates will be posted to
the machine translations page set up by Sabine on Meta for African
languages. These first runs were examples to illustrate the power of
WikiTrans to rapidly create the whole of Wikipedia in another language
almost overnight (provided the lexicons and rule sets are complete and
accurate for the translator to rely upon). The African languages
project is very useful because it allows further abstractions to be
instrumented in WikiTrans to deal with a multitude of languages for all
of Wikimedia's projects, which is the ultimate goal.
The real value here is in the grammar and parsing rule sets and word
pairing logic for each language and dialect. Over time, WikiTrans will
develop a large body of these rule sets and lexicons for all interested
languages we target. Rule sets may or may not be published, depending on
the project and the interests of the contributors. Rule sets for French,
Spanish, German, Dine, Italian, and other popular and pervasive
languages will certainly be published sometime this fall, so folks
interested in porting a language to WikiTrans can do so by writing rule
sets and lexicons and submitting them to the project for test runs.
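
To give a rough idea of what word pairing against a lexicon can look
like, here is a minimal sketch in Python. It is not the WikiTrans code
and does not reflect the Kamusi data format: the function name
pair_words, the TOY_LEXICON table, and its three entries are all
hypothetical and chosen only for illustration. The sketch assumes a
greedy longest-match lookup, trying a two-word pair before falling back
to a single word, and passes unknown words through untranslated.

```python
# Toy lexicon: keys are tuples of lowercase English words, values are
# Swahili glosses. Illustrative entries only, not taken from Kamusi.
TOY_LEXICON = {
    ("thank", "you"): "asante",
    ("water",): "maji",
    ("book",): "kitabu",
}


def pair_words(tokens, lexicon, max_len=2):
    """Greedy longest-match lookup: try word pairs before single words."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in lexicon:
                out.append(lexicon[key])
                i += n
                break
        else:
            # No lexicon entry: pass the source word through unchanged.
            out.append(tokens[i])
            i += 1
    return out


print(pair_words("Thank you for the book".split(), TOY_LEXICON))
# prints ['asante', 'for', 'the', 'kitabu']
```

A real rule set would also need the grammar and conjugation handling
described above; this only shows the lookup step, where a meaning split
across a word pair is matched as one unit.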
Latest run for Swahili is at:
http://sw.wikigadugi.org
Latest lexicons, thesaurus, and XML dumps are at:
ftp://ftp.wikigaudgi.org/africa
Jeff