I have made another machine translation run. For this run I removed particle insertion and the erroneous Swahili lexicons identified by Martin Benjamin, and recompiled the Swahili thesaurus based solely on the Kamusi Swahili lexicons, which Martin states are only partially complete and may contain some ambiguities. Future runs of this project will be posted and announced once the grammar rules and the full conjugation, sentence decomposition, and reconstruction rule sets based on Dr. Benjamin's parsing rules have been applied. That may be a month or two from now, after more work is done on the grammar parser for this language. One other challenge is language drift toward Arabic: it was explained to me that Swahili and many other African languages have incorporated Arabic derivatives, which may require overlapping rule sets to machine translate properly.
I have activated the English link grammar parser for this second run and have begun using word pairing against the Kamusi lexicons, which are not yet set up to fully handle these cases (but are well on their way to this goal). Cherokee (like most Native languages) produces words that are complete, self-contained morphemes, and word meanings are typically not split across word pairs, as appears to be the case in Swahili. The Cherokee parsers and lexicons are also a lot further along, having been in development by our linguists for several years for this precise application (in Cherokee, each complex verb is in fact an entire self-contained sentence of sorts, and so are some nouns). As Martin points out, Swahili has a lot more work ahead of it to reach the point the machine translator for Native American languages has already reached, with comprehensive lexicons and grammar rule sets for machine translation. Nonetheless, the tremendous potential Wikipedia machine translation holds for African languages is compelling enough for the Wolf Mountain Group to approve funding to move this effort forward, along with any other interested African languages, in support of the Wikimedia Foundation's projects and goals for African communities.
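To illustrate what I mean by word pairing, here is a minimal sketch in Python. It is not the WikiTrans code, and the function name, the lexicon layout, and the placeholder target strings (SW_SEEK and so on) are all hypothetical. It assumes a lexicon keyed by lowercase source text, where some keys are two-word pairs, and it simply tries the pair before falling back to the single word:

```python
def pair_lookup(tokens, lexicon):
    """Greedy left-to-right lookup that tries a two-word entry before a single word."""
    out = []
    i = 0
    while i < len(tokens):
        single = tokens[i].lower()
        pair = " ".join(tokens[i:i + 2]).lower()
        if i + 1 < len(tokens) and pair in lexicon:
            out.append(lexicon[pair])      # the meaning spans both words
            i += 2
        elif single in lexicon:
            out.append(lexicon[single])    # ordinary one-word entry
            i += 1
        else:
            out.append(tokens[i])          # unknown word: pass it through untouched
            i += 1
    return out


# Toy lexicon with placeholder targets (hypothetical, not real Kamusi data).
toy_lexicon = {
    "look for": "SW_SEEK",
    "look": "SW_LOOK",
    "book": "SW_BOOK",
}

result = pair_lookup(["Look", "for", "the", "book"], toy_lexicon)
```

A real lexicon would also need inflected forms and ambiguity handling, which is where the grammar rule sets come in.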
I still anticipate we can get to 90% by the end of Autumn. This project will remain under development, and regular updates will be posted to the machine translations page set up by Sabine on Meta for African languages. These first runs were examples to illustrate the power of WikiTrans to rapidly create the whole of Wikipedia in another language almost overnight (provided the lexicons and rule sets the translator relies upon are complete and accurate). The African languages project is also very useful because it allows further abstractions to be built into WikiTrans to deal with a multitude of languages for all of Wikimedia's projects, which is the ultimate goal.
The real value here is in the grammar and parsing rule sets and the word pairing logic for each language and dialect. Over time, WikiTrans will develop a large body of these rule sets and lexicons for all interested languages we target. Rule sets may or may not be published, depending on the project and the interests of the contributors. Rule sets for French, Spanish, German, Dine, Italian, and other popular and pervasive languages will certainly be published sometime this fall, so folks interested in porting a language to WikiTrans can do so by writing rule sets and lexicons and submitting them to the project for test runs.
Latest run for Swahili is at:
Latest lexicons, thesaurus, and xml dumps are at:
ftp://ftp.wikigaudgi.org/africa
Jeff
Jeffrey V. Merkey wrote:
Latest run for Swahili is at:
Latest lexicons, thesaurus, and xml dumps are at:
ftp://ftp.wikigadugi.org/africa
Corrected.
Jeff
foundation-l mailing list foundation-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/foundation-l