Jeff,
I applaud you for your initiative - your effort is impressive, albeit unreadable. I'll give my feedback in this post, and then suggest we take the discussion of the specifics of Swahili translation off-list (and welcome others who want to keep track of this thread to email us to stay in the cc loop). The last 2 or 3 paragraphs of this post do speak to the wider discussion list, so other readers might wish to SKIP TOWARD THE BOTTOM.
The first problem derives from your sources. The first source, "public swahili lexicon," is a useless set of about 1000 nouns, adjectives, and conjunctions, essentially a tourist vocabulary without any verbs. I would be surprised if that list gave any pairings that weren't also in the other lists. The third source, "rogets thesaurus in swahili," is one I would like to know more about, but is not useful for machine translation purposes in the configuration you've set up - for example, scroll down to line 51382, and look at the following 100-odd pairs for "idhini" in no particular order, with no way to distinguish among parts of speech, shades of meaning, relative frequency, etc. However, I was heartened to see line 45405 and following; I'm sure that if any wikipedia entries need to be translated that include "assify," "torpedinous," or "macht nichts," this thesaurus will prove quite handy. It looks like someone started with a smallish Swahili-English wordlist, plugged that into an English thesaurus, and extrapolated dozens of additional English equivalents per word, yielding an intriguing but lexicographically suspect set of equivalencies.
Which leads us to the Kamusi Project as a source. I will be the first to say that the Kamusi is a pretty good Swahili dictionary that will one day be a great Swahili dictionary, but at the moment contains significant weaknesses that prevent it from being a reliable source for machine translation. The first issue is the quality of the data. The initial data were manually input from an existing print dictionary to which we were granted copyright permission. Unfortunately, the students entering the data, before we programmed the Edit Engine, introduced a lot of errors. I am currently in the process of going through the database entry by entry, fixing those errors and adding in new heaps of data, including information for many data fields that we hadn't introduced during the initial data entry phase. This is an incredibly time consuming, research-intensive task, and I don't foresee having a Swahili->English dictionary that I am really happy with for another couple of years (at best - the thesaurus above, and my wife, would describe our current funding situation as "pauperized").
The Kamusi lexicon is much better as a Swa->Eng source than as an English->Swahili dictionary, because that is the direction in which we've input most of the initial data. The magic of databases makes it possible to have our data available bi-directionally, but the E->S version of the Kamusi needs its own careful review. That review can only come after the S->E data are thoroughly updated. Most especially, precious few E->S entries have been arranged with the Grouping Tool ( http://research.yale.edu/swahili/serve_pages/groupingtool_en.php ), so most entries appear in an arbitrary order that does not account for homographs, differing senses, frequency, etc. So, it would be premature to use the E->S Kamusi lexicon as a platform for machine translation, even though we do intend to get there.
When the data are ready for machine use, the program would also need to check the four "alternate spellings" fields, to pick up all the color v. colour issues that occur in both English and Swahili. Also, I would think that you would want to keep part of speech info associated with each line, which would make it much easier to employ grammar rules. A grammar hint: in Swahili, the adjective always comes *after* the noun that it modifies, except for the words "kila," "nusu," and "robo", and a few other cases, including the numbers preceding "elfu" for thousands between 11,000 and 99,000.
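To make the word-order hint concrete, here is a minimal sketch in Python. It assumes each translated token already carries a part-of-speech tag; the tag names ("adj", "noun"), the PRENOMINAL set, and the function name are hypothetical, and the sketch ignores the number cases before "elfu" as well as adjective agreement with the noun class.

```python
# Words that stay in front of the noun, per the grammar hint above.
PRENOMINAL = {"kila", "nusu", "robo"}

def reorder_adjectives(tokens):
    """Move adjectives after the noun they precede.

    tokens is a list of (word, pos) pairs in English word order.
    """
    out = []
    pending = []  # adjectives waiting for their noun
    for word, pos in tokens:
        if pos == "adj" and word not in PRENOMINAL:
            pending.append((word, pos))
        elif pos == "noun":
            out.append((word, pos))
            out.extend(pending)  # adjectives now follow the noun
            pending = []
        else:
            out.extend(pending)  # no noun followed; leave order alone
            pending = []
            out.append((word, pos))
    out.extend(pending)
    return out

# "big house" in English order becomes "nyumba kubwa"
print(reorder_adjectives([("kubwa", "adj"), ("nyumba", "noun")]))
```

Keeping the part of speech on every line of the lexicon is what makes a rule like this possible at all.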
Another hint: Swahili does not use articles, so you need to get rid of most attempts at translations of a/ an/ the. When an article is absolutely necessary (which a computer would have a difficult time predicting), Swahili uses variations of "one" for a/ an, and "that" for the. Just getting rid of the articles in your articles would be a 100% improvement (bringing them up to 2% readable).
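A rough sketch of that cleanup step in Python, run on the English text before any word-by-word lookup. The function name is made up for illustration, and the sketch deliberately ignores the rare cases where an article has to be rendered with a form of "one" or "that".

```python
# English articles that usually have no Swahili counterpart.
ARTICLES = {"a", "an", "the"}

def strip_articles(sentence):
    """Drop a/an/the from an English sentence before dictionary lookup."""
    tokens = sentence.split()
    kept = [t for t in tokens if t.lower() not in ARTICLES]
    return " ".join(kept)

print(strip_articles("The farmer has a hoe"))  # farmer has hoe
```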
Ok, now assume we have good data, with a good way of predicting which words were appropriate in which circumstances (something that will eventually be aided by the work now being done toward building a central OmegaT database), and a good set of grammar rules. You would still need to deal with the agglutinative Swahili verb in all its glory. The Kamusi Project has a good parser embedded in our Swahili->English search, which disentangles the front end of any conjugated Swahili verb according to an analysis of every grammatical rule in the language. (We have a similar analysis completed and written in pseudo-code for the back end, the verbal extensions, but ran out of money and had to lay off our programmer before we could code it into the search engine.) Even taking advantage of our parser, your translating software would need to go the other way, building Swahili verbs from conjugated English verbs. You would need to account for the noun classes of each noun that is referred to in the verb (as many as three different nouns, each of which is either one of four different conversational participants or belongs to one of 16 different noun classes), which involves trivial calls to our database once you've identified the appropriate elements in the English sentence and chosen the relevant nouns - the "class" field is the key here. The real problem comes from conjugated English verbs. You need some way of knowing that "catches/ caught" relate to "catch," which would involve a database of English verbs and their irregular forms, and then you would need to map the various movable elements of the English sentence to the appropriate fixed points of the Swahili verb. Not an impossible task to achieve to 90% over time, but not nearly as straightforward as you are hoping.
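The "catches/ caught" lookup might be sketched like this in Python. Everything here is an assumption for illustration: the IRREGULAR table holds only a few sample entries, the suffix stripping is deliberately naive, and none of it reflects the Kamusi Project parser. A real system would need a full database of English verb forms.

```python
# Tiny sample table; a real one would cover every irregular English verb.
IRREGULAR = {"caught": "catch", "went": "go", "ate": "eat"}

def english_lemma(verb):
    """Map a conjugated English verb to its dictionary form (naively)."""
    v = verb.lower()
    if v in IRREGULAR:
        return IRREGULAR[v]
    # Crude fallback for regular endings; wrong for many verbs.
    for suffix in ("es", "ed", "s"):
        if v.endswith(suffix) and len(v) > len(suffix) + 2:
            return v[: -len(suffix)]
    return v

print(english_lemma("caught"), english_lemma("catches"))  # catch catch
```

Only once the English verb is reduced to its base form can the tense, subject, and object markers be mapped onto the fixed slots of the Swahili verb.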
Of course, this is all for Swahili, for which we have a pretty good initial lexicon en route to becoming excellent, a complete description of grammatical rules, and an accepted, unicoded orthography. Most other African languages, even those spoken by millions of people, are missing some or all of those elements in digital form. So, even if you could get pretty good machine translation of Wikipedia for Swahili, you would still be a long, long way away from rolling with other languages.
And we still haven't dealt with content. What's to say that content that is appropriate for the English Wikipedia is appropriate for the Swahili Wikipedia? For example, the entry for Agriculture. It begins by discussing the derivation of the word "agriculture," which is of course irrelevant for Swahili. Then it carries an unacknowledged POV about modern agriculture (as though the vast numbers of Africans who earn their livings with hand hoes are pre-modern museum relics, and let's not even click on the link to "subsistence farming" that talks about "life outside of modern society"), and essentially ignores all of the issues of raising crops on small farms that would be of immediate interest to an African farmer logging in from an internet kiosk. (Comment to those who fear paternalism in this endeavor: the people I live and work among in Tanzania express a huge interest in having access to this sort of information, although they are not in a position to contribute to the development of the resource.) So, an African farmer trying to combat an insect infestation on her farm would find a translation of an English "agriculture" article that focuses on technology-intensive farming to be much less useful than an article started almost from scratch that addressed farming in the context of speakers of that language. It just happens that "agriculture" was the second article I clicked to by following links from the initial article on the pseudo-Swahili test site - what similar issues would arise on the fourth article, or the tenth, or the 997,032nd?
There's also the issue that a great many of the current English Wikipedia articles are works in progress, of varying quality. Would you do a one-time machine translation of the current Wikipedia, and ignore all future edits? Translate only "stable" versions? Re-translate articles every time there is a change? Re-translate every time the Kamusi Project data are updated (hundreds of times a week)? Have the machine overwrite manual edits that someone did to machine translations, when the English version changes? Do this for dozens of African languages, and hundreds of languages around the world?
I don't want to dismiss the entire endeavor, although I've been working on these issues for long enough to be sure that the undertaking is much more complicated than you're estimating. Here's where I think your translation project might prove useful: if a speaker of, for example, Swahili went searching for an entry that didn't already exist in the Swahili Wikipedia, an application could build a version of that page on-the-fly from the English version that is current at that moment. The Wikipedia user could then either (a) glean whatever information she could from the article and move on, (b) laugh uproariously, or (c) go into edit mode, work to turn the machine translation into something readable in Swahili, and save that version - which would then become the baseline page for that entry in that language, from which future edits could take off. In this way, you would get the best of both worlds - good articles written in the actual language whenever possible, and fingertip access to rough machine translations from English when articles are not initially available in the target language.
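The lookup order in that proposal could be sketched as follows in Python. This is only an outline under loud assumptions: the two dictionaries stand in for the Swahili and English Wikipedias, `translate` is whatever machine translation function eventually exists, and the sketch pretends that article titles match across languages, which they would not.

```python
def fetch_article(title, swahili_pages, english_pages, translate):
    """Return (text, origin) for a reader of the Swahili Wikipedia."""
    # A human-written article in the target language always wins.
    if title in swahili_pages:
        return swahili_pages[title], "human"
    # Otherwise build a rough version on the fly from current English.
    if title in english_pages:
        return translate(english_pages[title]), "machine"
    return None, "missing"

def save_edit(title, edited_text, swahili_pages):
    """A saved human edit becomes the baseline page from then on."""
    swahili_pages[title] = edited_text
```

Because the machine version is never stored, none of the re-translation questions arise: it is regenerated from the current English text each time, and it stops being served the moment someone saves a real Swahili version.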
Martin,
Thanks for the phone call today. Let's get going on taking this off-list and getting the parser and grammar rules put together for the translator. I think we could get this machine-assisted translation to over 90% after application of a set of grammar rules. I wanted to demonstrate how rapidly this could be done and embrace the full Wikipedia. As I said in the earlier post, I have not applied any grammar, tensing, word pairing, or other rules, just a simple word-by-word and phrase translation. I am sending the materials we discussed, and I anticipate we can get to 95% accuracy by the end of autumn. This would provide an excellent basis for your African Languages Programs.
P.S. I agree that the public swahili dictionary is junk, BTW.
Jeff
Martin Benjamin wrote:
It looks like someone started with a smallish Swahili-English wordlist, plugged that into an English thesaurus, and extrapolated dozens of additional English equivalents per word, yielding an intriguing but lexicographically suspect set of equivalencies.
Actually, I used yours from the Kamusi project. All of this needs review and correction. I went through the Cherokee versions and corrected them by hand. Took weeks of tedious work, and we are still working on more.
But for four hours of work, pretty good progress .... :-)
:-)
Jeff
wikimedia-l@lists.wikimedia.org