[Foundation-l] Swahili Machine Translation First Run Completed for enwiki-20060817

Tue Aug 29 17:53:10 UTC 2006

Jeff,

I applaud you for your initiative - your effort is impressive, albeit 
unreadable.  I'll give my feedback in this post, and then suggest we 
take the discussion of the specifics of Swahili translation off-list 
(and welcome others who want to keep track of this thread to email us to 
stay in the cc loop).  The last 2 or 3 paragraphs of this post do speak 
to the wider discussion list, so other readers might wish to SKIP TOWARD 
THE BOTTOM.

The first problem derives from your sources.  The first source, "public 
swahili lexicon," is a useless set of about 1000 nouns, adjectives, and 
conjunctions, essentially a tourist vocabulary without any verbs.  I 
would be surprised if that list gave any pairings that weren't also in 
the other lists.  The third source, "rogets thesaurus in swahili," is 
one I would like to know more about, but is not useful for machine 
translation purposes in the configuration you've set up - for example, 
scroll down to line 51382, and look at the following 100-odd pairs for 
"idhini" in no particular order, with no way to distinguish among parts 
of speech, shades of meaning, relative frequency, etc.  However, I was 
heartened to see line 45405 and following; I'm sure that if any 
wikipedia entries need to be translated that include "assify," 
"torpedinous," or "macht nichts," this thesaurus will prove quite handy.
It looks like someone started with a smallish Swahili-English
wordlist, plugged that into an English thesaurus, and extrapolated 
dozens of additional English equivalents per word, yielding an 
intriguing but lexicographically suspect set of equivalencies.

Which leads us to the Kamusi Project as a source.  I will be the first 
to say that the Kamusi is a pretty good Swahili dictionary that will one 
day be a great Swahili dictionary, but at the moment contains 
significant weaknesses that prevent it from being a reliable source for 
machine translation.  The first issue is the quality of the data.  The 
initial data were manually input from an existing print dictionary to 
which we were granted copyright permission.  Unfortunately, the students 
entering the data, before we programmed the Edit Engine, introduced a 
lot of errors.  I am currently in the process of going through the 
database entry by entry, fixing those errors and adding in new heaps of 
data, including information for many data fields that we hadn't 
introduced during the initial data entry phase.  This is an incredibly
time-consuming, research-intensive task, and I don't foresee having a
Swahili->English dictionary that I am really happy with for another 
couple of years (at best - the thesaurus above, and my wife, would 
describe our current funding situation as "pauperized").

The Kamusi lexicon is much better as a Swa->Eng source than as an 
English->Swahili dictionary, because that is the direction in which 
we've input most of the initial data.  The magic of databases makes it 
possible to have our data available bi-directionally, but the E->S 
version of the Kamusi needs its own careful review.  That review can 
only come after the S->E data are thoroughly updated.  Most especially, 
precious few E->S entries have been arranged with the Grouping Tool ( 
http://research.yale.edu/swahili/serve_pages/groupingtool_en.php ), so 
most entries appear in an arbitrary order that does not account for 
homographs, differing senses, frequency, etc.  So, it would be premature 
to use the E->S Kamusi lexicon as a platform for machine translation, 
even though we do intend to get there.

When the data are ready for machine use, the program would also need to 
check the four "alternate spellings" fields, to pick up all the color v. 
colour issues that occur in both English and Swahili.  Also, I would 
think that you would want to keep part of speech info associated with 
each line, which would make it much easier to employ grammar rules.  A 
grammar hint: in Swahili, the adjective always comes *after* the noun 
that it modifies, except for the words "kila," "nusu," and "robo," and a
few other cases, including the numbers preceding "elfu" for thousands 
between 11,000 and 99,000.
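
To make the word-order hint concrete, here is a minimal sketch in Python.
It assumes each word has already been looked up and still carries its
part-of-speech tag.  The tag names "adj" and "noun" and the function name
are my own inventions for illustration, not anything in the Kamusi data,
and the sketch ignores the numbers-before-"elfu" case and noun-class
agreement entirely.

```python
# Adjectives that stay in front of the noun, per the hint above.
PRENOMINAL = {"kila", "nusu", "robo"}


def reorder_adjectives(tagged):
    """Move adjectives after the noun they precede.

    `tagged` is a list of (word, pos) pairs in English word order.
    The "adj"/"noun" tag names are assumptions made for this sketch.
    """
    out = []
    pending = []  # adjectives waiting for the noun they modify
    for word, pos in tagged:
        if pos == "adj" and word not in PRENOMINAL:
            pending.append((word, pos))
        elif pos == "noun":
            out.append((word, pos))
            out.extend(pending)  # adjectives now follow their noun
            pending = []
        else:
            out.extend(pending)  # no noun followed; leave them where they were
            pending = []
            out.append((word, pos))
    out.extend(pending)
    return out


print(reorder_adjectives([("kubwa", "adj"), ("nyumba", "noun")]))
print(reorder_adjectives([("kila", "adj"), ("mtu", "noun")]))
```

Without the part-of-speech tag on each line there is nothing for a rule
like this to act on, which is why I would keep that field attached.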

Another hint: Swahili does not use articles, so you need to get rid of 
most attempts at translations of a/ an/ the.  When an article is 
absolutely necessary (which a computer would have a difficult time 
predicting), Swahili uses variations of "one" for a/ an, and "that" for 
the.  Just getting rid of the articles in your articles would be a 100% 
improvement (bringing them up to 2% readable).
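
As a rough sketch of that first pass, in Python: the function name is
hypothetical, and it simply drops every article.  It makes no attempt to
detect the rare cases where Swahili would need a form of "one" or "that"
instead.

```python
ARTICLES = {"a", "an", "the"}


def drop_articles(tokens):
    """Return the English tokens with a/an/the removed (case-insensitive)."""
    return [t for t in tokens if t.lower() not in ARTICLES]


print(drop_articles("The farmer bought a hoe".split()))
```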

Ok, now assume we have good data, with a good way of predicting which 
words were appropriate in which circumstances (something that will 
eventually be aided by the work now being done toward building a central 
OmegaT database), and a good set of grammar rules.  You would still need 
to deal with the agglutinative Swahili verb in all its glory.  The 
Kamusi Project has a good parser embedded in our Swahili->English 
search, which disentangles the front end of any conjugated Swahili verb 
according to an analysis of every grammatical rule in the language.  (We 
have a similar analysis completed and written in pseudo-code for the 
back end, the verbal extensions, but ran out of money and had to lay off 
our programmer before we could code it into the search engine.)  Even 
taking advantage of our parser, your translating software would need to 
go the other way, building Swahili verbs from conjugated English verbs.
You would need to account for the noun classes of each noun that is
referred to in the verb (as many as three different nouns, each of which 
is either one of four different conversational participants or belongs 
to one of 16 different noun classes), which involves trivial calls to 
our database once you've identified the appropriate elements in the 
English sentence and chosen the relevant nouns - the "class" field is 
the key here.  The real problem comes from conjugated English verbs. 
You need some way of knowing that "catches/ caught" relate to "catch," 
which would involve a database of English verbs and their irregular 
forms, and then you would need to map the various movable elements of 
the English sentence to the appropriate fixed points of the Swahili 
verb.  Reaching 90% over time is not an impossible task, but it is not
nearly as straightforward as you are hoping.
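
Here is a toy Python sketch of the two steps just described: reducing a
conjugated English verb to its base form, then filling the fixed slots
of a Swahili verb (subject marker, tense marker, optional object marker,
stem).  Every table below is a tiny hand-made illustration, not Kamusi
data, and all the names are hypothetical.  The subject and object
markers shown cover only a few conversational participants and people.
Markers for the other noun classes would come from the "class" field.
Verbal extensions, negation, and monosyllabic stems are not handled.

```python
# Illustrative data only; a real system would read these from a verb
# database and a lexicon, neither of which is shown here.
IRREGULAR_FORMS = {"caught": "catch", "catches": "catch",
                   "saw": "see", "sees": "see"}
STEMS = {"see": "ona", "love": "penda"}  # English base form -> Swahili stem

SUBJECT = {"1sg": "ni", "3sg": "a", "1pl": "tu"}  # 3sg "a" is for people only
TENSE = {"present": "na", "past": "li", "future": "ta"}
OBJECT = {"2sg": "ku"}


def english_lemma(form):
    """Map a conjugated English verb to its base form via table lookup."""
    form = form.lower()
    return IRREGULAR_FORMS.get(form, form)


def build_verb(subject, tense, stem, obj=None):
    """Concatenate subject + tense + (object) + stem into one verb."""
    parts = [SUBJECT[subject], TENSE[tense]]
    if obj is not None:
        parts.append(OBJECT[obj])
    parts.append(stem)
    return "".join(parts)


def translate_verb(english_form, subject, tense, obj=None):
    """Lemmatize the English form, look up the stem, assemble the verb."""
    stem = STEMS[english_lemma(english_form)]
    return build_verb(subject, tense, stem, obj)


print(translate_verb("saw", "3sg", "past"))                  # aliona
print(translate_verb("love", "1sg", "present", obj="2sg"))   # ninakupenda
```

The lookups are the easy part.  Deciding from the English sentence which
subject, tense, and object values to pass in is where the real work
lies.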

Of course, this is all for Swahili, for which we have a pretty good 
initial lexicon en route to becoming excellent, a complete description 
of grammatical rules, and an accepted, unicoded orthography.  Most other 
African languages, even those spoken by millions of people, are missing 
some or all of those elements in digital form.  So, even if you could 
get pretty good machine translation of Wikipedia for Swahili, you would 
still be a long, long way away from rolling with other languages.

And we still haven't dealt with content.  What's to say that content 
that is appropriate for the English Wikipedia is appropriate for the 
Swahili Wikipedia?  For example, the entry for Agriculture.  It begins 
by discussing the derivation of the word "agriculture," which is of 
course irrelevant for Swahili.  Then it carries an unacknowledged POV 
about modern agriculture (as though the vast numbers of Africans who 
earn their livings with hand hoes are pre-modern museum relics, and 
let's not even click on the link to "subsistence farming" that talks 
about "life outside of modern society"), and essentially ignores all of 
the issues of raising crops on small farms that would be of immediate 
interest to an African farmer logging in from an internet kiosk. 
(Comment to those who fear paternalism in this endeavor: the people I 
live and work among in Tanzania express a huge interest in having access 
to this sort of information, although they are not in a position to 
contribute to the development of the resource.)  So, an African farmer 
trying to combat an insect infestation on her farm would find a 
translation of an English "agriculture" article that focuses on 
technology-intensive farming to be much less useful than an article 
started almost from scratch that addressed farming in the context of 
speakers of that language.  It just happens that "agriculture" was the 
second article I clicked to by following links from the initial article 
on the pseudo-Swahili test site - what similar issues would arise on the 
fourth article, or the tenth, or the 997,032nd?

There's also the issue that a great many of the current English 
Wikipedia articles are works in progress, of varying quality.  Would you 
do a one time machine translation of the current Wikipedia, and ignore 
all future edits?  Translate only "stable versions"?  Re-translate
articles every time there is a change?  Re-translate every time the
Kamusi Project data are updated (hundreds of times a week)?  Have the
machine overwrite manual edits that someone did to machine translations, 
when the English version changes?  Do this for dozens of African 
languages, and hundreds of languages around the world?

I don't want to dismiss the entire endeavor, although I've been working 
on these issues for long enough to be sure that the undertaking is much 
more complicated than you're estimating.  Here's where I think your 
translation project might prove useful: if a speaker of, for example, 
Swahili went searching for an entry that didn't already exist in the 
Swahili Wikipedia, an application could build a version of that page 
on-the-fly from the English version that is current at that moment.  The 
Wikipedia user could then either (a) glean whatever information she 
could from the article and move on, (b) laugh uproariously, or (c) go 
into edit mode, work to turn the machine translation into something 
readable in Swahili, and save that version - which would then become the 
baseline page for that entry in that language, from which future edits 
could take off.  In this way, you would get the best of both worlds - 
good articles written in the actual language whenever possible, and 
fingertip access to rough machine translations from English when 
articles are not initially available in the target language.
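
The flow I have in mind can be sketched in a few lines of Python.  The
dictionaries stand in for the two wikis and `translate` stands in for
your translation software.  All of these names are hypothetical, and
nothing here reflects how MediaWiki actually stores or serves pages.

```python
def serve_page(title, swahili_pages, english_pages, translate):
    """Return (text, origin) for a requested Swahili page.

    A saved Swahili page always wins.  Otherwise the current English
    text is machine translated at request time and nothing is stored.
    """
    if title in swahili_pages:
        return swahili_pages[title], "human"
    if title in english_pages:
        return translate(english_pages[title]), "machine"
    return None, "missing"


def save_edit(title, text, swahili_pages):
    """Store a reader's cleaned-up version as the new baseline page."""
    swahili_pages[title] = text


# Usage with a stand-in translator that just tags its input.
english = {"Agriculture": "Agriculture is ..."}
swahili = {}
fake_translate = lambda text: "[MT] " + text

print(serve_page("Agriculture", swahili, english, fake_translate))
save_edit("Agriculture", "Kilimo ni ...", swahili)
print(serve_page("Agriculture", swahili, english, fake_translate))
```

Because the machine output is never stored, later changes to the English
article or to the dictionary data cannot overwrite anything a person
has saved.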