[Foundation-l] Swahili Machine Translation First Run Completed for enwiki-20060817
Martin Benjamin
martin.benjamin at aya.yale.edu
Tue Aug 29 17:53:10 UTC 2006
Jeff,
I applaud you for your initiative - your effort is impressive, even if
the output is unreadable. I'll give my feedback in this post, and then
suggest we take the discussion of the specifics of Swahili translation off-list
(and welcome others who want to keep track of this thread to email us to
stay in the cc loop). The last 2 or 3 paragraphs of this post do speak
to the wider discussion list, so other readers might wish to SKIP TOWARD
THE BOTTOM.
The first problem derives from your sources. The first source, "public
swahili lexicon," is a useless set of about 1000 nouns, adjectives, and
conjunctions, essentially a tourist vocabulary without any verbs. I
would be surprised if that list gave any pairings that weren't also in
the other lists. The third source, "rogets thesaurus in swahili," is
one I would like to know more about, but is not useful for machine
translation purposes in the configuration you've set up - for example,
scroll down to line 51382, and look at the following 100-odd pairs for
"idhini" in no particular order, with no way to distinguish among parts
of speech, shades of meaning, relative frequency, etc. However, I was
heartened to see line 45405 and following; I'm sure that if any
wikipedia entries need to be translated that include "assify,"
"torpedinous," or "macht nichts," this thesaurus will prove quite handy.
It looks like someone started with a smallish Swahili-English
wordlist, plugged that into an English thesaurus, and extrapolated
dozens of additional English equivalents per word, yielding an
intriguing but lexicographically suspect set of equivalencies.
Which leads us to the Kamusi Project as a source. I will be the first
to say that the Kamusi is a pretty good Swahili dictionary that will one
day be a great Swahili dictionary, but at the moment contains
significant weaknesses that prevent it from being a reliable source for
machine translation. The first issue is the quality of the data. The
initial data were manually input from an existing print dictionary to
which we were granted copyright permission. Unfortunately, the students
entering the data, before we programmed the Edit Engine, introduced a
lot of errors. I am currently in the process of going through the
database entry by entry, fixing those errors and adding in new heaps of
data, including information for many data fields that we hadn't
introduced during the initial data entry phase. This is an incredibly
time-consuming, research-intensive task, and I don't foresee having a
Swahili->English dictionary that I am really happy with for another
couple of years (at best - the thesaurus above, and my wife, would
describe our current funding situation as "pauperized").
The Kamusi lexicon is much better as a Swa->Eng source than as an
English->Swahili dictionary, because that is the direction in which
we've input most of the initial data. The magic of databases makes it
possible to have our data available bi-directionally, but the E->S
version of the Kamusi needs its own careful review. That review can
only come after the S->E data are thoroughly updated. Most especially,
precious few E->S entries have been arranged with the Grouping Tool (
http://research.yale.edu/swahili/serve_pages/groupingtool_en.php ), so
most entries appear in an arbitrary order that does not account for
homographs, differing senses, frequency, etc. So, it would be premature
to use the E->S Kamusi lexicon as a platform for machine translation,
even though we do intend to get there.
When the data are ready for machine use, the program would also need to
check the four "alternate spellings" fields, to pick up all the color v.
colour issues that occur in both English and Swahili. Also, I would
think that you would want to keep part of speech info associated with
each line, which would make it much easier to employ grammar rules. A
grammar hint: in Swahili, the adjective always comes *after* the noun
that it modifies, except for the words "kila," "nusu," and "robo", and a
few other cases, including the numbers preceding "elfu" for thousands
between 11,000 and 99,000.
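
To illustrate why the part of speech needs to travel with each word,
here is a minimal Python sketch of the adjective rule above. It assumes
the words have already been looked up and tagged; the function name, the
"adj"/"noun" tags, and the tuple layout are hypothetical and not part of
the Kamusi database. It covers only the three exception words named
above, not the "elfu" number case.

```python
# Hypothetical sketch: move adjectives after the noun they modify.
# Only the three exceptions named in the text are handled here.
PRE_NOMINAL = {"kila", "nusu", "robo"}

def reorder_adjectives(tagged):
    """tagged: list of (word, pos) pairs, still in English word order."""
    out = []
    pending = []  # adjectives waiting for the noun that follows them
    for word, pos in tagged:
        if pos == "adj" and word not in PRE_NOMINAL:
            pending.append((word, pos))
        elif pos == "noun":
            out.append((word, pos))
            out.extend(pending)  # adjectives now follow their noun
            pending = []
        else:
            out.extend(pending)  # no noun followed; leave them in place
            pending = []
            out.append((word, pos))
    out.extend(pending)
    return out

# "big book" in English order, then "every person"
print(reorder_adjectives([("kikubwa", "adj"), ("kitabu", "noun")]))
print(reorder_adjectives([("kila", "adj"), ("mtu", "noun")]))
```

Without the tag on each line there is no way to tell which word to
move, which is the point of keeping part of speech info in the data.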
Another hint: Swahili does not use articles, so you need to get rid of
most attempts at translations of a/ an/ the. When an article is
absolutely necessary (which a computer would have a difficult time
predicting), Swahili uses variations of "one" for a/ an, and "that" for
the. Just getting rid of the articles in your articles would be a 100%
improvement (bringing them up to 2% readable).
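
As a rough sketch of that first step, assuming the English text has
already been split into tokens (the function name is hypothetical), the
articles can be dropped before any dictionary lookup happens. This
version does not attempt the rare cases where "one" or "that" is
actually needed.

```python
# Hypothetical sketch: drop English articles before word lookup.
ARTICLES = {"a", "an", "the"}

def strip_articles(tokens):
    """Return the tokens with a/an/the removed (case-insensitive)."""
    return [t for t in tokens if t.lower() not in ARTICLES]

print(strip_articles("The farmer bought a hoe".split()))
# ['farmer', 'bought', 'hoe']
```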
Ok, now assume we have good data, with a good way of predicting which
words are appropriate in which circumstances (something that will
eventually be aided by the work now being done toward building a central
OmegaT database), and a good set of grammar rules. You would still need
to deal with the agglutinative Swahili verb in all its glory. The
Kamusi Project has a good parser embedded in our Swahili->English
search, which disentangles the front end of any conjugated Swahili verb
according to an analysis of every grammatical rule in the language. (We
have a similar analysis completed and written in pseudo-code for the
back end, the verbal extensions, but ran out of money and had to lay off
our programmer before we could code it into the search engine.) Even
taking advantage of our parser, your translating software would need to
go the other way, building Swahili verbs from conjugated English verbs.
You would need to account for the noun classes of each noun that is
referred to in the verb (as many as three different nouns, each of which
is either one of four different conversational participants or belongs
to one of 16 different noun classes), which involves trivial calls to
our database once you've identified the appropriate elements in the
English sentence and chosen the relevant nouns - the "class" field is
the key here. The real problem comes from conjugated English verbs.
You need some way of knowing that "catches/ caught" relate to "catch,"
which would involve a database of English verbs and their irregular
forms, and then you would need to map the various movable elements of
the English sentence to the appropriate fixed points of the Swahili
verb. Getting this 90% right over time is not impossible, but it is not
nearly as straightforward as you are hoping.
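
The English side of that lookup might be sketched as follows. The two
tables are tiny hypothetical stand-ins for the database of English verbs
and irregular forms mentioned above, and the suffix rules are
deliberately naive; a real system would need a full verb list and would
still have to decide whether a given word is being used as a verb at
all.

```python
# Hypothetical stand-ins for a real database of English verbs.
BASE_VERBS = {"catch", "go", "make", "carry", "walk"}
IRREGULAR = {"caught": "catch", "went": "go", "gone": "go", "made": "make"}

def lemma(form):
    """Map a conjugated English verb to its base form, or None if unknown."""
    w = form.lower()
    if w in BASE_VERBS:
        return w
    if w in IRREGULAR:
        return IRREGULAR[w]
    # Try regular endings, accepting a guess only if it is a known verb.
    candidates = []
    if w.endswith("ies"):
        candidates.append(w[:-3] + "y")
    if w.endswith("es"):
        candidates.append(w[:-2])
    if w.endswith("s"):
        candidates.append(w[:-1])
    if w.endswith("ed"):
        candidates += [w[:-2], w[:-1]]
    if w.endswith("ing"):
        candidates += [w[:-3], w[:-3] + "e"]
    for c in candidates:
        if c in BASE_VERBS:
            return c
    return None

for form in ("catches", "caught", "makes", "carries", "walked"):
    print(form, "->", lemma(form))
```

This only recovers the base verb. Mapping tense, subject, and objects
onto the fixed slots of the Swahili verb is the harder step and is not
attempted here.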
Of course, this is all for Swahili, for which we have a pretty good
initial lexicon en route to becoming excellent, a complete description
of grammatical rules, and an accepted, unicoded orthography. Most other
African languages, even those spoken by millions of people, are missing
some or all of those elements in digital form. So, even if you could
get pretty good machine translation of Wikipedia for Swahili, you would
still be a long, long way away from rolling with other languages.
And we still haven't dealt with content. What's to say that content
that is appropriate for the English Wikipedia is appropriate for the
Swahili Wikipedia? For example, the entry for Agriculture. It begins
by discussing the derivation of the word "agriculture," which is of
course irrelevant for Swahili. Then it carries an unacknowledged POV
about modern agriculture (as though the vast numbers of Africans who
earn their livings with hand hoes are pre-modern museum relics, and
let's not even click on the link to "subsistence farming" that talks
about "life outside of modern society"), and essentially ignores all of
the issues of raising crops on small farms that would be of immediate
interest to an African farmer logging in from an internet kiosk.
(Comment to those who fear paternalism in this endeavor: the people I
live and work among in Tanzania express a huge interest in having access
to this sort of information, although they are not in a position to
contribute to the development of the resource.) So, an African farmer
trying to combat an insect infestation on her farm would find a
translation of an English "agriculture" article that focuses on
technology-intensive farming to be much less useful than an article
started almost from scratch that addressed farming in the context of
speakers of that language. It just happens that "agriculture" was the
second article I clicked to by following links from the initial article
on the pseudo-Swahili test site - what similar issues would arise on the
fourth article, or the tenth, or the 997,032nd?
There's also the issue that a great many of the current English
Wikipedia articles are works in progress, of varying quality. Would you
do a one time machine translation of the current Wikipedia, and ignore
all future edits? Translate only "stable" versions? Re-translate
articles every time there is a change? Re-translate every time the
Kamusi Project data are updated (hundreds of times a week)? Have the
machine overwrite manual edits that someone did to machine translations,
when the English version changes? Do this for dozens of African
languages, and hundreds of languages around the world?
I don't want to dismiss the entire endeavor, although I've been working
on these issues for long enough to be sure that the undertaking is much
more complicated than you're estimating. Here's where I think your
translation project might prove useful: if a speaker of, for example,
Swahili went searching for an entry that didn't already exist in the
Swahili Wikipedia, an application could build a version of that page
on-the-fly from the English version that is current at that moment. The
Wikipedia user could then either (a) glean whatever information she
could from the article and move on, (b) laugh uproariously, or (c) go
into edit mode, work to turn the machine translation into something
readable in Swahili, and save that version - which would then become the
baseline page for that entry in that language, from which future edits
could take off. In this way, you would get the best of both worlds -
good articles written in the actual language whenever possible, and
fingertip access to rough machine translations from English when
articles are not initially available in the target language.
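
The flow just described could be sketched roughly like this. Everything
here is hypothetical: the page store is a plain dict, and
fetch_english and machine_translate are placeholder callables standing
in for whatever would actually fetch the current English page and run
the translator.

```python
# Hypothetical sketch of the on-the-fly fallback described above.
def serve_page(title, sw_pages, fetch_english, machine_translate):
    """Return (text, origin) for a requested Swahili page."""
    if title in sw_pages:
        # A human-written or human-edited page always wins.
        return sw_pages[title], "human"
    # Otherwise translate the English page as it exists right now.
    # The draft is not stored, so it never goes stale.
    draft = machine_translate(fetch_english(title))
    return draft, "machine"

def save_edit(title, text, sw_pages):
    """Option (c): a cleaned-up translation becomes the baseline page."""
    sw_pages[title] = text
```

Because nothing is stored until a person saves an edit, the machine
never overwrites manual work, and the re-translation questions above
only apply to pages nobody has touched yet.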