[Foundation-l] Swahili Machine Translation First Run Completed for enwiki-20060817

Jeffrey V. Merkey jmerkey at wolfmountaingroup.com
Tue Aug 29 07:50:52 UTC 2006


The first-pass machine translation run of the English Wikipedia into 
Swahili has completed and is posted.
The translated XML dumps are being posted to:

http://sw.wikigadugi.org

They will continue to post throughout the night.

Lexicons can be downloaded from:

ftp://www.wikigadugi.org/africa/lexicon/swlexicon.public.bz2 - public Swahili lexicon
ftp://www.wikigadugi.org/africa/lexicon/swlexicon.kamusi.bz2 - Kamusi Project lexicon
ftp://www.wikigadugi.org/africa/lexicon/sw.thesaurus.bz2 - Roget's Thesaurus in Swahili

MediaWiki Messages Files:

ftp://www.wikigadugi.org/africa/MediWiki/MessagesSW.php.bz2

Machine-translated XML dumps generated from the enwiki-20060817 XML 
dumps of the English Wikipedia:

ftp://www.wikigadugi.org/africa/xml/swphwiki-20060816-pages-articles.xml.bz2

This first run does NOT employ the verb stem decomposer and conjugator, 
does NOT employ the grammar parser or sentence composer, does NOT 
employ the AI inference engine, and does NOT perform verb or noun 
disambiguation, as the other machine translations do, because I have 
not yet constructed a decomposition rule set or grammar rule set for 
the translator.  This first run uses simple word-by-word translation 
and phrase matching with hierarchical thesaurus lookups and 
substitution.
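
As a rough illustration of what word-by-word translation with phrase 
matching and hierarchical thesaurus fallback can look like, here is a 
minimal Python sketch.  It is not the translator's actual code: the 
names LEXICON, THESAURUS, lookup_word and translate are hypothetical, 
the handful of entries are toy data rather than content from the 
lexicon files listed above, and the longest-match-first strategy is an 
assumption.

```python
# Illustrative sketch only. All names and data here are hypothetical and
# are not taken from the actual translator or its lexicon files.

# Toy lexicon: keys are tuples of English words (phrases or single words).
LEXICON = {
    ("good", "morning"): "habari za asubuhi",
    ("water",): "maji",
    ("book",): "kitabu",
    ("big",): "kubwa",
}

# Toy thesaurus: each English word points to related words to try next.
THESAURUS = {
    "large": ["big"],
    "huge": ["large"],
    "volume": ["book"],
}

MAX_PHRASE_LEN = max(len(key) for key in LEXICON)


def lookup_word(word, depth=3):
    """Translate one word, following thesaurus links up to `depth` levels."""
    if (word,) in LEXICON:
        return LEXICON[(word,)]
    if depth == 0:
        return None
    for synonym in THESAURUS.get(word, []):
        found = lookup_word(synonym, depth - 1)
        if found is not None:
            return found
    return None


def translate(text):
    """Substitute phrases first (longest match), then single words."""
    words = text.lower().split()
    out = []
    i = 0
    while i < len(words):
        # Try multi-word phrases, longest first.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 1, -1):
            phrase = tuple(words[i:i + n])
            if phrase in LEXICON:
                out.append(LEXICON[phrase])
                i += n
                break
        else:
            # No phrase matched: fall back to the single word, then to the
            # thesaurus, and finally leave the word untranslated.
            translated = lookup_word(words[i])
            out.append(translated if translated is not None else words[i])
            i += 1
    return " ".join(out)


print(translate("Good morning huge book"))
```

Because nothing here reorders words or handles verb stems, the output 
keeps English word order, which is the limitation the grammar and 
sentence parsing rules are meant to address.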

This first pass is provided as an illustration of just how rapidly 
Wikipedia can be translated into a target language.  A Swahili grammar 
manual has been overnighted to me, and later this week I will perform 
another run with grammar and sentence parsing rules.  Since I am not a 
native speaker of Swahili, I request that a native speaker select 20 or 
more very long articles and correct them.  Once I have completed the 
disambiguator and the grammar rule set for sentence construction, I 
will use the corrected articles to teach the AI engine how to reorder 
and retense the translations.  This should get the translations to over 
90% accuracy.  Compared with Cherokee, Swahili appears to be a much 
simpler language for this task.

The machine translation of Swahili is a VERY early first run and is a 
work in progress.

Jeffrey V. Merkey





