[Foundation-l] Swahili Machine Translation First Run Completed for enwiki-20060817
Jeffrey V. Merkey
jmerkey at wolfmountaingroup.com
Tue Aug 29 07:50:52 UTC 2006
The first-pass machine translation run of the English Wikipedia into the
Swahili language has completed and is posted.
The translated XML dumps are being posted to:
http://sw.wikigadugi.org
They will continue to post throughout the night.
Lexicons can be downloaded from:
ftp://www.wikigadugi.org/africa/lexicon/swlexicon.public.bz2 - public Swahili lexicon
ftp://www.wikigadugi.org/africa/lexicon/swlexicon.kamusi.bz2 - Kamusi Project lexicon
ftp://www.wikigadugi.org/africa/lexicon/sw.thesaurus.bz2 - Roget's Thesaurus in Swahili
MediaWiki Messages Files:
ftp://www.wikigadugi.org/africa/MediWiki/MessagesSW.php.bz2
Machine-translated XML dumps, run against the enwiki-20060817 XML dumps
from the English Wikipedia:
ftp://www.wikigadugi.org/africa/xml/swphwiki-20060816-pages-articles.xml.bz2
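For anyone who wants to inspect the dump without unpacking it first, here
is a minimal sketch of streaming pages out of a bz2-compressed XML dump.
It assumes the file follows the usual MediaWiki export layout (page
elements containing a title and a revision text); the function names are
made up for illustration and are not part of the translator.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (title, text) pairs from an open XML dump stream.

    Assumes MediaWiki-style <page>, <title> and <text> elements.
    """
    title, text = "", ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            elem.clear()  # release the finished page's children


def iter_dump(path: str) -> Iterator[Tuple[str, str]]:
    """Open a .bz2 dump on disk and stream its pages."""
    with bz2.open(path, "rb") as handle:
        yield from iter_pages(handle)
```

Calling iter_dump() on the downloaded file and printing the first few
titles is a quick way to sample the output.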
This first run does NOT employ the verb stem decomposer and conjugator,
does NOT employ the grammar parser or sentence composer, does NOT
employ the AI inference engine, and does not perform verb or noun
disambiguation as the other machine translations do, because I have not
yet constructed a decomposition rule set or grammar rule set for the
translator. This first run uses simple word-by-word translation and
phrase matching, with hierarchical thesaurus lookups and substitution.
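To make the word-by-word approach described above concrete, here is a
rough sketch of phrase matching followed by single-word lookup with a
thesaurus fallback. It is an illustration only, not the translator's
actual code: the tiny lexicon, phrase table and thesaurus hierarchy are
hypothetical stand-ins, and the real lexicon file formats are not
assumed here.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data, for illustration only.
LEXICON: Dict[str, str] = {
    "water": "maji",
    "book": "kitabu",
    "big": "kubwa",
}
PHRASES: Dict[Tuple[str, ...], str] = {
    ("thank", "you"): "asante",
}
# Each entry points to a broader term, forming a simple hierarchy.
THESAURUS_PARENT: Dict[str, str] = {
    "huge": "large",
    "large": "big",
    "volume": "book",
}
MAX_PHRASE_LEN = max(len(key) for key in PHRASES)


def lookup_word(word: str, max_depth: int = 5) -> str:
    """Translate one word, climbing the thesaurus until the lexicon hits."""
    current = word
    for _ in range(max_depth + 1):
        if current in LEXICON:
            return LEXICON[current]
        if current not in THESAURUS_PARENT:
            break
        current = THESAURUS_PARENT[current]
    return word  # no translation found: leave the source word in place


def translate(text: str) -> str:
    """Longest-phrase match first, then word-by-word substitution."""
    tokens: List[str] = text.lower().split()
    out: List[str] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try multi-word phrases, longest first (minimum length 2).
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 1, -1):
            key = tuple(tokens[i:i + n])
            if key in PHRASES:
                out.append(PHRASES[key])
                i += n
                matched = True
                break
        if not matched:
            out.append(lookup_word(tokens[i]))
            i += 1
    return " ".join(out)
```

With this toy data, translate("huge book") gives "kubwa kitabu": each
word is substituted in English order, with no reordering or noun-class
agreement. That is the kind of gap the grammar and sentence rules are
meant to close.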
This first pass is provided as an illustration of just how rapidly
Wikipedia can be translated into a target language. A Swahili grammar
manual has been overnighted to me, and later this week I will perform
another run with grammar and sentence parsing rules. Since I am not a
native speaker of Swahili, I request that a native speaker select 20 or
more very long articles and correct them. Once I have completed the
disambiguator and the grammar rule set for sentence construction, I will
use the corrected articles to teach the AI engine how to reorder and
retense the translations. This should bring the translations to over 90%
accuracy. Unlike Cherokee, Swahili appears to be a much simpler language
for this task.
The machine translation of Swahili is a VERY early first run and is a
work in progress.
Jeffrey V. Merkey