The Native Cherokee Language Translation Project has posted XML dumps against enwiki 06-19-2006 at
ftp.wikigadugi.org/wiki
Please feel free to download and review. This translation uses conjugation and verb stem decomposition and reconstruction. Translation runs are posted in the Sequoyah Syllabary and in text phonetics. This release improves the XML parser to support translation and auto-link generation, image translation parsing, and templates. Since the English Wikipedia XML dumps appear to support multilanguage tags, this version of the translator converts them into English and then Cherokee for image control statements (right, rucht, etc.).
The www.wikigadugi.org website has been used for XML translation import testing for the past several weeks while I corrected and added support for link translations and tuned the AI engine to compress, conjugate, decompose, and reconstruct verb stems and tensing. The site is now fully populated and will remain updated from now on.
The current translation is up to 92% Cherokee, with a word list of less than 20MB left to be translated and tensed.
There have been several enhancements to the translation to detect and correct language drift between the various dialects.
There are four dialects of the Cherokee Language:
Otali (Overhill) - spoken in Oklahoma, 30,000 native speakers
Giduwa (Keetoowah) - spoken in North Carolina, 5,000 native speakers
Southern - formerly spoken in Southern Alabama, Southern Georgia and Florida (Extinct)
Ahniyvwiya - spoken in New Mexico and Missouri (AniKutani), 500 native speakers
The Ahniyvwiya dialect is the ancient written form of the Cherokee Language which uses the AniKutani Syllabary, and is used by the religious organization for record keeping. Since this dialect was written and numerous ancient texts exist, the modern spoken form has not experienced language drift. The Otali dialect has drifted due to English influences, and sentence structure in the Otali dialect more resembles English than any of the other Cherokee dialects.
Of the four dialects, only the Giduwa and Ahniyvwiya dialects are still 100% mappable to the Sequoyah Syllabary. The Otali dialect is approximately 98% mappable; however, Otali has contracted verb roots to the point that they are no longer recognizable in their original form in many words, due to language drift, newer synthesized sounds, and hybridization with English.
"do" is now spoken "to" in many words in Otali
"du" is now spoken "tu" in many words in Otali
"l" has replaced "i" and in some words is a new construct in the language as a result in Otali
Many original inflections are now contracted and use English sounds rather than Syllabary constructs.
The Cherokee New Testament, translated by Elias Boudinot and his associates in the early 1800s, is one of the few surviving documents written in the Giduwa dialect and published before the language drift began in Oklahoma. This older dialect is the one most commonly still understood by modern speakers who speak Otali. It forms the basis of this translation, together with common words from the Otali dialect which are still mappable to the Syllabary and corrected words with the original verb roots. This project has an additional purpose of forcing standardization in our immersion efforts to prevent further language drift, by restandardizing all written works into the Giduwa dialect and converting non-conforming Otali words back to the Sequoyah Syllabary. Most of the modern Cherokee Language spoken in Oklahoma now uses an English Phonetics System devised by Cherokee Linguist Dr. Durbin Feeling of Oklahoma University in order to be written properly to reflect the modern spoken form, and is no longer mappable to the Syllabary.
Internal discussions with Dr. Feeling and other members of this project have resulted in the conclusion that written Cherokee must use the Sequoyah Syllabary and we are planning to force standardization to correct this language drift in all translations. There have been several committees proposing expanding the syllabary to incorporate the new sounds, but these efforts may not address the issues of language drift. When a language is written and forced to use a standard alphabet or syllabary it typically does not drift much.
These translations now provide the following additional files, which address these issues for our project folks and force retranslation of Otali words into the Ahniyvwiya or Giduwa dialects. Many of the words have been resynthesized into their original forms in order to be mappable to the Syllabary:
Each file set corresponds to a translation run of the wikitrans Cherokee Machine Translator.
cherokee-syntax-errors-<date>.txt.bz2 - words in Otali dialect not mappable to the syllabary but displayed in text phonetics in the translation
phwiki-<date>-pages-articles.xml - text phonetic translation
sylwiki-<date>-pages-articles.xml - Sequoyah Syllabary translation
untranslated-log-<date>.txt.bz2 - words not yet translated or mapped to a thesaurus for their Cherokee equivalents
The end goal is to reduce the untranslated log file output to zero and achieve 100% translation by the end of the summer. There has been a lot of debate on the translation and which dialect structure would have the broadest coverage. Corrected Otali and Giduwa combined, which remap to the syllabary, have been chosen as the current standard for this effort.
The Machine translation is a work in progress.
Jeff Merkey
I have received several inquiries and requests from Wikimedia community members for release of the Cherokee lexicons for incorporation into the Wiktionary and several adjunct projects which support Wiktionary projects for the Native Cherokee Language.
Cherokee Language Lexicons have been released and posted to
ftp.wikigadugi.org/wiki/lexicon
otali-20060619.lex.bz2
giduwa-20060619.lex.bz2
The released lexicons are in text format and follow the format
<english>:Cherokee
for individual words.
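For anyone wanting to load these files into another tool, the following is a minimal Python sketch of reading one of the lexicons. It is not part of the wikitrans translator. It assumes one entry per line, a colon separating the English word from the Cherokee word, UTF-8 encoding, and that the angle brackets in the format above are notation rather than literal characters (they are stripped if present). The function names parse_lexicon_line and load_lexicon are made up for this example.

```python
import bz2


def parse_lexicon_line(line):
    """Split one 'english:Cherokee' line into a pair, or return None.

    Assumption: the first colon on the line is the separator.
    """
    line = line.strip()
    if not line or ":" not in line:
        return None
    english, cherokee = line.split(":", 1)
    english = english.strip()
    # Assumption: angle brackets around the English word are optional notation.
    if english.startswith("<") and english.endswith(">"):
        english = english[1:-1].strip()
    cherokee = cherokee.strip()
    if not english or not cherokee:
        return None
    return english, cherokee


def load_lexicon(path, encoding="utf-8"):
    """Read a .lex.bz2 file into a dict: lowercased English word -> list of Cherokee words."""
    entries = {}
    with bz2.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            parsed = parse_lexicon_line(line)
            if parsed is not None:
                english, cherokee = parsed
                # Keep every Cherokee entry in case an English word appears more than once.
                entries.setdefault(english.lower(), []).append(cherokee)
    return entries
```

A call such as load_lexicon("giduwa-20060619.lex.bz2") would then give a dictionary suitable for a first pass at Wiktionary entries. Lines that do not fit the assumed format are skipped rather than guessed at.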
These lexicons do not contain the verb parsing and decomposition rules for pure translation of the 14 tenses, and are simply dictionaries of common words used by most modern speakers. Complex sentence construction requires the AI inference engine, which reorders English sentences into Cherokee constructs, then synthesizes the verb stem and pronoun modifiers for each phrase. These lexicons are, however, an ideal beginning for construction of a Wiktionary for the Cherokee Language.
Jeff V. Merkey
wikimedia-l@lists.wikimedia.org