[Foundation-l] Native Cherokee XML Dumps 20060619 Posted

Jeffrey V. Merkey jmerkey at wolfmountaingroup.com
Sat Jun 24 05:27:11 UTC 2006


The Native Cherokee Language Translation Project has posted XML dumps 
against enwiki 06-19-2006 at

ftp.wikigadugi.org/wiki

Please feel free to download and review. This translation uses 
conjugation, and verb stem decomposition and reconstruction.
Translation runs are posted in Sequoyah Syllabary and text phonetics.
This release improves the XML parser to support
translation and auto-link generation, image translation parsing, and 
templates. Since the English Wikipedia XML dumps appear to
support multilanguage tags, this version of the translator has 
been enabled to convert them into English and then Cherokee
for image control statements (right, rucht, etc.).

The www.wikigadugi.org website has been used for XML translation import 
testing for the past several weeks while I corrected
and added support for link translations and tuned the AI engine to 
compress, conjugate, decompose, and reconstruct verb stems and
tensing. The site is now fully populated and will remain updated from 
now on.

The current translation is up to 92% Cherokee, with a word list of less 
than 20MB left to be translated and tensed.

There have been several enhancements to the translation to detect and 
correct language drift between the various dialects. 

There are four dialects of the Cherokee Language:

Otali (Overhill) - spoken in Oklahoma, 30,000 native speakers
Giduwa (Keetoowah) - spoken in North Carolina, 5,000 native speakers
Southern - formerly spoken in Southern Alabama, Southern Georgia and 
Florida (extinct)
Ahniyvwiya - spoken in New Mexico and Missouri (AniKutani), 500 native 
speakers

The Ahniyvwiya dialect is the ancient written form of the Cherokee 
Language, which uses the AniKutani Syllabary and is used by the 
religious organization for record keeping.  Since this dialect was 
written and numerous ancient texts exist, the modern spoken form has not
experienced language drift.  The Otali dialect has drifted due to 
English influences, and sentence structure in the
Otali dialect more resembles English than any of the other Cherokee 
dialects.
 
Of the four dialects, only the Giduwa and Ahniyvwiya dialects are still 
100% mappable to the Sequoyah Syllabary.
The Otali dialect is approximately 98% mappable; however, Otali has 
contracted verb roots to the point that they are
no longer recognizable in their original form in many words, due to 
language drift, synthesized newer sounds,
and hybridization with English. 

"do" is now spoken "to" in many words in Otali
"du" is now spoken "tu" in many words in Otali
"l" has replaced "i" in Otali and, as a result, is a new construct in 
the language in some words
Many original inflections are now contracted and use English sounds 
rather than Syllabary constructs.
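
As a rough illustration of what "mappable to the Syllabary" can mean for a word written in text phonetics, here is a minimal Python sketch. It is not the wikitrans translator's actual method, which is not described in this post. The function name, the table name, and the small syllable subset are assumptions chosen for illustration; the character names come from the Unicode Cherokee block, which has letters named DO and DU but none named TO or TU, so a word spoken with "tu" is reported as unmappable.

```python
import unicodedata

# Illustrative subset of romanized syllables (an assumption for this sketch);
# the full Sequoyah Syllabary contains many more characters.
SYLLABLES = [
    "a", "e", "i", "o", "u", "v",
    "ga", "ka", "ge", "gi", "go", "gu", "gv",
    "da", "ta", "de", "te", "di", "ti", "do", "du", "dv",
    "la", "le", "li", "lo", "lu", "lv",
]

# Build the table from official Unicode character names,
# e.g. "do" -> the character named "CHEROKEE LETTER DO".
TABLE = {s: unicodedata.lookup("CHEROKEE LETTER " + s.upper()) for s in SYLLABLES}
MAX_LEN = max(len(s) for s in SYLLABLES)


def to_syllabary(word):
    """Return the syllabary spelling of a phonetic word, or None if some
    part of the word has no matching syllable in TABLE."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        # Try the longest candidate syllable first, then shorter ones.
        for size in range(MAX_LEN, 0, -1):
            chunk = word[i:i + size]
            if chunk in TABLE:
                out.append(TABLE[chunk])
                i += len(chunk)
                break
        else:
            return None  # no syllable matches at this position
    return "".join(out)


if __name__ == "__main__":
    print(to_syllabary("gadugi"))  # ga + du + gi are all in the table
    print(to_syllabary("gatugi"))  # "tu" is not in the table
```

A word for which the function returns None would be a candidate for a syntax-error style report rather than a syllabary rendering.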

The Cherokee New Testament translated by Elias Boudinot and his 
associates in the early 1800s is one of the
few surviving documents written in the Giduwa dialect and published 
before the language drift began in
Oklahoma, and this older dialect is the most common dialect still 
understandable by most modern speakers who speak
in Otali.  This dialect forms the basis of this translation, together 
with common words from the Otali dialect
which are still mappable to the Syllabary and words corrected to their 
original verb roots.  This project has the
additional purpose of forcing standardization in our immersion efforts 
to prevent further language drift by
restandardizing all written works into the Giduwa dialect and converting
non-conforming Otali words back to the Sequoyah Syllabary.  Most of the 
modern Cherokee Language spoken in Oklahoma now uses an English 
Phonetics System devised by Cherokee Linguist
Dr. Durbin Feeling of Oklahoma University in order to be written 
properly to reflect the modern spoken form, and
is no longer mappable to the Syllabary.

Internal discussions with Dr. Feeling and other members of this project 
have resulted in the conclusion that written Cherokee must use
the Sequoyah Syllabary and we are planning to force standardization to 
correct this language drift in all translations.  There have been
several committees proposing expanding the syllabary to incorporate the 
new sounds, but these efforts may not address the issues of
language drift.  When a language is written and forced to use a standard 
alphabet or syllabary it typically does not drift much.

These translations now provide the following additional files, which 
address these issues for our project folks and force
retranslation of Otali words into the Ahniyvwiya or Giduwa dialects.  
Many of the words have been resynthesized into
their original forms in order to be mappable to the Syllabary.

Each file set corresponds to a translation run of the wikitrans Cherokee 
Machine Translator:

cherokee-syntax-errors-<date>.txt.bz2  - words in Otali dialect not 
mappable to the syllabary but displayed in text phonetics in the translation
phwiki-<date>-pages-articles.xml - text phonetic translation
sylwiki-<date>-pages-articles.xml - Sequoyah Syllabary translation
untranslated-log-<date>.txt.bz2 - words not yet translated or mapped to 
a thesaurus for their Cherokee equivalents
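
For anyone scripting downloads, the naming pattern above can be expanded for a given run with a few lines of Python. This is only a sketch: the function name is hypothetical, and the date format (YYYYMMDD, as in the subject line "20060619") is an assumption about what <date> stands for.

```python
def file_set(date):
    """Return the four file names listed above for one translation run.

    `date` is assumed to be a YYYYMMDD string such as "20060619".
    """
    return [
        f"cherokee-syntax-errors-{date}.txt.bz2",
        f"phwiki-{date}-pages-articles.xml",
        f"sylwiki-{date}-pages-articles.xml",
        f"untranslated-log-{date}.txt.bz2",
    ]


if __name__ == "__main__":
    for name in file_set("20060619"):
        print(name)
```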

The end goal is to reduce the untranslated log file output to zero and 
achieve 100% translation by the end of the summer.  There has been a lot 
of debate on the translation and which dialect structure would have the 
broadest coverage.  Corrected Otali and Giduwa combined, both of which 
remap to the syllabary, have been chosen as the current standard for 
this effort.

The machine translation is a work in progress.

Jeff Merkey



