[Foundation-l] Native Cherokee XML Dumps 20060619 Posted
Jeffrey V. Merkey
jmerkey at wolfmountaingroup.com
Sat Jun 24 05:27:11 UTC 2006
The Native Cherokee Language Translation Project has posted XML dumps
against enwiki 06-19-2006 at
ftp.wikigadugi.org/wiki
Please feel free to download and review. The translation uses verb
conjugation along with verb stem decomposition and reconstruction.
Translation runs are posted in both the Sequoyah Syllabary and text phonetics.
This release improves the XML parser to support
translation and auto-link generation, image translation parsing, and
templates. Because the English Wikipedia XML dumps appear to
contain multilanguage tags, this version of the translator
converts them first into English and then into Cherokee
for image control statements (right, rucht, etc.).
The www.wikigadugi.org website has been used for XML translation import
testing for the past several weeks while I corrected
and added support for link translations and tuned the AI engine to
compress, conjugate, and decompose and reconstruct verb stems and
tensing. The site is now fully populated and will be kept up to date from
now on.
The current translation is up to 92% Cherokee, with less than 20MB of
word list left to be translated and tensed.
There have been several enhancements to the translation to detect and
correct language drift between the various dialects.
There are four dialects of the Cherokee Language:
Otali (Overhill) - spoken in Oklahoma, 30,000 native speakers
Giduwa (Keetoowah) - spoken in North Carolina, 5,000 native speakers
Southern - formerly spoken in Southern Alabama, Southern Georgia and
Florida (Extinct)
Ahniyvwiya - spoken in New Mexico and Missouri (AniKutani), 500 native
speakers (this dialect is the ancient written
form of the Cherokee Language which uses the AniKutani Syllabary, and is
used by the religious organization
for record keeping. Since this dialect was written and numerous ancient
texts exist, the modern spoken form has not
experienced language drift. The Otali dialect has drifted due to
English influences and sentence structure in the
Otali dialect more resembles English than any of the other Cherokee
dialects).
Of the four dialects, only the Giduwa and Ahniyvwiya dialects are still
100% mappable to the Sequoyah Syllabary.
The Otali dialect is approximately 98% mappable; however, Otali has
contracted verb roots to the point that they are
no longer recognizable in their original form in many words, due to
language drift, newly synthesized sounds,
and hybridization with English.
"do" is now spoken "to" in many words in Otali
"du" is now spoken "tu" in many words in Otali
"l" has replaced "i" and, as a result, is a new construct in the language
in some Otali words
Many original inflections are now contracted and use English sounds
rather than Syllabary constructs.
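As a rough illustration of how drifted sounds like these could be mapped
back, the sketch below replaces drifted syllables with their older forms.
It assumes words are already split into phonetic syllables. The table
DRIFT_TO_ORIGINAL and the function restore_syllables are hypothetical
names, not part of the wikitrans translator, and only the two pairs named
above are included. Since the drift affects many words but not all, a real
pass would also need a word list to decide where the substitution applies.

```python
# Hypothetical table: drifted Otali syllable -> older syllable.
# Only the "to"/"do" and "tu"/"du" pairs mentioned above are included.
DRIFT_TO_ORIGINAL = {"to": "do", "tu": "du"}


def restore_syllables(syllables):
    """Return a new list with each drifted syllable replaced by its older form."""
    return [DRIFT_TO_ORIGINAL.get(s, s) for s in syllables]


# Arbitrary syllable sequence for illustration, not a real Cherokee word.
print(restore_syllables(["tu", "na", "to"]))
```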
The Cherokee New Testament translated by Elias Boudinot and his
associates in the early 1800s is one of the
few surviving documents written in the Giduwa dialect and published
before the language drift began in
Oklahoma, and this older dialect is the most common dialect still
understandable by most modern speakers who speak
in Otali. This dialect forms the basis of this translation, together with
common words from the Otali dialect
which are still mappable to the Syllabary and corrected words with the
original verb roots. This project has an
additional purpose of forcing standardization in our immersion efforts
to prevent further language drift by
restandardizing all written works into the Giduwa dialect and converting
non-conforming Otali words back to the Sequoyah Syllabary. Most of the
modern
Cherokee Language spoken in Oklahoma now uses an English Phonetics System
devised by Cherokee Linguist
Dr. Durbin Feeling of Oklahoma University in order to be written
properly to reflect the modern spoken form, and
is no longer mappable to the Syllabary.
Internal discussions with Dr. Feeling and other members of this project
have resulted in the conclusion that written Cherokee must use
the Sequoyah Syllabary and we are planning to force standardization to
correct this language drift in all translations. There have been
several committees proposing expanding the syllabary to incorporate the
new sounds, but these efforts may not address the issues of
language drift. When a language is written and forced to use a standard
alphabet or syllabary it typically does not drift much.
These translations now provide the following additional files, which
address these issues for our project folks and force
retranslation of Otali words into Ahniyvwiya or Giduwa dialects. Many
of the words have been resynthesized into
their original forms in order to be mappable to the Syllabary.
Each file set corresponds to a translation run of the wikitrans Cherokee
Machine Translator:
cherokee-syntax-errors-<date>.txt.bz2 - words in Otali dialect not
mappable to the syllabary but displayed in text phonetics in the translation
phwiki-<date>-pages-articles.xml - text phonetic translation
sylwiki-<date>-pages-articles.xml - Sequoyah Syllabary translation
untranslated-log-<date>.txt.bz2 - words not yet translated or mapped to
a thesaurus for their Cherokee equivalents
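For anyone scripting downloads, the naming pattern above can be expanded
for a given run date. This is only a sketch: it assumes <date> takes the
YYYYMMDD form used in the subject line (20060619), and the function name
run_file_names and its dictionary keys are made up for illustration.

```python
def run_file_names(date):
    """Build the four file names for one translation run (date assumed YYYYMMDD)."""
    return {
        "syntax_errors": f"cherokee-syntax-errors-{date}.txt.bz2",
        "phonetic": f"phwiki-{date}-pages-articles.xml",
        "syllabary": f"sylwiki-{date}-pages-articles.xml",
        "untranslated": f"untranslated-log-{date}.txt.bz2",
    }


# List the file names expected for the 20060619 run.
for name in run_file_names("20060619").values():
    print(name)
```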
The end goal is to reduce the untranslated log file output to zero and
achieve 100% translation by the end of the summer. There has been a lot of
debate on the translation and which dialect structure would have the broadest
coverage. Corrected Otali and Giduwa combined, which remap to the
syllabary, have been chosen as the current standard for this effort.
The machine translation is a work in progress.
Jeff Merkey