What I wonder about TM is, how does it work with languages with
It's quite obvious TM works well for Russian, Italian, Spanish,
French, German, other languages of similar structure. I heard it also
works for Chinese, Japanese, Korean, Arabic, Farsi, Hebrew as well.
So my main questions are:
1) Can it handle languages which don't separate words in writing?
Examples are Thai, Lao, Japanese, Chinese, and a number of smaller
2) Can it handle languages of all typological classifications? So far
I have seen it works well for isolating (such as Chinese, Vietnamese)
and inflecting languages (such as Russian, Polish, Latin), but what
about agglutinative and polysynthetic languages (such as Inuktitut,
Turkish, Georgian, Adyghe, Abkhaz, Mohawk)? I would imagine it would be
more difficult
for these languages. For example, Western Greenlandic
"Aliikusersuillammassuaanerartassagaluarpaalli." means "However, they
will say that he is a great entertainer, but..." (for other long words
like this, just look at the Greenlandic Wikipedia, kl.wp).
3) Can it mass-process huge amounts of content quickly, to be reviewed
later by humans?
On 30/01/06, Sabine Cretella <sabine_cretella(a)yahoo.it> wrote:
Hi, please allow me to add my 2 cts here :-)
Danny mentioned in his response that a bot could do great work. Henna
remarked that Wikidata could make a difference. Milos mentioned that
data may need localisation. I want to remind you about an e-mail that
Sabine Cretella sent to the lists. Sabine is really active in the
Neapolitan Wikipedia, a project much younger than the Swahili
Wikipedia but already with 4336 articles. The secret of this success
is among other things that Sabine uses professional tools to translate
into the Neapolitan language. OmegaT, the software Sabine uses, is GPL
software and is what is called a CAT or Computer Aided Translation
tool. This allows for efficient translation and is /not/ the same
as Automated Translation.
I am using OmegaT for things I write in Neapolitan, for the simple
reason that during translation I get terminology proposed, and
sentences that were already translated, or similar ones, are offered
for re-use. This has nothing to do with Machine Translation. What is
re-used here is a glossary and the translations previously done.
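To make the difference from Machine Translation concrete, here is a minimal sketch of how a translation memory lookup can work. It is not OmegaT's actual matching algorithm: the function name `best_match`, the 0.7 threshold, and the placeholder target strings are all assumptions made for illustration. The comparison is done character by character with Python's standard `difflib`, so it does not rely on words being separated by spaces.

```python
from difflib import SequenceMatcher


def best_match(segment, memory, threshold=0.7):
    """Return (score, source, target) for the closest stored segment, or None.

    `memory` maps previously translated source sentences to their
    translations. The threshold is an arbitrary illustrative value.
    """
    best = None
    for source, target in memory.items():
        # Character-level similarity between 0.0 and 1.0.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best


# Hypothetical memory; the targets are placeholders, not real translations.
memory = {
    "Naples is a city in Campania.": "<stored translation 1>",
    "Campania is a region of Italy.": "<stored translation 2>",
}

exact = best_match("Naples is a city in Campania.", memory)
fuzzy = best_match("Salerno is a city in Campania.", memory)
miss = best_match("OmegaT runs on Linux.", memory)
```

Nothing is translated by the program itself: it only finds the stored sentence closest to the new one and hands its existing translation to the human translator to adapt.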
The cities of Campania were uploaded with the pywikipediabot
(pagefromfile.py). What is lacking here is documentation of the
individual bots and how they work, so that more people could learn how
to use them, since installing Python and running the bot is not soooo difficult.
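As a small piece of that missing documentation, here is a sketch of how an input file for pagefromfile.py could be prepared. It assumes the script's default markers, which as far as I know are `{{-start-}}` and `{{-stop-}}` around each page, with the page title taken from the first text in `'''bold'''`. The helper `build_upload_file` and the sample pages are hypothetical; check the bot's own help text for your version before relying on this.

```python
def build_upload_file(pages, start="{{-start-}}", stop="{{-stop-}}"):
    """Build the text of a pagefromfile-style upload file.

    `pages` maps a page title to the rest of its first sentence plus body.
    The markers are assumed defaults and can be overridden.
    """
    blocks = []
    for title, body in pages.items():
        # The bold title at the start of the page is what the bot is
        # assumed to read as the page name.
        blocks.append(f"{start}\n'''{title}''' {body}\n{stop}")
    return "\n".join(blocks) + "\n"


# Hypothetical sample content, for illustration only.
pages = {
    "Napoli": "is a city in Campania.",
    "Salerno": "is a city in Campania.",
}

upload_text = build_upload_file(pages)

# To use it for real, write the text to a file and pass that file to the bot.
# with open("cities.txt", "w", encoding="utf-8") as handle:
#     handle.write(upload_text)
```

The resulting file would then be handed to the bot with something like `python pagefromfile.py -file:cities.txt`; the exact options may differ between versions.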
Back to translations with OmegaT: the advantages for languages whose
community very often lacks stable Internet access are obvious:
translations can be planned, and people who can translate, but who
maybe would not be good writers, can translate offline and create
contents that way. Another person can take care of uploading the
translated articles, if necessary. Once there is something written,
others, who are more likely to be writers than translators, can
start editing and adapting, improving etc.
We have very different people around and not many of them are really good
writers. I myself can write about certain stuff, but that is more
marketing text than encyclopaedic text - so for me "translation" is the
easiest way to do things.
Steps to translate articles now:
1) Copy the text from the original-language wiki into a file. (This can
be one page or 50 pages - one file per wiki page or one file with
several wiki pages in it ... it does not really matter.)
2) Create the project with OmegaT (well, it must be on your computer
first, and you need Java on your computer as well).
3) Load the file to be translated.
4) Copy the glossaries you have for that language into the glossary
directory.
5) Copy any Translation Memories from previous projects into the
directory for TMs.
6) Reload the project.
7) Save to target.
8) Copy and paste the translated text to the wiki.
Now there are some more features that can help you - this is just to
"see" how it basically works ... then, like any application, it has more
features that can make your work easier.
People who are interested in using OmegaT for Wikipedia translations
should show up. I know that it is being used for contents on the scn
Wikipedia as well. Maybe it would make sense to create a work-group for
this tool and exchange our TMs or upload them to Commons or another
common place. Of course the Translation Memory could also be stored on a
wiki, but that is then another step ahead.
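Exchanging TMs is practical because OmegaT keeps them as TMX files, which are plain XML. The sketch below shows one way to read translation pairs from TMX text and merge several memories. The function names `read_tmx` and `merge_memories` and the tiny English/Italian sample are assumptions for illustration; real TMX files carry more metadata, and the language attribute may be `xml:lang` or `lang` depending on the TMX version, so both are checked.

```python
import xml.etree.ElementTree as ET

# Expanded name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(tmx_text, source_lang, target_lang):
    """Return {source segment: target segment} for one language pair."""
    pairs = {}
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):  # one translation unit per <tu>
        segs = {}
        for tuv in tu.findall("tuv"):  # one <tuv> per language variant
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang.lower()] = "".join(seg.itertext())
        if source_lang in segs and target_lang in segs:
            pairs[segs[source_lang]] = segs[target_lang]
    return pairs


def merge_memories(*memories):
    """Combine memories; later ones win when a source segment repeats."""
    merged = {}
    for memory in memories:
        merged.update(memory)
    return merged


# Hypothetical sample memory, for illustration only.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>City</seg></tuv>
<tuv xml:lang="it"><seg>Città</seg></tuv></tu>
</body></tmx>"""

shared = read_tmx(SAMPLE, "en", "it")
combined = merge_memories(shared, {"Region": "Regione"})
```

A work-group could keep such merged memories in one common place and each translator would copy them into the TM directory of a new project.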
When we have Wikidata ready for prime time, we will be able to store
structured data about one subject. This is not a full solution as many
of the words used in the presentation need to be translated, maybe
even localised to make sense in another language. I for instance
always have to think whether 9/11 is the ninth of November or the
eleventh of September; I only know which it is because of the event. In
order to present data,
labels have to be translated and data may have to be localised. The
WiktionaryZ project will help with the labels and standards like the
CLDR are what define how the localisation is to be done.
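The 9/11 example can be shown in a few lines. This is only a sketch: the `DATE_PATTERNS` table is a hand-written stand-in for what a standard like the CLDR would supply, not real CLDR data, and the locale codes are used purely as labels.

```python
from datetime import datetime

# Hand-written stand-in for locale data; not taken from the CLDR itself.
DATE_PATTERNS = {
    "en_US": "%m/%d/%Y",  # month first
    "en_GB": "%d/%m/%Y",  # day first
}

raw = "9/11/2001"

# The same string yields two different dates depending on the convention.
as_us = datetime.strptime(raw, DATE_PATTERNS["en_US"]).date()
as_gb = datetime.strptime(raw, DATE_PATTERNS["en_GB"]).date()

# Storing the date in an unambiguous form (ISO 8601) and formatting it
# per locale only at display time avoids the confusion.
stored = as_us.isoformat()
shown_in_gb = as_us.strftime(DATE_PATTERNS["en_GB"])
```

Here the stored value is the eleventh of September, and a reader using the day-first convention sees it as 11/09/2001.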
We are making steady progress with WiktionaryZ; the first alpha demo
project is at epov.org/wd-gemet/index.php/Main_Page (a read-only
project for now). There is a proposal for a project at
that intends to help
us where the standards prove to be not good enough. As Sabine is part
of the team behind OmegaT, it is being researched how OmegaT can read
and write directly to a Mediawiki project.
Well, see above, but even if it cannot do that now: it is very helpful
to work on it locally and offline. The only thing I'd say is important
is a note on the Wikipedia, Wikibooks, or whatever project of the target
language that this article is being translated by XYZ.
One other aspect that is needed in a new project is commitment. People
who express their support for a new language project should see this
as an indication of /their/ commitment and not as an expression of
their opinion. When people start to work on a new project it is
important that, as on the Neapolitan Wikipedia, there are people who
are knowledgeable and willing to help the newbies. I hope that the IRC
channel #wikipedia-bootcamp can serve a role for this as well.
Hmmm ... thanks for the compliment :-) but really there are so many
projects that do good :-) very often people just don't know about
existing tools or whatever, and often these could be the things that
make the difference.
I always like to take the African languages as an example, because they
are in perhaps the most difficult position:
People there often do not have the same possibilities we have ... well:
even if students don't have a stable Internet connection, they could well
benefit hugely from offline versions on a CD, or versions that can be
copied to a hard disk or wherever you can easily store and re-write data
for a PC. I know that often they do not even have computers like we have.
There is that 100 USD computer for schools ... well, this one would be a
good approach also for school kids in Europe ... from the first class
onwards (thinking about kids here where I live) ... but let's go back to
"offline use and work"
Well, considering those regions where "online" is a problem, students and
capable people who would like to see their language present in Wikipedia
could receive pages to translate - even a non-native speaker could take
care of uploading the pages. And there should be one person who
speaks/writes enough of that language to add pictures, sound etc. Then
the ready article can go back to them, and they can distribute it among
I know, now you say Wikipedia is an "online encyclopaedia" ... well, it
might be ... but does this mean that people who do not have stable
online access and who cannot afford it are second class and therefore
should not get information in their language? Why should they not be
allowed to work on such contents, read it, distribute it ...
Wikipedia is about giving people encyclopaedic information in their
language - well I see its scope much wider than that - but basically
that's it ... so we should give it to them, right?
OmegaT runs on Windows, Linux, OSX etc. so why not take up that possibility?
There would be so much more to write about these things .... well let's
take this thread as a good start :-) and let us develop strategies and
And there is my never-ending wish ... that day of at least 48 hours
Wikipedia-l mailing list
"Take away their language, destroy their souls." -- Joseph Stalin