What I wonder about TM is how it works with languages of different structures.
It's quite obvious that TM works well for Russian, Italian, Spanish, French, German, and other languages of similar structure. I have heard it works for Chinese, Japanese, Korean, Arabic, Farsi, and Hebrew as well.
So my main questions are:
1) Can it handle languages which don't separate words in writing? Examples are Thai, Lao, Japanese, Chinese, and a number of smaller languages.
2) Can it handle languages of all typological classifications? So far I have seen it works well for isolating languages (such as Chinese, Vietnamese) and inflecting languages (such as Russian, Polish, Latin), but what about agglutinative and polysynthetic languages (such as Turkish, Georgian, Inuktitut, Adyghe, Abkhaz, Mohawk)? I would imagine it would be more difficult for these. For example, Western Greenlandic "Aliikusersuillammassuaanerartassagaluarpaalli." means "However, they will say that he is a great entertainer, but..." (for other long words like this, just look at the Greenlandic Wikipedia, kl.wp).
3) Can it mass-process huge amounts of content quickly, to be reviewed later by humans?
Mark
On 30/01/06, Sabine Cretella sabine_cretella@yahoo.it wrote:
Hi, please allow me to add my 2 cts here :-)
Danny mentioned in his response that a bot could do great work. Henna remarked that Wikidata could make a difference. Milos mentioned that data may need localisation. I want to remind you of an e-mail that Sabine Cretella sent to the lists. Sabine is really active in the Neapolitan Wikipedia, a project much younger than the Swahili Wikipedia but already with 4336 articles. The secret of this success is, among other things, that Sabine uses professional tools to translate into Neapolitan. OmegaT, the software Sabine uses, is GPL software and is what is called a CAT, or Computer Aided Translation, tool. It allows for efficient translation and is /not/ the same as Automated Translation.
I am using OmegaT for things I write in Neapolitan, for the simple fact that during translation I get terminology proposed, and already translated or similar sentences are suggested, etc. This has nothing to do with Machine Translation: what is re-used here is a glossary and the translations previously done.
The cities of Campania were uploaded with the pywikipediabot (pagefromfile.py). What is lacking here is documentation of the individual bots and how they work, so that more people could learn to use them; installing Python and running a bot is not so difficult.
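To make the bot step a bit more concrete: pagefromfile.py expects one input file in which each page is wrapped in start/stop delimiters, with the title marked in bold. A minimal sketch of building such a file (the article titles and texts below are invented placeholders, and the delimiters shown are the script's defaults; check your copy of the bot for the exact flags):

```python
# Sketch: build an input file for pywikipediabot's pagefromfile.py.
# {{-start-}} / {{-stop-}} mark page boundaries; by default the bot
# takes the page title from the first '''bolded''' text.
articles = {
    "ExampleTown": "'''ExampleTown''' is a town in Campania ...",
    "OtherTown": "'''OtherTown''' is a town in Campania ...",
}

with open("pages.txt", "w", encoding="utf-8") as f:
    for title, text in articles.items():
        f.write("{{-start-}}\n")
        f.write(text + "\n")
        f.write("{{-stop-}}\n")

# Then, from the pywikipedia directory:
#   python pagefromfile.py -file:pages.txt
```

One such file can carry one page or fifty, which is what made the mass upload of the Campania cities practical.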
Back to translations with OmegaT: the advantages for languages whose communities often lack stable Internet access are obvious. Translations can be planned; people who can translate, but who maybe would not be good writers, can translate offline and create content that way. Another person can take care of uploading the translated articles, if necessary. Once there is something written, others, who are more likely to be writers than translators, can start editing, adapting, improving, etc.
We have very different people around, and not many of them are really good writers. I myself can write about certain subjects, but that is more marketing text than encyclopaedic text, so for me "translation" is the easiest way to do things.
Steps to translate articles now:
1) Copy the text from the original-language wiki into a file. (This can be one page or 50 pages, one file per wiki page or one file with several wiki pages in it; it does not really matter.)
2) Create the project with OmegaT (it must be installed on your computer first, and you need Java as well).
3) Load the file to be translated.
4) Copy the glossaries you have for that language into the glossary directory.
5) Copy any Translation Memories from previous projects into the directory for TMs.
6) Reload the project.
7) Translate.
8) Save to target.
9) Copy and paste the translated text into the wiki.
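The steps above mostly come down to putting files in the right folders of an OmegaT project. A sketch of that layout, assuming OmegaT's default sub-folder names (the project name and file contents below are made up):

```python
# Sketch of the file layout behind the translation steps, using
# OmegaT's default project sub-folders.
import os

project = "sw-translation"  # hypothetical project name
for sub in ("source", "target", "glossary", "tm", "omegat"):
    os.makedirs(os.path.join(project, sub), exist_ok=True)

# Step 1: the wiki text to be translated goes into source/
with open(os.path.join(project, "source", "article.txt"),
          "w", encoding="utf-8") as f:
    f.write("Nairobi is the capital of Kenya.\n")

# Glossary files go into glossary/, translation memories (*.tmx)
# from earlier projects into tm/; OmegaT picks them up when the
# project is reloaded. Translated files end up in target/.
```

Nothing here is specific to one language pair; the same skeleton serves Neapolitan, Swahili, or any other project.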
There are some more features that can help you; this is just to "see" how it basically works. Like any application, it has more features that can ease your work.
People who are interested in using OmegaT for Wikipedia translations should show up. I know that it is being used for content on the scn Wikipedia as well. Maybe it would make sense to create a work group for this tool and exchange our TMs, or upload them to Commons or another common place. Of course the translation memory could also be stored on a wiki, but that is then another step ahead.
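Exchanging TMs is practical because OmegaT stores them in TMX, a standard XML format, so a memory shared on Commons could be dropped into anyone's tm/ directory. A minimal sketch of what such a file contains and how it can be read (the Neapolitan sentence is an invented example, not a verified translation):

```python
# Sketch: a minimal TMX translation memory and how to read the
# segment pairs out of it with the standard library.
import xml.etree.ElementTree as ET

tmx = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0.1"
          segtype="sentence" o-tmf="none" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Naples is a city in Italy.</seg></tuv>
      <tuv xml:lang="nap"><seg>Napule (example segment)</seg></tuv>
    </tu>
  </body>
</tmx>"""

root = ET.fromstring(tmx)
pairs = []
for tu in root.iter("tu"):
    segs = [tuv.find("seg").text for tuv in tu.findall("tuv")]
    pairs.append(tuple(segs))
print(pairs)
```

Because the format is plain XML, storing such files on Commons, or even on a wiki page, is straightforward.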
When we have Wikidata ready for prime time, we will be able to store structured data about one subject. This is not a full solution, as many of the words used in the presentation need to be translated, maybe even localised, to make sense in another language. I for instance always have to think about whether 9/11 is the ninth of November or the eleventh of September; I only know because of the event. In order to present data, labels have to be translated, and data may have to be localised. The WiktionaryZ project will help with the labels, and standards like the CLDR define how the localisation is to be done.
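The 9/11 ambiguity above is exactly the kind of thing locale-aware formatting has to solve. A minimal illustration with two hand-picked patterns (in practice the CLDR defines the correct pattern per locale; the patterns here are fixed by hand just to show the problem):

```python
# The same date rendered with a US-style and a European-style pattern.
from datetime import date

d = date(2001, 9, 11)
us = d.strftime("%m/%d/%Y")  # month first: 09/11/2001
eu = d.strftime("%d/%m/%Y")  # day first: 11/09/2001
print(us, eu)
```

Stored as a proper date, the value is unambiguous; only the per-locale presentation differs, which is why structured data plus CLDR-style localisation is the right split.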
We are making steady progress with WiktionaryZ; the first alpha demo project is at epov.org/wd-gemet/index.php/Main_Page (a read-only project for now). There is a proposal for a project at http://meta.wikimedia.org/wiki/Wiki_for_standards that intends to help us where the standards prove not good enough. As Sabine is part of the team behind OmegaT, it is being researched how OmegaT could read from and write directly to a MediaWiki project.
Well, see above; but even if it cannot do that now, it is very helpful to work on it locally and offline. The only thing I'd say is important is a note on the Wikipedia, Wikibooks, or whatever project of the target language that this article is being translated by XYZ.
One other aspect that is needed in new projects is commitment. People who express their support for a new language project should see this as an indication of /their/ commitment and not as an expression of their opinion. When people start to work on a new project, it is important that, as on the Neapolitan Wikipedia, there are people who are knowledgeable and willing to help the newbies. I hope that the IRC channel #wikipedia-bootcamp can serve a role for this as well.
Hmmm ... thank you for the kind words :-) but really, there are so many projects that do good work :-) Very often people just don't know about existing tools, and often these could be the things that make the difference.
I always like to take the African languages as an example, because they are in perhaps the most difficult position: people there often do not have the same possibilities we have. Even if students don't have a stable Internet connection, they could benefit hugely from offline versions on a CD, or versions that can be copied to a hard disk or wherever you can easily store and rewrite data for a PC. I know that often they do not even have computers like ours. There is that 100 USD computer for schools; this would be a good approach for school kids in Europe as well, from the first class onwards (thinking about the kids here where I live). But let's go back to offline use and work.
Considering those regions where being online is a problem: students and capable people who would like to see their language present in Wikipedia could be given pages to translate, and even a non-native speaker could take care of uploading the pages. There should also be one person who speaks/writes enough of that language to add pictures, sound, etc. Then the finished article can go back to them, and they can distribute it among people.
I know, now you will say Wikipedia is an "online encyclopaedia". Well, it might be, but does this mean that people who do not have stable online access, and who cannot afford it, are second class and therefore should not get information in their language? Why should they not be allowed to work on such content, read it, and distribute it?
Wikipedia is about giving people encyclopaedic information in their language. I see its scope as much wider than that, but basically that's it; so we should give it to them, right?
OmegaT runs on Windows, Linux, OS X, etc., so why not take up that possibility?
There would be so much more to write about these things ... well, let's take this thread as a good start :-) and let us develop strategies and work together.
Best,
Sabine
*****and there is my never ending wish ... that day of at least 48 hours ...*****
-- "Take away their language, destroy their souls." -- Joseph Stalin