Re: [Wikipedia-l] RFC: Principles of mass content adding on small Wikipedias

25 Feb 2006

What I wonder about TM is, how does it work with languages with
different structures?

It's quite obvious that TM works well for Russian, Italian, Spanish,
French, German, and other languages of similar structure. I have heard
it also works for Chinese, Japanese, Korean, Arabic, Farsi, and Hebrew.

So my main questions are:

1) Can it handle languages which don't separate words in writing?
Examples are Thai, Lao, Japanese, Chinese, and a number of smaller
languages.

2) Can it handle languages of all typological classifications? So far
I have seen it works well for isolating (such as Chinese, Vietnamese)
and inflecting languages (such as Russian, Polish, Latin), but what
about agglutinative and polysynthetic languages (such as Inuktitut,
Turkish, Georgian, Adyghe, Abkhaz, Mohawk)? I would imagine it would be
more difficult for these languages. For example, Western Greenlandic
"Aliikusersuillammassuaanerartassagaluarpaalli." means "However, they
will say that he is a great entertainer, but..." (for other long words
like this, just look at the Greenlandic Wikipedia, kl.wp).

3) Can it mass-process huge amounts of content quickly, to be reviewed
later by humans?

Mark

On 30/01/06, Sabine Cretella <sabine_cretella(a)yahoo.it> wrote:
...
  Hi, please allow me to add my 2 cts here :-)

 Danny mentioned in his response that a bot could do great work. Henna
 did remark that Wikidata could make a difference. Milos mentions that
 data may need localisation. I want to remind you about an e-mail that
 Sabine Cretella sent to the lists. Sabine is really active in the
 Neapolitan Wikipedia. A project much younger than the Swahili
 wikipedia but already with 4336 articles. The secret of this success
 is among other things that Sabine uses professional tools to translate
 into the Neapolitan language. OmegaT, the software Sabine uses, is GPL
 software and is what is called a CAT or Computer Aided Translation
 tool. This allows for an efficient translation and is /not/ the same
 as Automated Translation.
 I am using OmegaT for things I write in Neapolitan - for the simple
 fact, that during translation I get terminology proposed, already
 translated or similar sentences are proposed etc. This has nothing to do
 with Machine Translation. What is re-used here is a glossary and the
 translations previously done.
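
 To make the "similar sentences are proposed" part concrete, here is a
 minimal sketch of a translation-memory lookup. It is not how OmegaT
 itself is implemented; it only illustrates the idea of fuzzy matching
 against earlier translations. The function name, the 0.7 threshold and
 the placeholder targets are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_matches(segment, memory, threshold=0.7):
    """Return (score, source, target) tuples for remembered segments
    whose source text resembles `segment`, best match first."""
    scored = []
    for source, target in memory.items():
        # ratio() is 1.0 for identical strings and falls towards 0.0
        # as the two strings share fewer characters in sequence.
        score = SequenceMatcher(None, segment, source).ratio()
        if score >= threshold:
            scored.append((score, source, target))
    return sorted(scored, reverse=True)


# Hypothetical memory: source sentence -> earlier translation.
# The targets are placeholders, not real Neapolitan text.
memory = {
    "Naples is a city in Campania.": "<earlier translation 1>",
    "Salerno is a city in Campania.": "<earlier translation 2>",
    "The bot uploads pages.": "<earlier translation 3>",
}

for score, source, target in best_matches("Avellino is a city in Campania.", memory):
    print(f"{score:.2f}  {source}  ->  {target}")
```

 The comparison here is done character by character, so nothing is
 translated automatically: the translator only sees earlier work that
 looks close to the new sentence and decides what to reuse.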

 The cities of Campania were uploaded with the pywikipediabot
 (pagefromfile.py). What is lacking here: a documentation of the single
 bots and how they work - so more people could learn how to use them
 since installing python and running the bot is not soooo difficult.
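
 As a small contribution to that missing documentation, this is a sketch
 of how an input file for pagefromfile.py can be put together. It assumes
 the script's default markers ({{-start-}} and {{-stop-}}) and its
 default convention of reading the page title from the first text in
 '''bold'''; please check these against the copy of the bot you have.
 The helper name and the sample pages are made up for illustration.

```python
def build_upload_file(pages, start="{{-start-}}", end="{{-stop-}}"):
    """Join (title, text) pairs into one string in the layout that
    pagefromfile.py is assumed to read by default."""
    chunks = []
    for title, text in pages:
        # The title is written in bold at the top of the page text,
        # because the bot is assumed to take the page name from there.
        chunks.append(f"{start}\n'''{title}''' {text}\n{end}\n")
    return "".join(chunks)


# Hypothetical sample data; real page texts would come from the translator.
pages = [
    ("Example town A", "is a town in Campania."),
    ("Example town B", "is another town in Campania."),
]

with open("campania.txt", "w", encoding="utf-8") as handle:
    handle.write(build_upload_file(pages))
```

 The resulting file would then be handed to the bot, for example with
 "python pagefromfile.py -file:campania.txt"; the exact options depend
 on the bot version, so look at the script's own help first.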

 Back to translations with OmegaT: the advantage for languages where
 the community very often lacks stable Internet access is obvious:
 translations can be planned, and people who can translate, but who
 maybe would not be good writers, can translate offline and create
 contents that way. Another person can take care of uploading the
 translated articles, if necessary. Once there is something written,
 others, who are more likely to be writers than translators, can
 start editing and adapting, improving etc.

 We have very different people around and not many of them are really good
 writers. I myself can write about certain stuff, but that is more
 marketing text than encyclopaedic text - so for me "translation" is the
 easiest way to do things.

 Steps to translate articles now:
 1) Copy the text from the original language wiki into a file. (This can
    be one page or 50 pages - one file per wiki page or one file with
    several wiki pages in there ... it does not really matter)
 2) Create the project with OmegaT (well, it must be on your computer
    first and you need Java on your computer as well)
 3) Load the file to be translated
 4) Copy the glossaries you have for that language into the glossary
    directory.
 5) Copy any Translation Memories from previous projects into the
    directory for TMs
 6) Reload the project
 7) Translate
 8) Save to target
 9) Copy and paste the translated text to the wiki
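
 The copying steps above can also be scripted. The following is only a
 sketch: it assumes the project uses OmegaT's usual folder names
 (source, glossary, tm) and that the project itself has been created
 with OmegaT, since the program writes its own project file. The
 function name and all file names are hypothetical.

```python
from pathlib import Path
import shutil


def stage_files(project_dir, sources=(), glossaries=(), memories=()):
    """Copy source texts, glossaries and translation memories into the
    matching sub-folders of a project directory. Returns the new paths."""
    project = Path(project_dir)
    plan = {"source": sources, "glossary": glossaries, "tm": memories}
    copied = []
    for folder, files in plan.items():
        destination = project / folder
        # Create the folder if it is missing; existing content is kept.
        destination.mkdir(parents=True, exist_ok=True)
        for path in files:
            copied.append(Path(shutil.copy2(path, destination)))
    return copied
```

 A call could look like
 stage_files("nap-project", ["cities.txt"], ["glossary.txt"], ["old.tmx"]);
 after that the project is reloaded in OmegaT so the new files show up.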

 Now there are some more features that can help you - this is just to
 "see" how it basically works ... then, like any application, it has more
 features that can make your work easier.

 People who are interested in using OmegaT for Wikipedia translations
 should show up. I know that it is being used for contents on scn
 wikipedia as well. Maybe it would make sense to create a work-group for
 this tool and exchange our TMs or upload them to Commons or another
 common place. Of course the Translation Memory could also be stored on a
 wiki, but that is then another step ahead.
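
 For exchanging TMs, the usual container is TMX, an XML format for
 translation memories. Below is a minimal sketch that writes sentence
 pairs into a TMX 1.4 style document with Python's standard library.
 The header values, the function name and the sample pair are
 assumptions for illustration, and whether a particular OmegaT version
 accepts exactly this header should be checked before relying on it.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this qualified name as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Return a TMX document (as a string) holding (source, target) pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        unit = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            variant = ET.SubElement(unit, "tuv", {XML_LANG: lang})
            ET.SubElement(variant, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Placeholder pair; the target is not real Neapolitan text.
document = build_tmx([("Naples is a city.", "<translation>")], "en", "nap")
```

 A file written this way is plain text, so it could be attached to a
 mail, kept on a CD, or uploaded to whatever common place the
 work-group agrees on.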

 When we have Wikidata ready for prime time, we will be able to store
 structured data about one subject. This is not a full solution as many
 of the words used in the presentation need to be translated, maybe
 even localised to make sense in another language. I for instance
 always have to think if 9/11 is the ninth of November or the eleventh
 of September; I do know of it for the event. In order to present data,
 labels have to be translated and data may have to be localised. The
 WiktionaryZ project will help with the labels and standards like the
 CLDR are what define how the localisation is to be done.

 We are making steady progress with WiktionaryZ, the first alpha demo
 project is at epov.org/wd-gemet/index.php/Main_Page (a read only
 project for now). There is a proposal for a project at
 http://meta.wikimedia.org/wiki/Wiki_for_standards that intends to help
 us where the standards prove to be not good enough. As Sabine is part
 of the team behind OmegaT, it is being researched how OmegaT can read
 and write directly to a MediaWiki project.
 Well, see above, but even if it cannot do that now: it is very helpful
 to work on it locally and offline. The only thing I'd say is important
 is a note on the wikipedia/wikibooks whatever project of the target
 language that this article is being translated by XYZ.

 One other aspect that is needed in a new project is commitment. People
 who express their support for a new language project should see this
 as an indication of /their/ commitment and not as an expression of
 their opinion. When people start to work on a new project it is
 important that, like on the Neapolitan wikipedia, there are people who
 are knowledgeable and willing to help the newbies. I hope that the IRC
 channel #wikipedia-bootcamp can serve a role for this as well.
 Hmmm ... thanks for the compliment :-) but really there are so many projects
 that do good :-) very often people just don't know about existing tools
 or whatever and often these could be the things that make the difference.

 I always like to take the African languages as an example, because they
 are in perhaps the most difficult position:
 People there often do not have the same possibilities we have... well:
 even if students don't have a stable Internet connection they could well
 benefit hugely from offline versions on a CD or versions that can be
 copied to a hard disk or wherever you can easily store and re-write data
 for a PC. I know that often they do not even have computers like we have.
 There is that 100 USD computer for schools ... well this one would be a
 good approach also for school kids in Europe ... from the first class
 onwards (thinking about kids here where I live) ... but let's go back to
 "offline use and work"

 Well, considering those regions where "online" is a problem, students and
 capable people who would like to see their language present in wikipedia
 could get pages to translate - even a non-native speaker could take care
 of the upload of the pages. And there should be one person who
 speaks/writes enough of that language to add pictures, sound etc. Then
 the finished article can go back to them, and they can distribute it among
 people.

 I know, now you say Wikipedia is an "online encyclopaedia" ... well, it
 might be ... but does this mean that people who do not have stable
 online access and who cannot afford it are second class and therefore
 should not get information in their language? Why should they not be
 allowed to work on such contents, read it, distribute it ...

 Wikipedia is about giving people encyclopaedic information in their
 language - well I see its scope much wider than that - but basically
 that's it ... so we should give it to them, right?

 OmegaT runs on Windows, Linux, OSX etc. so why not take up that possibility?

 There would be so much more to write about these things .... well let's
 take this thread as a good start :-) and let us develop strategies and
 work together.

 Best,

 Sabine

 *****and there is my never ending wish ... that day of at least 48 hours
 ...*****

 _______________________________________________
 Wikipedia-l mailing list
 Wikipedia-l(a)Wikimedia.org
 http://mail.wikipedia.org/mailman/listinfo/wikipedia-l

--
"Take away their language, destroy their souls." -- Joseph Stalin
