Maybe this should go on Meta, but I want to see comments here first.
As I see it, there are two ways of mass content adding. The first is generating articles from public data (for example from NASA, the National Geospatial-Intelligence Agency, the French government, etc.). This is by now a fairly common way of adding content in bulk, and I think a number of us have some experience with such work.
The other way is adding content using the English Wikipedia. The English Wikipedia has a lot of categorized articles, a lot of templates, etc. All of these regular structures can be used for automatic content creation on small Wikipedias.
I think the idea of having thousands of articles of a couple of sentences each, well categorized and covering many fields, can be very helpful not only to small Wikipedias but also for spreading free knowledge. It would be a great day for us when people whose native language is Mongolian are able to read about places in the Amazon and movies from Australia in their native language. And this can be done much faster than we think.
And not only that: bots should be able to update the information and take on more tasks over time. Finally, it would become possible to start transferring knowledge between Wikipedias in different languages: if we use the same methodology on different Wikipedias, we will be able to update data semi-automatically (up to fully automatically).
However, this needs a number of people who are interested in such a project:
(1) We would need people who know how to work with bots (pywikipediabot or something similar). (2) We would need to build software on top of the bot core that could be localized, just as MediaWiki is localized; this software should have sentences like "<movie> is a movie made in <year> in <country>. Its genre is <genre>. It was directed by <director>..." in a number of languages. (3) We would need good, consistent work on the English Wikipedia. Rules like "this goes into the table, that goes into the template at the top, this goes into the template in the middle" should be more or less strict (but I see that people already work this way on en:).
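To make point (2) concrete, here is a minimal sketch of what such localized sentence generation could look like; the language entries, field names and example data are invented for illustration and are not part of any existing tool:

    # Minimal sketch: per-language sentence templates filled from structured data.
    # The field names and the template text are invented for illustration only.
    MOVIE_SENTENCES = {
        "en": "{title} is a movie made in {year} in {country}. "
              "Its genre is {genre}. It was directed by {director}.",
        # Other language codes would map to sentences written by native speakers;
        # word order and grammatical cases may force a per-language template
        # rather than word-by-word substitution.
    }

    def movie_stub(lang, record):
        """Render a one-paragraph movie stub in the given language."""
        return MOVIE_SENTENCES[lang].format(**record)

    print(movie_stub("en", {"title": "Mad Max", "year": 1979, "country": "Australia",
                            "genre": "action", "director": "George Miller"}))

The hard part, of course, is not the code but getting good sentences and good name translations for each language, which is exactly why this software would have to be localized the way MediaWiki is.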
This is an RFC. I am looking for your comments.
What you're talking about is slightly dangerous; en: has been known to be incorrect quite often :)
On another note, this idea is better implemented in the Wikidata idea, but off the top of my head I don't know where the information on that lives.
Kind regards, Finne Boonen
-- "Maybe you knew early on that your track went from point A to B, but unlike you I wasn't given a map at birth!" Alyssa, "Chasing Amy" http://hekla.rave.org/cookbook.html - my crossplatform dieet/recipe app
On 1/29/06, Finne Boonen finne@cassia.be wrote:
What you're talking about is slightly dangerous; en: has been known to be incorrect quite often :)
:)) That is part of the collaborative wiki process. If the English Wikipedia is wrong when the bot passes the first time, the second or third pass will be correct. I am not talking about a completely automatic process, but about a team that would work on this issue. Also, if we can make something similar based on the German Wikipedia, that would be good, too.
On another note, this idea is better implemented in the Wikidata idea, but off the top of my head I don't know where the information on that lives.
Pure data about something is better than nothing :) And pure data is encyclopedic, too. Also, this should only be the beginning of the article (people should continue to work on such an article), just as it is only the beginning of the project. It is possible to take keywords from an article and generate a number of sentences that make a lot of sense.
Hi, as this is an RFC, I will comment on the RFC itself and not on the other comments.
Danny mentioned in his response that a bot could do great work. Henna remarked that Wikidata could make a difference. Milos mentions that data may need localisation. I want to remind you of an e-mail that Sabine Cretella sent to the lists. Sabine is really active on the Neapolitan Wikipedia, a project much younger than the Swahili Wikipedia but already with 4,336 articles. The secret of this success is, among other things, that Sabine uses professional tools to translate into Neapolitan. OmegaT, the software Sabine uses, is GPL software and is what is called a CAT, or Computer Aided Translation, tool. It allows for efficient translation and is *not* the same as automated translation.
When we have Wikidata ready for prime time, we will be able to store structured data about a subject. This is not a full solution, as many of the words used in the presentation need to be translated, maybe even localised, to make sense in another language. I, for instance, always have to think about whether 9/11 is the ninth of November or the eleventh of September, even though I know the event itself. In order to present data, labels have to be translated and data may have to be localised. The WiktionaryZ project will help with the labels, and standards like the CLDR define how the localisation is to be done.
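As a small illustration of the point about dates: the same calendar date renders differently per locale, and CLDR is the standard that describes exactly these differences. A sketch using the Babel library (a CLDR-based Python library, mentioned here only for illustration, not as part of any Wikimedia tooling):

    # Illustration only: the same date rendered per locale with Babel, a library
    # built on CLDR data. The exact output may vary with the CLDR version shipped.
    from datetime import date
    from babel.dates import format_date

    d = date(2001, 9, 11)
    print(format_date(d, format="short", locale="en_US"))  # e.g. 9/11/01
    print(format_date(d, format="short", locale="en_GB"))  # e.g. 11/09/2001
    print(format_date(d, format="short", locale="de_DE"))  # e.g. 11.09.01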
We are making steady progress with WiktionaryZ; the first alpha demo project is at epov.org/wd-gemet/index.php/Main_Page (a read-only project for now). There is a proposal for a project at http://meta.wikimedia.org/wiki/Wiki_for_standards that is intended to help us where the standards prove not to be good enough. As Sabine is part of the team behind OmegaT, we are researching how OmegaT can read from and write to a MediaWiki project directly.
One other aspect that new projects need is commitment. People who express their support for a new language project should see this as an indication of *their* commitment and not just as an expression of their opinion. When people start to work on a new project, it is important that, as on the Neapolitan Wikipedia, there are people who are knowledgeable and willing to help the newbies. I hope that the IRC channel #wikipedia-bootcamp can serve a role in this as well.
Thanks, GerardM
Hi, please allow me to add my 2 cents here :-)
Danny mentioned in his response that a bot could do great work. Henna remarked that Wikidata could make a difference. Milos mentions that data may need localisation. I want to remind you of an e-mail that Sabine Cretella sent to the lists. Sabine is really active on the Neapolitan Wikipedia, a project much younger than the Swahili Wikipedia but already with 4,336 articles. The secret of this success is, among other things, that Sabine uses professional tools to translate into Neapolitan. OmegaT, the software Sabine uses, is GPL software and is what is called a CAT, or Computer Aided Translation, tool. It allows for efficient translation and is *not* the same as automated translation.
I am using OmegaT for things I write in Neapolitan, for the simple fact that during translation I get terminology proposed, and already translated or similar sentences are suggested, etc. This has nothing to do with machine translation. What is reused here is a glossary and the translations done previously.
The cities of Campania were uploaded with the pywikipediabot (pagefromfile.py). What is lacking here is documentation of the individual bots and how they work, so that more people could learn how to use them; installing Python and running the bot is not so difficult.
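For anyone who wants to try it: pagefromfile.py reads a plain text file with one block per article. Roughly (the exact markers are configurable and may differ between pywikipediabot versions), such a batch file can be produced like this:

    # Sketch: write a batch of stub pages in the format pagefromfile.py reads.
    # By default the bot looks for {{-start-}} / {{-stop-}} markers and takes the
    # page title from the '''bolded''' first line; check your version's options.
    cities = [
        ("Napoli", "Napoli is a city in [[Campania]], Italy."),
        ("Salerno", "Salerno is a city in [[Campania]], Italy."),
    ]

    with open("cities.txt", "w") as out:
        for title, text in cities:
            out.write("{{-start-}}\n'''%s'''\n%s\n{{-stop-}}\n" % (title, text))

    # Afterwards, something like: python pagefromfile.py -file:cities.txt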
Back to translations with OmegaT: the advantages for languages whose communities often lack stable Internet access are obvious: translations can be planned, and people who can translate, but who maybe would not be good writers, can translate offline and create content that way. Another person can take care of uploading the translated articles, if necessary. Once there is something written, others, who are more likely to be writers than translators, can start editing, adapting, improving, etc.
We have very different people around, and not many of them are really good writers. I myself can write about certain things, but that turns out more like marketing text than encyclopaedic text, so for me "translation" is the easiest way to do things.
The steps to translate articles now are roughly:
1) Copy the text from the original-language wiki into a file. (This can be one page or 50 pages, one file per wiki page or one file with several wiki pages in it; it does not really matter.)
2) Create the project with OmegaT (it must be installed on your computer first, and you need Java on your computer as well).
3) Load the file to be translated.
4) Copy the glossaries you have for that language into the glossary directory.
5) Copy any translation memories from previous projects into the directory for TMs.
6) Reload the project.
7) Translate.
8) Save to target.
9) Copy and paste the translated text into the wiki.
There are some more features that can help you; this is just to show how it basically works. Like any application, it has more features that make the work easier.
People who are interested in using OmegaT for Wikipedia translations should show up. I know that it is being used for content on the scn Wikipedia as well. Maybe it would make sense to create a working group for this tool and exchange our TMs, or upload them to Commons or another common place. Of course, the translation memory could also be stored on a wiki, but that is another step ahead.
When we have Wikidata ready for prime time, we will be able to store structured data about a subject. This is not a full solution, as many of the words used in the presentation need to be translated, maybe even localised, to make sense in another language. I, for instance, always have to think about whether 9/11 is the ninth of November or the eleventh of September, even though I know the event itself. In order to present data, labels have to be translated and data may have to be localised. The WiktionaryZ project will help with the labels, and standards like the CLDR define how the localisation is to be done.
We are making steady progress with WiktionaryZ; the first alpha demo project is at epov.org/wd-gemet/index.php/Main_Page (a read-only project for now). There is a proposal for a project at http://meta.wikimedia.org/wiki/Wiki_for_standards that is intended to help us where the standards prove not to be good enough. As Sabine is part of the team behind OmegaT, we are researching how OmegaT can read from and write to a MediaWiki project directly.
Well, see above; but even if it cannot do that now, it is very helpful to be able to work on the content locally and offline. The only thing I'd say is important is a note on the Wikipedia/Wikibooks/whatever project in the target language that the article is being translated by XYZ.
One other aspect that new projects need is commitment. People who express their support for a new language project should see this as an indication of *their* commitment and not just as an expression of their opinion. When people start to work on a new project, it is important that, as on the Neapolitan Wikipedia, there are people who are knowledgeable and willing to help the newbies. I hope that the IRC channel #wikipedia-bootcamp can serve a role in this as well.
Hmmm ... thanks for the compliment :-) But really, there are so many projects doing good work :-) Very often people just don't know about existing tools or whatever, and often those can be the things that make the difference.
I always like to take the African languages as an example, because they are in perhaps the most difficult position: people there often do not have the same possibilities we have. Even if students don't have a stable Internet connection, they could benefit hugely from offline versions on a CD, or from versions that can be copied to a hard disk or wherever you can easily store and rewrite data for a PC. I know that often they do not even have computers like ours. There is that 100 USD computer for schools; that would be a good approach for school kids in Europe as well, from the first class onwards (thinking about the kids here where I live). But let's go back to offline use and work.
Considering those regions where being online is a problem, students and capable people who would like to see their language present on Wikipedia could be given pages to translate; even a non-native speaker could take care of uploading the pages. And there should be one person who speaks/writes enough of the language to add pictures, sound, etc. Then the finished article can go back to them, and they can distribute it among people.
I know, now you will say Wikipedia is an "online encyclopaedia". Well, it might be, but does this mean that people who do not have stable online access, and who cannot afford it, are second class and therefore should not get information in their language? Why should they not be allowed to work on such content, read it, distribute it?
Wikipedia is about giving people encyclopaedic information in their language. I see its scope as much wider than that, but basically that's it; so we should give it to them, right?
OmegaT runs on Windows, Linux, OS X, etc., so why not take up that possibility?
There would be so much more to write about these things. Let's take this thread as a good start :-) and develop strategies and work together.
Best,
Sabine
*****and there is my never ending wish ... that day of at least 48 hours ...*****
Thank you for the support and interest in this issue. I have made the page http://meta.wikimedia.org/wiki/Mass_content_adding where we should continue to organize this work. I have done a little bit of organizing, but all suggestions and active participation are welcome. This is a huge area and I am sure that I have forgotten a lot of things.
Hi Milos, hi Sabine,
There is one aspect of this project that has been totally missing so far: it has to be checked whether these small Wikipedias really want this mass inclusion. Please do not add such content without consensus.
Just imagine a small Wikipedia with, say, 2,000 articles. If you then add 20,000 stubs about geographic places, this will usually not do the Wikipedia a favour. I am just trying to imagine 2,000 articles following the template "x is a city in y with 20,000 inhabitants. a% of them speak b, c% of them speak d. x lies in the administrative region of z".
It would just produce an incredibly high article count, and once you browse through the Wikipedia, 95% of what you find is template articles. A small Wikipedia's team will probably not have the resources to turn 20,000 geographical stubs into meaningful articles, and even if it did, it would probably prefer to spend its time on things more important than stubs about every village in countries other than its own.
I am not convinced that this is a service. To me it looks more like a burden.
Kind regards, Heiko Evermann
For small Wikipedias we have been thinking about something more meaningful: pseudo-translation of countries, rivers, mountains, islands, movies, actors, etc. from the English Wikipedia. Yes, I know that consensus should be reached before adding 100,000 new articles, but if the people of a small Wikipedia want this, they will localize the software. If they don't want it, they won't have even these stubs in their language.
"Heiko Evermann" heiko.evermann@gmx.de wrote in message news:200602240904.30927.heiko.evermann@gmx.de...
There is one aspect of this project that has been totally missing so far: it has to be checked whether these small Wikipedias really want this mass inclusion. Please do not add such content without consensus. It would just produce an incredibly high article count, and once you browse through the Wikipedia, 95% of what you find is template articles. A small Wikipedia's team will probably not have the resources to turn 20,000 geographical stubs into meaningful articles, and even if it did, it would probably prefer to spend its time on things more important than stubs about every village in countries other than its own.
On the other hand, it would provide visitors with excellent opportunities to edit something relevant to themselves: where they live.
Once they got the editing bug, there would then be plenty of articles they could go forth and clean up.
I'm thinking that a Wikipedia chock-full of geographical stubs is a lot better than one chock-full of pop-culture stubs, which would likely prove totally irrelevant to almost every Wikipedia other than that for the home language.
HTH HAND
What I wonder about TM is: how does it work with languages that have different structures?
It's quite obvious that TM works well for Russian, Italian, Spanish, French, German, and other languages of similar structure. I've heard it also works for Chinese, Japanese, Korean, Arabic, Farsi, and Hebrew.
So my main questions are:
1) Can it handle languages which don't separate words in writing? Examples are Thai, Lao, Japanese, Chinese, and a number of smaller languages.
2) Can it handle languages of all typological classifications? So far I have seen that it works well for isolating languages (such as Chinese and Vietnamese) and inflecting languages (such as Russian, Polish, and Latin), but what about polysynthetic languages (such as Inuktitut, Turkish, Georgian, Adyghe, Abkhaz, and Mohawk)? I would imagine it would be more difficult for these languages. For example, Western Greenlandic "Aliikusersuillammassuaanerartassagaluarpaalli." means "However, they will say that he is a great entertainer, but..." (for other long words like this, just look at the Greenlandic Wikipedia, kl.wp).
3) Can it mass-process huge amounts of content quickly, to be reviewed later by humans?
Mark
-- "Take away their language, destroy their souls." -- Joseph Stalin
Hi Mark,
What I wonder about TM is: how does it work with languages that have different structures?
It's quite obvious that TM works well for Russian, Italian, Spanish, French, German, and other languages of similar structure. I've heard it also works for Chinese, Japanese, Korean, Arabic, Farsi, and Hebrew.
So my main questions are:
- Can it handle languages which don't separate words in writing? Examples are Thai, Lao, Japanese, Chinese, and a number of smaller languages.
Yes, there are translators using Thai, Japanese and Chinese within OmegaT, and we have people on the development team who work with at least one of these languages.
- Can it handle languages of all typological classifications? So far I have seen that it works well for isolating languages (such as Chinese and Vietnamese) and inflecting languages (such as Russian, Polish, and Latin), but what about polysynthetic languages (such as Inuktitut, Turkish, Georgian, Adyghe, Abkhaz, and Mohawk)? I would imagine it would be more difficult for these languages. For example, Western Greenlandic "Aliikusersuillammassuaanerartassagaluarpaalli." means "However, they will say that he is a great entertainer, but..." (for other long words like this, just look at the Greenlandic Wikipedia, kl.wp).
Well, OmegaT uses UTF-8, so most languages are supported; for some we might have to experiment, and others might require special solutions. Basically, anything that is UTF-8 should not create problems.
- Can it mass-process huge amounts of content quickly, to be reviewed later by humans?
No. When talking about OmegaT we are not talking about machine translation but about computer-assisted translation; that means a human translator reuses translation memories from other projects, exchanged TMs, etc. While you translate, the glossary entries are checked and OmegaT shows you the matching entries in a separate window. Should a sentence be equal or similar to a previously translated one, then, depending on your settings within the software, it can either just be proposed in a separate window or OmegaT can overwrite the sentence to be translated with the full or partial match.
One feature I would very much like to see is "assemble from portions", but this will only be up for discussion after OmegaT is connected to WiktionaryZ, that is, when there is TBX support; it does not make sense to talk about this very specific and helpful feature before then.
The translation memory you are working with is only as good as you make it. The more you work with it, the better it becomes. That's basically it.
One thing that I also find very helpful: people who speak a language but are not native speakers can easily check how a word was translated before, in which context, etc. This can help a lot during the work and gives better results, so the proofreading effort required from native speakers will be smaller.
With properly set up segmentation rules, for example, you can go through the born and died sections of the calendar pages quite fast, since the descriptions are quite repetitive.
Please note: I am having a meeting with a group of colleagues this weekend, and next week I am at the University of Pisa to give a presentation and a workshop; so if you write and need an answer from me directly, please note it in the subject, since it could well be that I will not see all posts.
Have a great weekend!
Best, Sabine
Well, what should the people from a smaller Wikipedia do to become a testing place for the new project? I am ready to cooperate for the Ossetic (and probably also the Chuvash and Udmurt) Wikipedia. Really, thousands of new articles might not be such a good idea, but creating stubs about all countries, and probably about the largest cities of the world, sounds good.
So, what is the algorithm?
Slavik IVANOV os, ru, udm: User:Amikeco
On 3/4/06, V. Ivanov amikeco@gmail.com wrote:
So, what is the algorithm?
Go to http://meta.wikimedia.org/wiki/Mass_content_adding and find the subproject "Countries of the world".
The algorithm, step by step:
1) Template infobox for countries: At the moment I have only added the template infobox for countries from the English Wikipedia. At the last meeting (last Sunday) of Wikipedians from Belgrade, we were talking about the infobox for countries (they started that discussion independently of the Mass content adding project) and we realized that the infobox for countries on the English Wikipedia could be better. I'll try to talk with them about improving the English infobox for countries.
We need localization of this template (I think I described that on the page): the values (so the value for "capital" should not be written in English on other Wikipedias) and the left side (the text on the left side of the template; in general, phrases like "capital", "largest city", etc.).
2) General translation: After that, we need translation/transcription of the names of the countries, cities, languages, etc. Maybe we can use some existing localizations (such as the localizations of glibc, KDE, GNOME, the Unicode project, etc.), but I am not sure that Ossetic, Chuvash and Udmurt have such localizations. However, I am sure that there are books from the former Soviet Union with lists of countries, languages, cities, etc.
3) As I see it, after the first two steps we would be able to start the bots.
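To make the three steps a bit more tangible, here is a minimal sketch of step 3, under the assumption that steps 1 and 2 produce simple lookup tables; the template name, parameter names and placeholder translations are invented for illustration:

    # Sketch: a bot combines structured country data taken from en: with the two
    # localisation tables from steps 1 (labels) and 2 (proper names).
    LABELS = {                    # step 1: left-hand side of the infobox
        "capital": "CAPITAL-IN-TARGET-LANGUAGE",
        "official_language": "OFFICIAL-LANGUAGE-IN-TARGET-LANGUAGE",
    }
    NAMES = {                     # step 2: translated/transcribed proper names
        "Mongolia": "MONGOLIA-IN-TARGET-LANGUAGE",
        "Ulan Bator": "ULAN-BATOR-IN-TARGET-LANGUAGE",
        "Mongolian": "MONGOLIAN-IN-TARGET-LANGUAGE",
    }

    def localized_infobox(record):
        """Render one country infobox as wikitext in the target language."""
        lines = ["{{Infobox country"]
        for key, value in record.items():
            lines.append("| %s = %s" % (LABELS.get(key, key), NAMES.get(value, value)))
        lines.append("}}")
        return "\n".join(lines)

    print(localized_infobox({"capital": "Ulan Bator", "official_language": "Mongolian"}))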
Also, note that we are at the beginning of the project. We need to develop the infrastructure (strategies, methods, programs, localizations...), and all people with constructive ideas are welcome not only to participate but to drive the project.
2006/3/4, Milos Rancic millosh@mutualaid.org:
- General translation: After that, we need translation/transcription of the names of the countries, cities, languages, etc.
How can I pass the names to you? We have some ready lists by now:
* Larger islands: http://os.wikipedia.org/wiki/%D0%A1%D0%B0%D0%BA%D1%8A%D0%B0%D0%B4%D0%B0%D1%8...
* Countries: http://os.wikipedia.org/wiki/%D0%91%C3%A6%D1%81%D1%82%C3%A6%D1%82%D1%8B_%D0%...
* Languages that use(d) Cyrillic script: http://os.wikipedia.org/wiki/%D0%9A%D0%B8%D1%80%D0%B8%D0%BB%D0%BB%D0%BE%D0%B...
* Lists of some countries' towns: http://os.wikipedia.org/wiki/%D0%9A%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D0%B...
The lists are more or less official, based on existing dictionaries and localisation rules.
Slavik IVANOV
-- Esperu cxiam!
On 3/9/06, V. Ivanov amikeco@gmail.com wrote:
How can I pass the names to you? We have some ready lists by now (larger islands, countries, languages that use(d) Cyrillic script, lists of some countries' towns).
The main page for localization is http://meta.wikimedia.org/wiki/Mass_content_adding/Localization
You can see how I started the localization process for some NGA data (for example: http://meta.wikimedia.org/wiki/Mass_content_adding/Geographic_data/Using_NGA...).
But just use a namespace like http://meta.wikimedia.org/wiki/Mass_content_adding/Localization/Geography/Co... or whatever you think is best. I opened this issue and I am a participant, but I don't think that I should make explicit rules there alone :) When enough people start to work on this project, we will be able to make stricter rules.
I am preparing some more infrastructure for the project and I'll write to this list when I have made something.
Will it be easy enough to use tables like http://meta.wikimedia.org/wiki/Mass_content_adding/Geographic_data/Using_NGA... in bot processing? Maybe some other format is preferable? Should I just translate the table as it is, or should we work out a special format for such localizations?
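Purely as an illustration (and assuming nothing about the final format): almost any consistent layout will do, because a bot can turn a regular two-column table into a lookup dictionary in a few lines. A sketch assuming rows written as "| English name || local name", with placeholder local names:

    # Hypothetical sketch: read a two-column localisation table into a dictionary
    # that a bot can use for substitution. Assumes rows like "| English || local".
    def parse_rows(wikitext):
        mapping = {}
        for line in wikitext.splitlines():
            line = line.strip()
            if line.startswith("|") and "||" in line:
                english, local = [c.strip() for c in line.lstrip("|").split("||", 1)]
                mapping[english] = local
        return mapping

    sample = """
    {| class="wikitable"
    |-
    | France || LOCAL-NAME-FOR-FRANCE
    |-
    | Germany || LOCAL-NAME-FOR-GERMANY
    |}
    """
    print(parse_rows(sample))  # {'France': ..., 'Germany': ...}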
Another question is: what should our primary goal be? What are we going to start with? The countries of the world? Surely, if we have a reasonably narrow direction to start with, we'll have results sooner.
I've recently announced the project on an Internet news service I run: http://www.e-novosti.info/blog/04.03.2006/1 (in Russian).
-- Esperu cxiam!
Before rushing into this stub mania, please think! Is it really a good idea to mass-add substubs? Is the 'pedia of higher quality with substubs about 20,000 obscure regions than without them? If this is done, I believe it should aim towards making relevant articles that can easily be expanded by people speaking the language. We've seen what mass content adders have done to itwiki and plwiki: they simply rush to get many articles and completely forget about quality... Remember http://meta.wikimedia.org/wiki/Substub_disease !
/Andreas
On 3/4/06, Andreas Vilén andreas.vilen@gmail.com wrote:
Before rushing into this stub mania, please think! Is it really a good idea to mass-add substubs? Is the 'pedia of higher quality with substubs about 20,000 obscure regions than without them? If this is done, I believe it should aim towards making relevant articles that can easily be expanded by people speaking the language. We've seen what mass content adders have done to itwiki and plwiki: they simply rush to get many articles and completely forget about quality... Remember http://meta.wikimedia.org/wiki/Substub_disease !
There is no rush. I don't expect that we will add any relevant data in the next couple of months. For example, if we want to make articles about all the countries in the world, we need a lot of work on the localization of templates, categories and a couple of sentences.
BUT, when we are talking about stubs, take for example an article on the Swahili Wikipedia about the Mongolian language which would contain: (1) how "Mongolian language" is written in Mongolian; (2) that it is spoken in [[China]], [[Kyrgyzstan]], [[Mongolia]] and [[Russia]]; (3) that it is spoken by 5.7 million people; (4) the genetic classification of Mongolian: Altaic (disputed) -> Mongolic -> Eastern -> Oirat-Khalkha -> Khalkha-Buryat -> Mongolian; (5) that the language is official in Mongolia, China (Inner Mongolia) and Russia (Buryatia); (6) that the language is not regulated; (7) its ISO and Ethnologue codes; (8) its consonants and vowels; etc. I think that this would be treated as a stub on the English Wikipedia, but it is a very informative article for a person who knows Swahili and doesn't know English. And it is completely possible to do this with bots and localization.
Also, many languages which don't have more than 50 million speakers (I think there are maybe only 20 languages with more than 50 million speakers!) don't have encyclopedias like Britannica. For example, the biggest encyclopedia in the Serbian language has almost 100,000 articles, BUT I have seen that encyclopedia maybe twice in my life (I have seen Britannica far more often, even though I live in Belgrade!). The "ordinary encyclopedia", which can be found in almost every house, has between 30,000 and 50,000 very small articles. 90% of such articles would be stubs on the Serbian Wikipedia.
In other words: maybe people who are native speakers of English, German, Russian, Chinese, Spanish, French and Portuguese can afford "higher standards" about which articles are stubs and which are not; they may remove a one-sentence article about some Arabic dynasty, but our encyclopedias are full of such articles AND that sentence is very relevant encyclopedic information. And we are not only talking about one- or two-sentence articles, but about articles which would not be stubs even on the English Wikipedia. (I can make a one-page article about Mongolian based on the information in the article on the English Wikipedia.)
Great! It's the "xx is a yy in zz" substubs I'm against. Informational stubs are very good indeed.
/Andreas
Andreas Vilén wrote:
Before rushing into this stub mania, please think! Is it really a good idea to mass-add substubs?
Instead of just introspectively "thinking" about it, if anybody actually made a scientific study of the growth strategies of various wikis (or languages of Wikipedia), I'm afraid they would find that adding lots of stubs actually does pay off, at least in the short run. It makes the website a bigger target for search engine queries, and this draws a bigger audience, from which contributors are recruited. Creating a stub entry for every little town in the country where the language is spoken, or for every semi-famous person who speaks the language, can fill gaps that other websites in the same language didn't cover. The importance of this effect depends heavily on how many other websites already exist in the language. For English, when Wikipedia started in 2001, IMDb.com already covered almost every actor and film director. But in German this was not the case, and the German Wikipedia filled a really big gap during its rapid growth in 2002-2004. I think there can be a really big advantage in taking the substub track to growth. If you want to stay on the quality track, you will need really strong policies and actions.
For Swedish, one can say that susning.nu absorbed the substub phase during its rapid growth in 2002-2003, and this has been used as an argument to do something better with the Swedish Wikipedia.
2006/3/8, Lars Aronsson lars@aronsson.se:
Creating a stub entry for every little town in the country where the language is spoken, or for every semi-famous person who speaks the language, can fill gaps that other websites in the same language didn't cover. The importance of this effect depends heavily on how many other websites already exist in the language.
That's a very important point! Stubs are underestimated by speakers of larger languages, but for those who speak a smaller language even a stub is often a good piece of knowledge. For some it can be useful enough as it is; for others it can be a challenge to write a larger article in *their mother tongue*.
That's why I am interested in the project Milos is talking about.
Slavik IVANOV
-- Esperu cxiam!
Yes, but I guess the same does not apply to 20,000 substubs about towns in Poland and France in a Wikipedia in a language spoken by a few thousand people in Africa...
/Andreas
On 3/10/06, Andreas Vilén andreas.vilen@gmail.com wrote:
Yes, but I guess the same does not apply to 20,000 substubs about towns in Poland and France in a Wikipedia in a language spoken by a few thousand people in Africa...
There are 40,000 in France alone ;)
I don't think that this should be the first goal on the Swahili Wikipedia, but once we have added a number of more relevant articles, places in Congo and Mozambique will be relevant enough. And after Africa, places in Europe and Asia will be relevant enough, too.