Different alphabets for the same language

monk＠zoomcon.com

25 Mar 2005

8:58 p.m.

Hello wikitech-l,

The Belarusian language (http://en.wikipedia.org/wiki/Belarusian_language) now has two quite widely used alphabets, Cyrillic and Latin (it also has an Arabic alphabet, but that one is very rarely used).

For now, the be: Wikipedia uses Cyrillic. But we really need a Latin version for those who prefer that alphabet. We have strict bidirectional rules for transforming any text between Cyrillic <-> Latin.

We are interested in creating a "live mirror" (an automatic converter) between the Cyrillic and Latin alphabets for the be: Wikipedia. That is, it would be great if anyone could read and submit any article in either alphabet.

As far as I know, something similar was created for the different scripts of the Chinese language, so this problem should already have been worked through.

I'm an experienced PHP+MySQL developer myself, so I can participate directly in this project.

Can anyone share their thoughts or offer any help with this? It is surely an interesting and quite important matter.

Thank you.

-- Best regards, Monk ([[en:User:Monkbel]]) mailto:monk@zoomcon.com

zhengzhu

26 Mar

10:19 p.m.

You can take a look at LanguageZh.php, which implements most stuff related to the Chinese conversion system.

If the conversion between Cyrillic and Latin is strictly one-to-one, then it should be fairly simple to support, since most of the related code is already written. All we probably need is a mapping table between the Cyrillic and Latin alphabets. If you can put the table somewhere, I can probably implement it and put up a test site fairly quickly. But since I have no knowledge of those languages, I may be wrong.

-- zhengzhu

Brion Vibber

27 Mar

1:52 a.m.

zhengzhu wrote:

...

If the conversion between Cyrillics and Latin is strictly one-to-one, then it should be fairly simple to support that, since most related code is already written. All we need is probably a mapping table between the Cyrillics and the Latin alphabet. If you can put the table somewhere, I can probably implement it and put up a test site fairly quickly. But since I have no knowledge with those languages, I can be wrong.

The last time this was brought up, I seem to recall that the conversion is _not_ perfect; foreign words and names of either Latin or Cyrillic-spelling origin may not use the generic conversions. Also if you're quoting eg foreign-language text, math, or programming code (all likely to come up in an encyclopedia) it's going to be harder; it'd be necessary to be able to specify non-convertible text.

-- brion vibber (brion @ pobox.com)

Nikola Smolenski

3:26 a.m.

On Saturday 26. March 2005. 22:52, Brion Vibber wrote:

...

zhengzhu wrote:

...
If the conversion between Cyrillics and Latin is strictly one-to-one, then it should be fairly simple to support that, since most related code is already written. All we need is probably a mapping table between the Cyrillics and the Latin alphabet. If you can put the table somewhere, I can probably implement it and put up a test site fairly quickly. But since I have no knowledge with those languages, I can be wrong.

The last time this was brought up, I seem to recall that the conversion is _not_ perfect; foreign words and names of either Latin or Cyrillic-spelling origin may not use the generic conversions. Also if you're quoting eg foreign-language text, math, or programming code (all likely to come up in an encyclopedia) it's going to be harder; it'd be necessary to be able to specify non-convertible text.

You remember correctly :) Basically, conversion from the Cyrillic to the Latin alphabet is possible and easy to do, but conversion from Latin to Cyrillic is an impossible task. It would thus be possible to create a static Wikipedia mirror in the Latin alphabet, if there were any need for it; but making it editable and importing the results back into Cyrillic would not work and would only create problems.

Brion Vibber

7:03 a.m.

Nikola Smolenski wrote:

...

You remember correctly :) Basically, conversion from the Cyrillic to the Latin alphabet is possible and easy to do, but conversion from Latin to Cyrillic is an impossible task. It would thus be possible to create a static Wikipedia mirror in the Latin alphabet, if there were any need for it; but making it editable and importing the results back into Cyrillic would not work and would only create problems.

If converting Cyrillic -> Latin is good enough, and the primary editing environment is going to be Cyrillic, then adapting zhengzhu's Chinese conversion support probably will work okay.

At http://zh.wikipedia.org/ you'll notice three additional tabs at the right-hand of the top tab bar: "不转换" for unconverted display, "简体" for simplified conversion, and "繁体" for traditional conversion. These conversions don't change the source text or alter editing; it's for rendered page display and user interface only.

We could have the 'Cyrillic' display just pass through the source unaltered, and the 'Latin' display go ahead and apply the conversions in one direction for display only.

-- brion vibber (brion @ pobox.com)

zhengzhu

8:45 a.m.

...

At http://zh.wikipedia.org/ you'll notice three additional tabs at the right-hand of the top tab bar: "不转换" for unconverted display, "简体" for simplified conversion, and "繁体" for traditional conversion. These conversions don't change the source text or alter editing; it's for rendered page display and user interface only.

The implementation also supports a markup that allows manually specifying the conversion of a specific word/phrase in situations where the mapping table is not sufficient. For example, -{zh-cn:foo; zh-tw:bar}- will show up as "foo" when "简体"(Simplified Chinese) is selected, and "bar" when "繁体"(Traditional Chinese) is selected; -{foo}- will show up as "foo" no matter what language variant is selected (this should be sufficient for things like quotes, math, etc). There is also a way of customizing the mapping table through a page in the MediaWiki: namespace.
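As a rough illustration of how such -{...}- blocks could be resolved for one variant, here is a minimal Python sketch. It is not the actual MediaWiki code; the function name `resolve_markup` and the `known` parameter are assumptions made for the example, and a real implementation would also have to shield the result from the table-based conversion.

```python
import re

# Hypothetical sketch, not MediaWiki's parser: resolve -{...}- blocks.
BLOCK = re.compile(r"-\{(.*?)\}-", re.DOTALL)


def resolve_markup(text, variant, known=("zh-cn", "zh-tw")):
    """Replace each -{...}- block with the text meant for `variant`."""
    def pick(match):
        body = match.group(1)
        rules = {}
        for part in body.split(";"):
            code, sep, value = part.partition(":")
            # Only treat "code:value" as a rule when code is a known variant,
            # so that -{http://example.org/}- is not misread as a rule.
            if sep and code.strip() in known:
                rules[code.strip()] = value.strip()
        # With no rules, -{foo}- is returned unchanged for every variant.
        return rules.get(variant, body)
    return BLOCK.sub(pick, text)
```

With this sketch, `resolve_markup("x -{zh-cn:foo; zh-tw:bar}- y", "zh-tw")` yields "x bar y", and a plain `-{foo}-` block yields "foo" for any variant.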

-- zhengzhu

zhengzhu

11 Apr

6:02 p.m.

With Monk's help, I have implemented a preliminary system for conversion between Cyrillic and Latin for BE. There is a test site at

http://s87257573.onlinehome.us/be/

-- zhengzhu

Pablo Saratxaga

6:44 p.m.

Kaixo!

On Mon, Apr 11, 2005 at 09:02:24AM -0400, zhengzhu wrote:

...

With Monk's help, I have implemented a preliminary system for conversion between Cyrillic and Latin for BE. There is a test site at

http://s87257573.onlinehome.us/be/

Very nice indeed; a lot of languages should benefit from such a feature.

However, languages that have latin script in their list show that there are various exceptions to implement:

- html entities (starting with '&' and ending with ';') must not be converted.
- interwiki links should be handled as exceptions too (as the list of valid interwiki domains is known, e.g. the possible xx in [[xx:foo]], it should be easy to implement)
- it would be nice also to detect urls and not convert them
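One way to honour the entity and URL exceptions is to split the text on the protected spans and convert only what lies between them. The following Python sketch assumes a caller-supplied `convert` function; the pattern and the name `convert_outside_protected` are illustrative, not part of the test site's code, and interwiki links are not handled here.

```python
import re

# Assumed patterns: HTML entities such as &amp; or &#160;, and plain http(s) URLs.
PROTECTED = re.compile(r"(&#?\w+;|https?://[^\s<>\"\]]+)")


def convert_outside_protected(text, convert):
    """Apply `convert` to everything except entities and URLs."""
    parts = PROTECTED.split(text)
    # re.split with one capturing group puts the protected matches
    # at the odd indices of the result.
    return "".join(p if i % 2 else convert(p) for i, p in enumerate(parts))
```

Here `convert` could be any transliteration function; in the test below `str.upper` stands in for it.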

Then there would need to be a way to force conversion even for things that would otherwise be exceptions (that is, the opposite of -{ }- ). The really nice thing would be a way to suggest the appropriate conversion for a given string. For example, foreign people's names may be written differently in Cyrillic and Latin; maybe something like "blablabla ={Latn:Saratxaga|Cyrl:Сарачага}= blabla", which would be displayed as "blablabla Saratxaga blabla" or "блаблаблабла Сарачага блаблабла", but not as "блаблабла Саратхага блаблабла".

Maybe the Latn:/Cyrl: prefixes could be dropped, as the script can be determined from the strings themselves; the syntax would then be easier for editors: "blablabla ={Saratxaga|Сарачага}= blabla" or, if they write in Cyrillic, "блаблаблабла ={Сарачага|Saratxaga}= блаблабла".

Maybe we can start to create the transliteration rules for some languages and put them somewhere (on meta?)

-- Ki ça vos våye bén, Pablo Saratxaga http://chanae.walon.org/pablo/ PGP Key available, key ID: 0xD9B85466 [you can write me in Walloon, Spanish, French, English, Catalan or Esperanto] [min povas skribi en valona, esperanta, angla aux latinidaj lingvoj]

zhengzhu

12 Apr

4:06 a.m.

...

very nice indeed, a lot of languages should benefit of such feature.

However, languages that have latin script in their list show that there are various exceptions to implement:

html entities (starting with '&' and ending with ';') must not be converted.

noted.

...

interwiki links should be handled as exceptions too (as the list of valid interwiki domains is known (eg, the possible xx in [[xx:foo]]) it should be easy to implement

it would be nice also to detect urls and not convert them

The conversion happens at a rather late stage of the wiki parser at which the input should be largely (x)html. I have attempted to avoid converting them but may have missed something. Can you provide an example at the test site when this is not done correctly?

...

then, it would be need to have a way to force conversion even for things otherwise being exceptions (that is, the opposite of -{ }- ); and the very nice thing would be a way to suggest the appropriate conversion for a given string (for example, foreign people names could be written differently in cyrillic and latin, maybe something like "blablabla ={Latn:Saratxaga|Cyrl:Сарачага}= blabla", that would be displayed as "blablabla Saratxaga blaba" or "блаблаблабла Сарачага блаблабла" but not as "блаблабла Саратхага блаблабла" maybe the Latn:/Cyrl: could be removed, as the script can be found from the strings, syntax will then be easier for the editors: "blablabla ={Saratxaga|Сарачага}= blabla" or, if they write in cyrillic: "блаблаблабла ={Сарачага|Saratxaga}= блаблабла"

This function is built in and is running at ZH. At the BE test site, you can use the following syntax (note how close it is to your suggestion:)

-{be-cyrillics: Foo; be-latin: Bar}-

This will show "Foo" in cyrillics mode, and "Bar" in latin mode.

-- zhengzhu

Pablo Saratxaga

10:12 p.m.

Kaixo!

On Mon, Apr 11, 2005 at 07:06:19PM -0400, zhengzhu wrote:

...

...

interwiki links should be handled as exceptions too (as the list of valid interwiki domains is known (eg, the possible xx in [[xx:foo]]) it should be easy to implement

it would be nice also to detect urls and not convert them

The conversion happens at a rather late stage of the wiki parser at which the input should be largely (x)html. I have attempted to avoid converting them but may have missed something. Can you provide an example at the test site when this is not done correctly?

For interwiki links, the test site doesn't recognize them as interwiki links, so I don't know if the code handles it correctly or not...

For URLs, I didn't test enclosed URLs (e.g. [http://wa.wikipedia.org/ foo]) but plain-text ones (e.g. http://wa.wikipedia.org/ or pablo@walon.org in the article, without square brackets around them).

I did some more tests; in order to have a URL not transliterated, I have to write:

[http://wa.wikipedia.org/ -{http://wa.wikipedia.org/}-]

and for an email (which hasn't any special meaning in wiki syntax, unlike the http url): -{pablo@walon.org}-

well, we can live with it, indeed.

...

...
maybe something like "blablabla ={Latn:Saratxaga|Cyrl:Сарачага}= blabla", that would be displayed as "blablabla Saratxaga blaba" or "блаблаблабла Сарачага блаблабла" but not as "блаблабла Саратхага блаблабла" maybe the Latn:/Cyrl: could be removed, as the script can be found from the strings, syntax will then be easier for the editors: "blablabla ={Saratxaga|Сарачага}= blabla" or, if they write in cyrillic: "блаблаблабла ={Сарачага|Saratxaga}= блаблабла"

This function is built in and is running at ZH. At the BE test site, you can use the following syntax (note how close it is to your suggestion:)

-{be-cyrillics: Foo; be-latin: Bar}- This will show "Foo" in cyrillics mode, and "Bar" in latin mode.

It would be better imho to standardize on ISO 15924 script codes (with possible site-local aliases, the same way as "Talk:" etc. can be translated, but "Talk:" always works in all Wikipedias).

I also think that "|" would be a better separator (as it is used in a lot of other wiki syntax), and it would allow ";" to be included in the -{ }- block.

Auto-detection of the script, so that there is no need to state it explicitly, would be the icing on the cake (it is not always possible for Hans/Hant, indeed; but for most other cases of multi-script needs there is no ambiguity at all). That is also why "|" would be better: in -{Foo;Bar}- you can't be sure that the ";" is used as a separator, while in -{Foo|Bar}- the likelihood of "|" being part of the text is much, much smaller (and in such odd cases, <nowiki>|</nowiki> could be used).
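Script auto-detection for a "|"-separated block could look roughly like the Python sketch below. It relies on Unicode character names from the standard `unicodedata` module and covers only Latin and Cyrillic; the function names are made up for the example, and, as noted above, this approach cannot tell Hans from Hant.

```python
import unicodedata


def detect_script(s):
    """Return "Latn" or "Cyrl" (ISO 15924 codes), or None if unclear."""
    counts = {"Latn": 0, "Cyrl": 0}
    for ch in s:
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            counts["Latn"] += 1
        elif name.startswith("CYRILLIC"):
            counts["Cyrl"] += 1
    if counts["Latn"] == counts["Cyrl"]:
        return None  # empty, mixed half and half, or another script
    return max(counts, key=counts.get)


def pick_alternative(block, wanted):
    """From "Saratxaga|Сарачага", pick the alternative in script `wanted`."""
    for alt in block.split("|"):
        if detect_script(alt) == wanted:
            return alt.strip()
    return None
```

The order of the alternatives does not matter here, which is what would let editors write whichever script they prefer first.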

Is it possible to add a test case for a latin<->arabic site? (Kurdish or Azeri are two likely candidates), due to the right-to-left nature of arabic script, it could show some more problems that aren't seen in a cyrillic<->latin case.

Thanks

monk＠zoomcon.com

27 Mar

1:12 p.m.

New subject: Re[2]: Different alphabets for the same language

Hello zhengzhu,

Saturday, March 26, 2005, 9:19:32 PM, you wrote:

...

You can take a look at LanguageZh.php, which implements most stuff related to the Chinese conversion system.

thanks, I'll look at it.

...

If the conversion between Cyrillics and Latin is strictly one-to-one, then it should be fairly simple to support that, since most related code is already written. All we need is probably a mapping table between the Cyrillics and the Latin alphabet.

I'll send you private mail.

...

If you can put the table somewhere, I can probably implement it and put up a test site fairly quickly. But since I have no knowledge with those languages, I can be wrong.

OK, that's great. The only problem is with Latin names, interwiki links and so on; I'm not sure how to handle them. Maybe the converter from Cyrillic to Latin should put some tags around strings that were originally Latin, so that the Latin to Cyrillic converter will know what not to convert.
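The tagging idea can be sketched in Python as follows. Everything here is an assumption made for illustration: the -{ }- markers are borrowed from the Chinese markup mentioned elsewhere in this thread, and the three-letter tables are toy stand-ins, not the real Belarusian rules.

```python
import re

LATIN_RUN = re.compile(r"[A-Za-z]+")
MARKED = re.compile(r"(-\{.*?\}-)")


def to_latin(text, cyr_to_lat):
    """Wrap runs that are already Latin, then transliterate the rest."""
    marked = LATIN_RUN.sub(lambda m: "-{" + m.group(0) + "}-", text)
    return cyr_to_lat(marked)


def to_cyrillic(text, lat_to_cyr):
    """Transliterate back, leaving marked runs untouched and unwrapped."""
    parts = MARKED.split(text)
    # Odd indices are the marked spans; strip the "-{" and "}-" wrappers.
    return "".join(p[2:-2] if i % 2 else lat_to_cyr(p)
                   for i, p in enumerate(parts))


# Toy one-letter tables, for illustration only.
TO_LAT = str.maketrans("так", "tak")
TO_CYR = str.maketrans("tak", "так")

latin = to_latin("так talk", lambda s: s.translate(TO_LAT))
cyrillic = to_cyrillic(latin, lambda s: s.translate(TO_CYR))
```

In this toy run `latin` is "tak -{talk}-" and `cyrillic` is the original "так talk"; without the markers, "talk" would have been partly turned into Cyrillic on the way back.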

-- Best regards, monk mailto:monk@zoomcon.com

Nikola Smolenski

3:20 a.m.

On Friday 25. March 2005. 17:58, monk@zoomcon.com wrote:

...

Belarusian language (http://en.wikipedia.org/wiki/Belarusian_language) has now two quite widely used alphabet versions - Cyrillics and Latin (actually, it also has Arabic alphabet, but it is too rarely used).

I have to say that this is not true at all. For example, Google search on "так" (which means "yes", a fairly common word) on .by domain returns 571,000 hits, while search on "tak" returns 6,110 hits, of which not all are in Belarusian. About one percent of use is not "widely used" by any standard, I wouldn't even say that it is rarely used.

...

For now, be: wikipedia uses Cyrillics. But we really need Latin version for those, who prefers to use this alphabet. We have strict bidirectional rules to transform any text between Cyrillics <-> Latin.

I also have to say that this is even less true. Nowhere on the Belarusian Wikipedia have I found that anyone needs a Latin version of it: there's no vote on the main page about it, and no one has ever asked about it on Аб'явы or anywhere else I looked. Possibly "we" doesn't mean "Wikipedia users", but then I'd be interested to know who the "we" are who need a Latin version of the Belarusian Wikipedia.

Milos Rancic

12 Apr

2:27 a.m.

I am implementing transliteration from Cyrillic to Latin for the Serbian Wikipedia. I am not sure that Serbian or Belarusian transliteration can pass as "live transliteration" without problems.

The Chinese situation is almost clean: one set of characters should be changed into another set. If they use the Latin (or Cyrillic) alphabet (for referencing) or Arabic numerals, those are not changed during transliteration.

Transliteration between the Cyrillic and Latin alphabets is complicated by a number of problems (the Serbian Latin and Cyrillic orthographies have some differences, too). If the Latin alphabet in Belarusian has equal status with Cyrillic (as it does in Serbian, almost), you can't forbid writing in Latin. And if you want to transliterate from Latin to Cyrillic, you'll have a lot of English words transliterated into Cyrillic. Also, what about references in Cyrillic? If you add some Russian bibliography, you'll have Russian text transliterated into Latin.

So I started to build some (static) infrastructure (using pywikipediabot) for Serbian. As the situation is similar for Belarusian, we can do it together (or you can use my code when I finish it for the Serbian Wikipedia).

On Mar 25, 2005 6:58 PM, monk@zoomcon.com monk@zoomcon.com wrote:

...

Hello wikitech-l,

Belarusian language (http://en.wikipedia.org/wiki/Belarusian_language) has now two quite widely used alphabet versions - Cyrillics and Latin (actually, it also has Arabic alphabet, but it is too rarely used).

For now, be: wikipedia uses Cyrillics. But we really need Latin version for those, who prefers to use this alphabet. We have strict bidirectional rules to transform any text between Cyrillics <-> Latin.

We are interested in creating of a "live mirror" (automatic translator) between Cyrillics and Latin alphabets for be: Wikipedia. I mean, it would be great, if anyone could read and submit any article in either alphabet.

As far as i know, something similar was created for different alphabets of Chinese language, so this issue should be worked over already.

I'm myself an experienced PHP+MySQL developer, so I can directly participate in this project.

Can anyone provide their thoughts and any help about this issue? It is surely interesting and quite important thing.

Thank you.

-- Best regards, Monk ([[en:User:Monkbel]] mailto:monk@zoomcon.com

Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

Ray Saintonge

2:53 a.m.

Milos Rancic wrote:

...

And if you want to transliterate from Latin to Cyrillic, you'll have a lot of English words transliterated in Cyrillic. Also, what about referencing in Cyrillic? If you add some Russian bibliography, you'll have Russian text transliterated into Latin.

Good bibliographic practice would have it that references remain in their original language and script. Any transliterated title would only be an addition to what is already there, and is mostly for the purpose of being able to put mixed-script references in the same alphabetical order. Listing references in their own language warns the potential user about the language that he must read if he wants to go there. If you're creating links, a Roman link to a Cyrillic text won't work.

Serbs have a history of being able to go back and forth between scripts, but that habit is not shared in many other cultures. I haven't been following the proposals to change scripts for Belorussian, but I anticipate that it would take at least a full generation before people start to be comfortable with the change.

monk＠zoomcon.com

2:37 p.m.

New subject: Re[2]: Different alphabets for the same language

Hello Milos,

Tuesday, April 12, 2005, 12:27:27 AM, you wrote:

...

I am implementing the transliteration from Cyrillic to Latin for Serbian Wikipedia. I am not sure that Serbian or Belarussian transliteration can pass "live transliteration" without problems.

Problems obviously exist, but they are solvable.

...

Transliteration between Cyrillic and Latin alphabets is complicated because of a number of problems (Serbian Latin and Cyrillic orthographies have some differences, too): If Latin alphabet in Belarussian has equal status (such as in Serbian, almost) with Cyrillic, you can't forbid writing in Latin. And if you want to transliterate from Latin to Cyrillic, you'll have a lot of English words transliterated in Cyrillic.

The suggestion here is to put some edit-time markup around non-convertible Latin words.

...

Also, what about referencing in Cyrillic? If you add some Russian bibliography, you'll have Russian text transliterated into Latin.

This is the problem for which no solution has been found yet. I think we will end up with some markup in the database itself.

...

So, I started to build some (static) infrastructure (using pywikipediabot) for Serbian. As the situation is similar for Belarussian, we can do it together (or you can use my code when I finish it for Serbian Wikipedia).

Great, could you show a test site or something? Could you explain the idea of your work, especially why you use pywikipedia?

-- Best regards, monk mailto:monk@zoomcon.com

Milos Rancic

3:37 p.m.

New subject: Re[2]: Different alphabets for the same language

Yes, the problems can be solved after some time of work. I just think that we are far away from live transliteration (as in Chinese). But static transliteration can be a test for future live transliteration.

Markup is a good idea, but I am not sure how often it would be used. Some people would not use it because they don't know how, and some would not use it because they don't like the other alphabet (the second problem is very real in Serbian culture :) ). However, it would be useful if MediaWiki supported something like a "don't touch" marker (for example "---", or something like that). At first it would be used for bot parsing; but when the algorithm becomes stable, it should be used by the MediaWiki engine.

The bot's reports from its work a month and a half ago can be found at the page [[:sr:Корисник:Millbot/извештаји]] (http://sr.wikipedia.org/wiki/%D0%9A%D0%BE%D1%80%D0%B8%D1%81%D0%BD%D0%B8%D0%B...). (I stopped the transliteration because I have to build some other parts of the bot's infrastructure.)

You can find that part of the bot's source code at http://millosh.org/software/ltafos/pos/ (use version 0.3.2; I have a newer one on my laptop, but it is not stable). Unpack pywikipediabot (http://sourceforge.net/projects/pywikipediabot/), unpack pos, go to the pos directory and link all of the pywikipediabot files into the pos directory. Note that pos is in something like an alpha stage and you should be careful when using it. There is some documentation in English, but the Serbian documentation is better (as you are Belarusian, I think you can understand documentation in Serbian).

At the moment I am working on some of the bot's other tasks, but I'll finish them in a month or two and then continue working on transliteration. Millbot should be a system with IRC and pywikipediabot parts. It should monitor channels like #srrc.wikipedia and act on the information it gets there: when someone makes a change to a Cyrillic text, it should immediately apply the change to the Latin text, and vice versa.

In the first stage of transliteration, Millbot would make changes and synchronize only from the Cyrillic texts. When it becomes cleverer, it should work with Latin texts, too. But that is a long-term task.

Also, the Bosnian Wikipedia is in almost the same situation as the Serbian one. They start from Latin and, according to their culture (and their Main Page), they should have Cyrillic pages, too. When I implement Latin->Cyrillic transliteration for Serbian, it will be a Latin->Cyrillic transliteration implementation for Bosnian, too.

And, one more thing: I think we should move our talk to pywikipediabot list.

zhengzhu

5:06 p.m.

On Apr 11, 2005 5:27 PM, Milos Rancic millosh@gmail.com wrote:

...

I am implementing the transliteration from Cyrillic to Latin for Serbian Wikipedia. I am not sure that Serbian or Belarussian transliteration can pass "live transliteration" without problems.

Chinese situation is almost clean: One set of characters should be changed into another set. If they use Latin (or Cyrillic) alphabet (for referencing) or Arabic numbers, they would not change it during transliteration.

Transliteration between Cyrillic and Latin alphabets is complicated because of a number of problems (Serbian Latin and Cyrillic orthographies have some differences, too): If Latin alphabet in Belarussian has equal status (such as in Serbian, almost) with Cyrillic, you can't forbid writing in Latin. And if you want to transliterate from Latin to Cyrillic, you'll have a lot of English words transliterated in Cyrillic. Also, what about referencing in Cyrillic? If you add some Russian bibliography, you'll have Russian text transliterated into Latin.

The Chinese Wikipedia has similar problems, although for different reasons. For example, sometimes people's names shouldn't be converted at all no matter what variant is in use, and sometimes different variants translate foreign words differently. So there is a user-customizable dictionary for each language variant that can be used to define such special conversion rules. There is also a special markup that can be used in the text to define specific conversion rules just for that piece of text.

In the case of converting Latin to Cyrillics, I think the same thing can be used. The conversion table can be augmented with words and phrases that should not be converted to Cyrillics under any condition. Those words that can both be English and Serbian (or Belarussian) can be manually marked up in the text.

-- zhengzhu

Milos Rancic

5:36 p.m.

On Apr 12, 2005 2:06 PM, zhengzhu zhengzhu@gmail.com wrote:

...

The Chinese wikipedia has similar problems, although for different reasons. For example, sometimes people's names shouldn't be converted at all no matter what variant is in use, sometimes different variants translates foreign words differently. So there is a user customizable dictionary for each language variant that can be used to define such special conversion rules. There is also a special markup that can be used in the text to define specific conversion rules just for that piece of text.

In the case of converting Latin to Cyrillics, I think the same thing can be used. The conversion table can be augmented with words and phrases that should not be converted to Cyrillics under any condition. Those words that can both be English and Serbian (or Belarussian) can be manually marked up in the text.

1. I can only guess what is written in the Chinese interface, so how did you handle article names? Do you have both names, in Simplified and Traditional Chinese?

2. I think MediaWiki should have one general module for transliteration with extensions for specific languages. The general module should be based on the Chinese module. Is it possible to start working that way?

3. Also, we should try to make the system clever: some formal and some statistical methods can help in recognizing whether we should transliterate something or not (e.g. if the system finds some non-Serbian Cyrillic letters, it should not transliterate them into Latin, and vice versa).
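The simplest formal check of the kind described in point 3 is to look for Cyrillic letters that fall outside the 30-letter Serbian alphabet. The Python sketch below is only a heuristic under that assumption (a Russian word that happens to use only shared letters would pass unnoticed), and the function name is made up for the example.

```python
import unicodedata

# The 30 letters of the Serbian Cyrillic alphabet, lowercase.
SERBIAN_CYRILLIC = set("абвгдђежзијклљмнњопрстћуфхцчџш")


def has_foreign_cyrillic(text):
    """True if the text contains a Cyrillic letter that Serbian does not use."""
    for ch in text.lower():
        is_cyrillic = unicodedata.name(ch, "").startswith("CYRILLIC")
        if is_cyrillic and ch not in SERBIAN_CYRILLIC:
            return True
    return False
```

A converter could skip any word or phrase for which this returns True, which would leave a Russian bibliography entry such as "Русский язык" in its original script.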

zhengzhu

5:55 p.m.

...

I can just guess what is written in Chinese interface, so how did you cover article names? Do you have both names: in Simplified and Traditional Chinese?

In most cases titles are converted using the same conversion system, i.e. using the conversion table and doing a strtr(). For weird situations, there is also support for manually specifying the title conversion inside the article body, using this syntax: -{T|zh-cn:foo; zh-tw: bar}-
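For readers who don't know PHP: strtr() called with an array replaces the longest matching key first and never rescans text it has already replaced. A rough Python equivalent, with a placeholder table rather than any real conversion table, might look like this:

```python
import re


def strtr(text, table):
    """Approximate PHP's strtr($text, $array) for a dict of replacements."""
    keys = sorted((k for k in table if k), key=len, reverse=True)
    if not keys:
        return text
    # Alternation tries longer keys first at each position; re.sub never
    # rescans its own output, so replacements cannot chain.
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], text)
```

The no-rescan property is what makes a table that maps in both directions safe: with `{"a": "b", "b": "a"}`, the input "ab" becomes "ba" rather than collapsing to one letter.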

Also built in is the ability to find articles that are written in different variants when wikilinking.

Here is an example: Let's say that in the conversion table, "foo" in zh-cn is converted to "bar" in zh-tw and vice-versa. Now someone writing in zh-cn wrote an article titled "foo". When someone with zh-tw preferred sees the article, "bar" will be shown as the article title. Further, say someone using zh-tw edited some article which has a link [[bar]]. The system will identify that the article "foo" should be used for linking, if "bar" is not already created as a redirect.
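The link lookup described in this example can be sketched in Python as below. The set of existing titles and the list of converter functions are stand-ins for the wiki's database and conversion tables; the name `resolve_link` is hypothetical.

```python
def resolve_link(target, existing_titles, converters):
    """Find the article a [[target]] link should point to.

    An exact match (for instance a redirect someone already created)
    wins; otherwise each variant conversion of the title is tried.
    """
    if target in existing_titles:
        return target
    for convert in converters:
        candidate = convert(target)
        if candidate in existing_titles:
            return candidate
    return None  # no such article in any variant


# Toy setup mirroring the example: "foo" (zh-cn) <-> "bar" (zh-tw).
to_cn = lambda title: title.replace("bar", "foo")
to_tw = lambda title: title.replace("foo", "bar")
```

With only "foo" in the set of existing titles, a link to [[bar]] resolves to "foo"; once "bar" exists as a redirect, it resolves to "bar" itself.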

btw, you should be able to change the interface at zh after you register an account;)

...

I think MediaWiki should have one general module for transliteration with extensions for specific languages. General module should be based on Chinese module. Is it possible to start to work in such way?

Indeed. I have always anticipated that the Chinese system can be generalized to other languages. Most of the code for the Chinese system is not specifically tied to the Chinese language, and some code refactoring can be done to provide better support for different languages. Please watch CVS HEAD for the next couple weeks for this to happen.

...

Also, we should try to make system clever: Some formal and some statistic methods can help in recognizing should we transliterate something or not (i.e.: if system find some non-Serbian Cyrillic letters, it should not transliterate it into Latin and vice versa).

That's certainly doable within the current system framework, but will require more specialized algorithms.

-- zhengzhu

Milos Rancic

7:16 p.m.

...

In most cases titles are converted using the same conversion system, i.e. using the conversion table and doing a strtr(). For weird situations, there is also support for manually specifying the title conversion inside the article body, using this syntax: -{T|zh-cn:foo; zh-tw: bar}-

I am thinking about a more general solution: a database table of exceptions. And, more generally, some kind of interaction with Wiktionary.

...

Here is an example: Let's say that in the conversion table, "foo" in zh-cn is converted to "bar" in zh-tw and vice-versa. Now someone writing in zh-cn wrote an article titled "foo". When someone with zh-tw preferred sees the article, "bar" will be shown as the article title. Further, say someone using zh-tw edited some article which has a link [[bar]]. The system will identify that the article "foo" should be used for linking, if "bar" is not already created as a redirect.

What do you keep in database? Simplified, traditional or both?

...

btw, you should be able to change the interface at zh after you register an account;)

I remember spending a few minutes looking at the upper left corner of MS Excel when I tried to find "File" in the Hebrew MS Office :) (it is in the upper right corner). The situation with the Chinese interface is similar :)

I saw a couple of days ago that if I click on Traditional Chinese, I get an "ugly" link with the parameter "variant=zh-tw". Would it be possible for the Simplified Chinese view to have its URL in Simplified Chinese and the Traditional Chinese view in Traditional Chinese? Or a mod_rewrite redirection:

http://zh.wikipedia.org/wiki/<something in Traditional Chinese> is shown, but http://zh.wikipedia.org/wiki/<something in Simplified Chinese>...?variant=zh-tw is read?

...

Indeed. I have always anticipated that the Chinese system can be generalized to other languages. Most of the code for the Chinese system is not specifically tied to the Chinese language, and some code refactoring can be done to provide better support for different languages. Please watch CVS HEAD for the next couple weeks for this to happen.

I knew that Chinese have two alphabets, but I didn't have in my mind that problems are similar to Serbian :) Of course, I found it a couple of months ago...

...

...

Also, we should try to make system clever: Some formal and some statistic methods can help in recognizing should we transliterate something or not (i.e.: if system find some non-Serbian Cyrillic letters, it should not transliterate it into Latin and vice versa).

That's certainly doable within the current system framework, but will require more specialized algorithms.

Inside my extension to pywikipediabot (http://millosh.org/software/ltafos/pos/), I have a statistical guesser: the algorithm guesses the distance between two texts (something like the so-called edit distance, but stochastic, so it can compare whole texts in real time, not only words and phrases). I am using it to guess whether a page is in Serbian or not. However, it can be used (in future forms) for other kinds of stochastic guessing.

Nikola Smolenski

13 Apr 13 Apr

2:51 a.m.

On Tuesday 12 April 2005 15:16, Milos Rancic wrote:

...

...
Indeed. I have always anticipated that the Chinese system can be generalized to other languages. Most of the code for the Chinese system is not specifically tied to the Chinese language, and some code refactoring can be done to provide better support for different languages. Please watch CVS HEAD for the next couple weeks for this to happen.

I knew that Chinese has two alphabets, but it did not occur to me that the problems are similar to those in Serbian :) Of course, I found that out a couple of months ago...

Actually, Chinese doesn't have any alphabet ;)

Frankly, to me this entire conversation sounds like much ado about nothing. Even on Serbian Wikipedia no one actually said that he actually needs the Latin alphabet, and a few people said that they are for Latin because maybe someone would need it. So I don't see why it should be introduced. (The Chinese situation is clear, as there are people who can't understand one or the other script.) And I can imagine that the situation is even clearer on Belarusian Wikipedia, where I simply couldn't find anyone even mentioning transliteration to Latin; frankly, it seems to me that Monk would want to introduce it secretly without asking Belarusian users whether they want it (and whether they would want not to have such a feature).

Milos Rancic

2:46 a.m.

On Apr 12, 2005 11:51 PM, Nikola Smolenski smolensk@eunet.yu wrote:

...

Frankly, to me this entire conversation sounds like much ado about nothing. Even on Serbian Wikipedia no one actually said that he actually needs the Latin alphabet, and a few people said that they are for Latin because maybe someone would need it. So I don't see why it should be introduced.

This is not true. We voted on it last summer. One person explicitly asked for the Latin alphabet ([[sr:Корисник:Sergivs]]), two said that it should be used, and six voted against (a seventh person voted after the voting had concluded). I asked about the consequences of the vote on the wikimeda-l list and the answer was clear: if the Latin alphabet were not included in sr.wikipedia, it would be tyranny of the majority. After all, we can vote on it again, and the situation at this moment would be more pro-Latin.

Nikola, your political reasons for obstructing the introduction of the Latin alphabet into Serbian Wikipedia are clear. But Serbian Wikipedia should be used by all Serbian-speaking people, not only by one political faction.

Nikola Smolenski

11:24 a.m.

On Tuesday 12 April 2005 22:46, Milos Rancic wrote:

...

On Apr 12, 2005 11:51 PM, Nikola Smolenski smolensk@eunet.yu wrote:

...
Frankly, to me this entire conversation sounds like much ado about nothing. Even on Serbian Wikipedia no one actually said that he actually needs the Latin alphabet, and a few people said that they are for Latin because maybe someone would need it. So I don't see why it should be introduced.

This is not true. We voted about it last summer. One person explicitly asked for Latin alphabet [[sr:Корисник:Sergivs]], two said that it should be used, six voted against (seventh person voted after the

You don't say. Well, as I said, and you confirmed, no one said that he actually NEEDS the Latin alphabet. Yes, some WANTED it, but no one NEEDED it. No one complained of not being able to see Cyrillic or write in Cyrillic or about anything else. In addition, the vote was about "inclusion of Latin pages", not about whether each and every page would be in Latin!

...

conclusion of voting). I asked about consequences of voting at wikimeda-l list and answer was clear: if Latin alphabet wouldn't be included into sr.wikipedia, it would be tyranny of majority. After

I am reading that thread right now and I can say you salted it really fine. "In this moment we have very strong xenophobic movement. A lot of them are very active on Internet and on Wikipedia, too. They do not accept Latin alphabet as "Serbian"."

Tyranny of majority is better than tyranny of minority.

...

all, we can vote about it again and situation in this moment would be more pro-Latin.

Could we vote if it would be more pro-Cyrillic?

...

Nikola, your political reasons for obstruction of introduction of Latin alphabet into Serbian Wikipedia are clear. But, Serbian

Well if they are so clear perhaps then you can explain them to me because I don't see them so well, and other people on this list might not see them too.

...

Wikipedia should be used by all Serbian speaking people, not only by one political faction.

What prevents Serbian Wikipedia from being used by all Serbian speaking people right now?

(Milos sent this message to me, and CCed it to the Wikitech list; perhaps the CC is the reason it didn't appear on the list.)

Pablo Saratxaga

3:22 a.m.

Kaixo!

On Tue, Apr 12, 2005 at 10:51:57PM +0100, Nikola Smolenski wrote:

...

Frankly, to me this entire conversation sounds like much ado about nothing. Even on Serbian Wikipedia noone actually said that he actually needs Latin

For Serbian maybe it isn't needed, and for Belarusian I didn't even know there was anyone using the Latin alphabet; but there are languages for which such a feature is absolutely needed, like Kurdish, Azeri, Uzbek, Amazigh, Punjabi, where there are very sizable communities with different writing systems; possibly also Uighur.

Erdal Ronahi

3:29 a.m.

Hi,

I am admin at the Kurdish Wikipedia, and for Kurdish I absolutely agree. Both writing systems are used by millions of people, and we are trying to keep them both in one Wikipedia. Any technical help is warmly welcomed. :-)

Greetings, Erdal

...

...
Frankly, to me this entire conversation sounds like much ado about nothing. Even on Serbian Wikipedia noone actually said that he actually needs Latin

For Serbian maybe it isn't needed, and for Belarusian I didn't even know there was anyone using the Latin alphabet; but there are languages for which such a feature is absolutely needed, like Kurdish, Azeri, Uzbek, Amazigh, Punjabi, where there are very sizable communities with different writing systems; possibly also Uighur.

Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- http://www.ferheng.org

Milos Rancic

3:39 a.m.

On 4/13/05, Pablo Saratxaga pablo@mandriva.com wrote:

...

For Serbian maybe it isn't needed,

Huh... I can only say that I can't believe it. Around 50% of Serbs use the Latin alphabet as their primary alphabet! I use Cyrillic, but my girlfriend and a lot of my friends use Latin as their primary alphabet. Two of my bank cards are written in the Latin alphabet and one is written in Cyrillic. Almost all companies print their business cards in the Latin alphabet (in Serbian). A lot of people don't contribute to Serbian Wikipedia because we are waiting for the Latin implementation... Come to Belgrade and see the situation.

Formally, Cyrillic is the official alphabet. So, if we have to choose between those two alphabets, Cyrillic is "more equal". But if we can use both, we should use both.

Pablo Saratxaga

3:58 a.m.

Kaixo!

On Wed, Apr 13, 2005 at 12:39:10AM +0200, Milos Rancic wrote:

...

On 4/13/05, Pablo Saratxaga pablo@mandriva.com wrote:

...
For Serbian maybe it isn't needed,

Huh... I can only say that I can't believe it. Around 50% of Serbs use the Latin alphabet as their primary alphabet! I use Cyrillic,

I didn't say that it wasn't needed for Serbian, just that I didn't know whether sr: users needed it (which is different from wanted; yes, it would be a welcome feature, but it isn't absolutely needed, as anyone literate in the Serbian language can read Cyrillic; the situation is different for other languages where different communities use different alphabets and may not be able to read the other).

What I wanted to point out is that, even if the Latin/Cyrillic transliteration feature is not used on sr: or be:, exploring it is still useful, as there are other languages that would benefit from it.

(btw, I wasn't aware that the proportion of Latin alphabet users was so high; yes, I knew that anyone could write it, and probably used it in things like SMS or email, but I thought that when it came to writing with a pen on paper most people used Cyrillic)

...

But, if we can use both, we should use both.

I fully agree (and there is no need to have a 50%-50% situation); btw, in my work (I'm responsible for localization at a software company) I decided long ago to provide the choice of both scripts (by requiring translators to use Cyrillic, then converting from Cyrillic to Latin, as that is easy to do, while the conversion from Latin to Cyrillic is painful, due to the large number of things that must remain unchanged (URLs, command names, file and path names, email addresses, etc.))
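To illustrate why the Latin to Cyrillic direction is the painful one, here is a minimal Python sketch that converts Serbian Latin text while leaving protected spans untouched. It is an assumption-laden illustration, not anyone's production code: the `PROTECTED` pattern only covers URLs and email addresses (command names and file paths would need their own rules), and the names `TABLE`, `convert_plain` and `latin_to_cyrillic` are hypothetical.

```python
import re

# Serbian Latin -> Cyrillic; the digraphs lj, nj, dž map to single letters.
TABLE = {
    "lj": "љ", "nj": "њ", "dž": "џ",
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "đ": "ђ",
    "e": "е", "ž": "ж", "z": "з", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р",
    "s": "с", "t": "т", "ć": "ћ", "u": "у", "f": "ф", "h": "х",
    "c": "ц", "č": "ч", "š": "ш",
}
# Add capitalised ("Lj") and all-caps ("LJ") forms.
for lat, cyr in list(TABLE.items()):
    TABLE[lat.capitalize()] = cyr.upper()
    TABLE[lat.upper()] = cyr.upper()

# Longest keys first, so digraphs win over single letters.
LETTERS = re.compile(
    "|".join(re.escape(k) for k in sorted(TABLE, key=len, reverse=True))
)

# Assumed definition of "things that must remain unchanged".
PROTECTED = re.compile(r"https?://\S+|\S+@\S+")


def convert_plain(text: str) -> str:
    """Transliterate a span that contains no protected material."""
    return LETTERS.sub(lambda m: TABLE[m.group(0)], text)


def latin_to_cyrillic(text: str) -> str:
    """Transliterate everything except spans matched by PROTECTED."""
    out, pos = [], 0
    for m in PROTECTED.finditer(text):
        out.append(convert_plain(text[pos:m.start()]))
        out.append(m.group(0))  # keep URL / email address as-is
        pos = m.end()
    out.append(convert_plain(text[pos:]))
    return "".join(out)
```

Even with protection in place, the digraph rule is only a heuristic: in words such as "injekcija" or "nadživeti" the letter pairs "nj" and "dž" are two separate sounds and should stay two letters in Cyrillic, so a real converter would also need an exception list.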

Milos Rancic

6:50 p.m.

...

I didn't say that it wasn't needed for Serbian, just that I didn't know whether sr: users needed it (which is different from wanted; yes, it would be a welcome feature, but it isn't absolutely needed, as anyone literate in the Serbian language can read Cyrillic; the situation is different for other languages where different communities use different alphabets and may not be able to read the other).

...

(btw, I wasn't aware that the proportion of Latin alphabet users was so high; yes, I knew that anyone could write it, and probably used it in things like SMS or email, but I thought that when it came to writing with a pen on paper most people used Cyrillic)

Thank you. Now I understand the consequences of Nikola's xenophobic propaganda. He equated SMS and email communication with the poetry and scientific papers which have been written (and are still being written) in the Latin alphabet.

So, the question is whether we need it. The answer is: no, we can use the ASCII character set and the English language for communication, and we don't even need a Serbian Wikipedia.

Look, I am sick of two opposed mainstream (sick) factions: one is xenophobic and rejects all Western influences on Serbian culture, and the other is an exponent of Western cultural imperialism and rejects all traditional parts of Serbian culture. The first doesn't want to see the Latin alphabet as a Serbian alphabet; the other doesn't want to see the Cyrillic alphabet as a Serbian alphabet. The xenophobic faction is better organized, but the exponents of Western cultural imperialism have very strong "unofficial" centers. In substance, both factions are the same: they are aggressive and they pressure others to be like them. Since the first half of the 20th century we have had a word for people like them: they are fascists. And I hope Wiki(p|m)edia is not a project which supports fascists.

It seems that I have to explain cultural situation in Serbia to Wiki(p|m)edia's mailing lists, from time to time. So, let me try again.

Children in Serbia, Montenegro and Republic of Srpska (the part of Bosnia and Herzegovina) learn Cyrillic during the first year of primary school (at the age of 6-7 years). All of them (I am not sure about Republic of Srpska, but I think so) learn the Latin alphabet during the second year of primary school (at the age of 7-8 years). If someone doesn't know the Latin alphabet, (s)he didn't finish the second year of primary school. And this has been the situation in Serbian culture for around 50 years.

The Latin alphabet was introduced into Serbian culture by the Serbian philologist Djuro Danicic, who worked as secretary of the Croatian Academy of Sciences and Arts (the name was "Yugoslavian Academy", not "Croatian") in Zagreb during the 19th century. But Serbs started to use the Latin alphabet widely between the end of WWII and the 1960s.

In the 1970s and 1980s the Latin alphabet was dominant in Serbia. Latin was treated as "fancy", "modern", and other such bullshit. When nationalists came to power with Milosevic, the Cyrillic alphabet started to be used, with the Latin alphabet characterized as "anti-Serb", as a "conspiracy against Serbs", and other such bullshit. After Milosevic, both factions are strong enough.

As the Internet had its own ASCII-based rules, during the 1990s we had a lot of problems with the introduction of the Cyrillic alphabet into computers. Also, computer technology came from the West along with Western cultural imperialism: during the 1990s there were not many computer workers who had any idea about using the Cyrillic alphabet (or even the Serbian Latin alphabet; they used only ASCII).

We have a deeply divided society. In this situation we are talking about the introduction of the Latin alphabet. A few years ago there was an analogous situation with the introduction of the Cyrillic alphabet into KDE and Microsoft products.

And, during the 1990s, opposition to Cyrillic inside computer circles was very, very strong. Again, anyone who was working on Cyrillic was characterized as "pro-Milosevic", as a "nazi" (along with, of course, other such bullshit).

My conclusion is:

If you (I mean, all the people involved in Wikimedia) respect Serbian culture, you have to understand that Serbian culture has two alphabets.

If there is no possibility of having two alphabets (though I don't see why MediaWiki wouldn't have that possibility), I think that it should be Cyrillic (that is what I think, but not everyone in Serbia does; some people think that it should be Latin!). Cyrillic is the better choice because of transliteration, for sure. But if we have the possibility of using two alphabets (we are not working on ASCII or 8-bit terminals anymore; Unicode has become the standard), we should use it.

It is a matter of culture, not a matter of understanding. Bear in mind: a lot of people don't want (or don't like, or don't know how) to use the Cyrillic alphabet on computers. And some of them want to become contributors to Serbian Wikipedia, but they don't want (don't like, don't know how) to do that in Cyrillic.

With only the Cyrillic alphabet, half of Serbs are excommunicated from Serbian Wikipedia. Nikola wants that. I am wondering whether others want that?

...

...
But, if we can use both, we should use both.

I fully agree (and there is no need to have a 50%-50% situation); btw, in my work (I'm responsible for localization at a software company) I decided long ago to provide the choice of both scripts (by requiring translators to use Cyrillic, then converting from Cyrillic to Latin, as that is easy to do, while the conversion from Latin to Cyrillic is painful, due to the large number of things that must remain unchanged (URLs, command names, file and path names, email addresses, etc.))

I am glad to know that there are computer people outside Serbia who care about Serbian culture, when there are not so many computer people in Serbia who care about it.

Yes, conversion from Latin to Cyrillic is painful. But, we (or I) have to make it in the future. A lot of texts in Serbian are written in Latin alphabet and we need it in Cyrillic, too.

Lars Aronsson

9:38 p.m.

Milos Rancic wrote:

...

If you (I mean, all the people involved in Wikimedia) respect Serbian culture, you have to understand that Serbian culture has two alphabets.

If there is no possibility for two alphabets (but, I don't see that MediaWiki doesn't have that possibility), I think that it should be

If you (I mean, all Serbs) respect wikitech-l culture, you have to understand that there can only be one language: PHP.

If you can write a solution in PHP for having one MediaWiki site where every article can be displayed in two different alphabets at the same time (according to each user's preference, I guess), you are welcome to introduce this feature in the next release. Go ahead.

Clearly, the Serbian Wikipedia cannot be the only website in Serbian. There must reasonably be other Serbian websites which have already addressed and solved this problem, and from where you can get ideas for a practical and working solution.

-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se

Milos Rancic

10:18 p.m.

On 4/13/05, Lars Aronsson lars@aronsson.se wrote:

...

Milos Rancic wrote:

...
If you (I mean, all the people involved in Wikimedia) respect Serbian culture, you have to understand that Serbian culture has two alphabets.

If there is no possibility for two alphabets (but, I don't see that MediaWiki doesn't have that possibility), I think that it should be

If you (I mean, all Serbs) respect wikitech-l culture, you have to understand that there can only be one language: PHP.

:) I am working on it according to Zhengzhu's principles. (It sounds like some kind of Chinese philosophy :) )

Nikola Smolenski

14 Apr 14 Apr

4:27 a.m.

On Wednesday 13 April 2005 17:38, Lars Aronsson wrote:

...

Clearly, the Serbian Wikipedia cannot be the only website in Serbian. There must reasonably be other Serbian websites which have already addressed and solved this problem, and from where you can get ideas for a practical and working solution.

Some time ago, I wrote a text about this topic, so I believe that I am the right person to answer this.

There are several ways of building a bi-alphabetic web site.

If your web site is static (that is, serving HTML files, no scripting), you have to create the entire tree of your site in both alphabets. There are tools which can help with this, but there is nothing which you can just point at a Cyrillic web site and which will output an entire Latin web site, so some work has to be done manually.

If you have a static site, but also the ability to run scripts, a simple solution could be a script which, upon request, loads a static page, transliterates it and serves it; but I've never seen this actually implemented.

If your web site is dynamic, but done in "raw" PHP or some other language (you again have mostly HTML but use PHP to insert headers, footers and the like), you can add a small script on top which, if requested, buffers the entire page, transliterates it and outputs it; there is one excellent tool for this, though it has somewhat strange licensing, and of course it is not hard to code such a thing from scratch.

Finally, if your site is fully dynamic, built on a CMS for example, you either have to use transliteration built into the software (I don't know of any software having such a thing), or you can use a solution similar to the one I mentioned: have a script which loads a finished page, transliterates it and serves it (I believe I have seen this a few times).

Regardless of which way you use, one thing that is sometimes overlooked is buttons and other images with inscriptions on them; they have to exist in both variants too (and I wonder whether this problem is addressed in some way on the Chinese Wikipedia?).
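The "buffer the finished page, transliterate it and output it" approach can be sketched as below. This is a simplified Python illustration, not the tool mentioned above and not how any particular site does it; `transliterate_page` and the table names are hypothetical. It assumes the page is well-formed enough that a simple regular expression can separate tags from text, and it deliberately leaves everything inside tags (element names, attributes, URLs) untouched.

```python
import re

# Serbian Cyrillic alphabet and the matching Latin letters, in order.
CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш"
LATIN = ["a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i", "j", "k",
         "l", "lj", "m", "n", "nj", "o", "p", "r", "s", "t", "ć", "u",
         "f", "h", "c", "č", "dž", "š"]

_table = {}
for cyr, lat in zip(CYRILLIC, LATIN):
    _table[cyr] = lat
    _table[cyr.upper()] = lat.capitalize()  # Љ -> Lj, Џ -> Dž
TRANS = str.maketrans(_table)

TAG = re.compile(r"(<[^>]*>)")


def transliterate_page(html: str) -> str:
    """Transliterate text between tags; leave the tags themselves alone."""
    parts = TAG.split(html)  # odd indices hold the captured tags
    return "".join(
        part if i % 2 else part.translate(TRANS)
        for i, part in enumerate(parts)
    )
```

A real filter would also have to skip the contents of `<script>` and `<style>` elements, decide what to do with `alt` and `title` attributes, and, as noted above, it cannot do anything about text baked into images.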

However, and this is what is important for this discussion, everyone who has such a site has the primary variant in Cyrillic and makes the Latin variant from it; I think that only Radio Television of Serbia transliterates from Latin to Cyrillic, but as one might guess, it doesn't turn out very nicely.

Nikola Smolenski

4:58 a.m.

On Wednesday 13 April 2005 14:50, Milos Rancic wrote:

...

...
I didn't say that it wasn't needed for Serbian, just that I didn't know whether sr: users needed it (which is different from wanted; yes, it would be a welcome feature, but it isn't absolutely needed, as anyone literate in the Serbian language can read Cyrillic; the situation is different for other languages where different communities use different alphabets and may not be able to read the other).

...

(btw, I wasn't aware that the proportion of Latin alphabet users was so high; yes, I knew that anyone could write it, and probably used it in things like SMS or email, but I thought that when it came to writing with a pen on paper most people used Cyrillic)

Thank you. Now I understand the consequences of Nikola's xenophobic propaganda. He equated SMS and email communication with the poetry and scientific papers which have been written (and are still being written) in the Latin alphabet.

The first time that the words "poetry" or "scientific papers" are mentioned in this thread is in this email of yours to which I am replying. I have never mentioned either. I believe that it is easy to see who's conducting propaganda here.

And, please, in a previous email you said to me that "your political reasons for obstruction of introduction of Latin alphabet into Serbian Wikipedia are clear". As you are the founder of anarchopedia, is it possible that you might have some political reasons too?

...

So, the question is whether we need it. The answer is: no, we can use the ASCII character set and the English language for communication, and we don't even need a Serbian Wikipedia.

No, we cannot. The majority of people in Serbia don't know a word of English and so need a Serbian Wikipedia.

...

Look, I am sick of two opposed mainstream (sick) factions: one is xenophobic and rejects all Western influences on Serbian culture, and the other is an exponent of Western cultural imperialism and rejects all traditional parts of Serbian culture. The first doesn't want to see the Latin alphabet as a Serbian alphabet; the other doesn't want to see the Cyrillic alphabet as a Serbian alphabet. The xenophobic faction is better organized, but the exponents of Western cultural imperialism have very strong "unofficial" centers. In substance, both factions

I see that you see the world in black and white. Well, the world isn't black and white, there are shades of gray in between, and I have never seen an organisation which "rejects all of Western influences into Serbian culture" nor one which "rejects all of traditional parts of Serbian culture". Of the organisations concerned with scripts, I also don't know of any that aims to eradicate either alphabet completely. Oh, and their organisation is the opposite of what you said: pro-Western organisations are much better organised.

...

We have a deeply divided society. In this situation we are talking about the introduction of the Latin alphabet. A few years ago there was an analogous situation with the introduction of the Cyrillic alphabet into KDE and Microsoft products.

On the other hand, I might add that the CEO of Microsoft in Belgrade was replaced because he tried to localize Windows in the Latin alphabet. So, people, be careful! ;)

...

And, during 1990s, opposition to Cyrillic inside of computer circles was very very strong. Again, anyone who was working on Cyrillic was characterized as "pro-Milosevic", as "nazi" (with, of course, other

This is simply a lie.

...

transliteration, for sure. But, if we have possibility to use two alphabets (we are not working on ASCII or 8-bit terminals anymore, Unicode became standard), we should use it.

No, we should not. Why should we?

...

It is a matter of culture, not a matter of understanding. Bear in mind: a lot of people don't want (or don't like, or don't know how) to use the Cyrillic alphabet on computers. And some of them want to become contributors to Serbian Wikipedia, but they don't want (don't like, don't know how) to do that in Cyrillic.

With only the Cyrillic alphabet, half of Serbs are excommunicated from Serbian Wikipedia. Nikola wants that. I am wondering whether others want that?

No, neither is true.

A lot of people don't know how to use the Cyrillic alphabet on computers. However, these same people are the ones who don't know how to use the Latin alphabet on computers. To explain: computers usually come with English keyboards preinstalled. People who don't know how to install another keyboard layout use it to write Serbian, and they do so in "naked" Latin: without the necessary diacritics. I agree that there are a number of such people, and that they currently can't participate on Wikipedia except in a very limited way; however, naked Latin is absolutely impossible to transliterate to either Cyrillic or proper Latin, so they can't participate either way. However, the very moment someone learns how to install a keyboard layout, he or she can use the same procedure to install either a Cyrillic or a Latin keyboard, or both. Wikipedia is in UTF-8 either way, so there are no problems with encoding or other problems.
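The reason "naked" Latin cannot be transliterated mechanically is that dropping diacritics is a many-to-one mapping. A small Python sketch makes the ambiguity concrete; it is only an illustration, `ALTERNATIVES` and `candidates` are hypothetical names, and the table ignores đ (often typed as "dj") and the digraph dž for brevity.

```python
from itertools import product

# Each naked letter may stand for itself or for a letter with a diacritic.
ALTERNATIVES = {
    "s": ["s", "š"],
    "c": ["c", "č", "ć"],
    "z": ["z", "ž"],
}


def candidates(naked_word: str) -> list[str]:
    """List every proper-Latin spelling a naked-Latin word could stand for."""
    options = [ALTERNATIVES.get(ch, [ch]) for ch in naked_word]
    return ["".join(combo) for combo in product(*options)]
```

For the naked word "casa" this yields six candidate spellings, one of which is "čaša"; choosing among them requires a dictionary or context, not a letter table.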

I have said, and will say again: if this were introduced, it would be tyranny of the minority. If a minority needs something, and the majority can give it to them but doesn't want to, that is tyranny of the majority. But if there is a minority who wishes for something and a majority who does not wish for that thing, then having it would be tyranny of the minority.

Nikola Smolenski

13 Apr 13 Apr

11:24 a.m.

On Tuesday 12 April 2005 23:39, Milos Rancic wrote:

...

in Latin alphabet (in Serbian). A lot of people don't contribute to Serbian Wikipedia because we are waiting for Latin implementation...

How do you know about this?

Nikola Smolenski

2:51 a.m.

On Monday 11 April 2005 22:27, Milos Rancic wrote:

...

Chinese situation is almost clean: One set of characters should be changed into another set. If they use Latin (or Cyrillic) alphabet (for referencing) or Arabic numbers, they would not change it during transliteration.

AFAIK, similar problems can arise with Chinese: if an article has Japanese or Korean citation, it should remain unchanged...


wikitech-l@lists.wikimedia.org

33 comments

9 participants


participants (9)

Brion Vibber
Erdal Ronahi
Lars Aronsson
Milos Rancic
monk＠zoomcon.com
Nikola Smolenski
Pablo Saratxaga
Ray Saintonge
zhengzhu