Language conversion


zhengzhu

12 Apr 2005

5:12 a.m.

I have started to write some documentation about the Chinese conversion system at meta:

http://meta.wikimedia.org/wiki/Chinese_conversion

People interested in implementing conversion systems for other languages should take a look at it. Note that most features implemented are not Chinese specific.

-- zhengzhu


Neil Harris

13 Apr

7:01 a.m.

zhengzhu wrote:

...

I have started to write some documentation about the Chinese conversion system at meta:

http://meta.wikimedia.org/wiki/Chinese_conversion

People interested in implementing conversion systems for other languages should take a look at it. Note that most features implemented are not Chinese specific.

Fantastic work!

Regarding the word segmentation problem: Googling for "Chinese word segmentation" turns up a large number of references to interesting research. There seems to be a lively research community in this field; perhaps some of its members might want to earn kudos points by collaborating on improving this module's performance and tackling some of the common segmentation issues.

Dynamic programming and hidden Markov methods seem to be popular, and by now we must have quite a large corpus of our own in the zh: Wikipedia.
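To make the dynamic programming idea concrete, here is a minimal sketch of unigram segmentation: every candidate word is scored by the negative log of its relative frequency, and the cheapest split of the sentence is chosen. The function name, the penalty for unknown characters and the toy word counts are all made up for illustration; this is not the method the conversion module uses, and real counts would have to come from a corpus such as zh: itself.

```python
import math

def segment(text, freq, max_len=4, unknown_cost=20.0):
    """Split text into the cheapest word sequence under a unigram model.

    freq maps known words to (hypothetical) corpus counts. Unknown single
    characters get a fixed penalty so that a split always exists.
    """
    total = sum(freq.values())

    def cost(word):
        count = freq.get(word)
        if count is None:
            return unknown_cost if len(word) == 1 else math.inf
        return -math.log(count / total)

    # best[i] is the lowest cost of segmenting text[:i];
    # back[i] is where the last word of that segmentation starts.
    best = [0.0] + [math.inf] * len(text)
    back = [0] * (len(text) + 1)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            candidate = best[start] + cost(text[start:end])
            if candidate < best[end]:
                best[end], back[end] = candidate, start

    # Walk the back-pointers from the end to recover the words.
    words = []
    end = len(text)
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]

# Toy counts, invented for this example only.
toy_freq = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "起源": 30}
print(segment("研究生命起源", toy_freq))  # ['研究', '生命', '起源']
```

With these counts the sketch prefers "研究 / 生命 / 起源" over the greedy longest match "研究生 / 命 / 起源", which is the kind of ambiguity a segmentation step has to resolve before word-level conversion.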

-- Neil

Milos Rancic

7:47 a.m.

As I see it, at this moment the Serbian Wikipedia only needs:

1. A table for transliteration from Cyrillic to Latin.

2. A table for transliteration from Latin to Cyrillic, with some exceptions: users should be able to split the Latin digraphs "lj", "nj" and "dž" with some sign (let's say a backslash: "l\j", "n\j" and "d\ž") when those digraphs don't represent the Cyrillic letters "љ", "њ" and "џ".

3. Other situations should be handled via exceptions.
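As a rough illustration of points 1 and 2, the sketch below builds the Cyrillic-to-Latin table, derives the reverse table from it, and treats a backslash between the two halves of a digraph as a "do not merge" sign. The function names are hypothetical, only lowercase letters are handled, and the exception list from point 3 is left out.

```python
# Serbian Cyrillic -> Latin table (lowercase only, for brevity).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}
# The reverse table includes the three digraphs as two-character keys.
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}
DIGRAPHS = ("lj", "nj", "dž")
SPLIT = "\\"  # assumed separator, as suggested in the message above


def cyr_to_lat(text):
    """Replace each Cyrillic letter; anything else passes through."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


def lat_to_cyr(text):
    """Convert Latin to Cyrillic, merging digraphs unless split by SPLIT."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            out.append(LAT_TO_CYR[pair])
            i += 2
        elif (text[i] == SPLIT and 0 < i < len(text) - 1
              and text[i - 1] + text[i + 1] in DIGRAPHS):
            # Separator inside a would-be digraph: drop it, so the two
            # neighbouring letters are converted one by one.
            i += 1
        else:
            out.append(LAT_TO_CYR.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(lat_to_cyr("ljubav"))        # љубав
print(lat_to_cyr("in\\jekcija"))   # инјекција
print(cyr_to_lat("њива"))          # njiva
```

Note that converting "инјекција" back to Latin gives plain "injekcija", so the split sign only matters for text that is stored in Latin.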

I would like us to have some interaction with the Serbian Wiktionary in the future. I would also like to be able to see all the exceptions from the database; that would be very important linguistic material.

"And now, something completely different!" :) The Serbian language has two standard alphabets and two standard variants: "ekavian" (standard in Serbia) and "iyekavian" (standard in Republic of Srpska and Montenegro). Is it possible to make some markup extension at the level of a character (or group of characters)?

The Serbian word for "milk" is "mleko" in ekavian but "mlijeko" in iyekavian, so with some markup it could be written as "ml{e|ije}ko"; "bread" is "hleb" in ekavian and "hljeb" in iyekavian, so it could be written as "h{l|lj}eb". No one should be forced to use markup, but there are some people at sr: who would do that.
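A minimal sketch of how such markup could be rendered, assuming the first alternative inside the braces is always ekavian and the second always iyekavian. The `render` function and the variant names used as keys are hypothetical; the brace syntax is only the one proposed in this message, not an existing MediaWiki feature.

```python
import re

# Matches {first|second} with no nested braces or extra bars.
VARIANT = re.compile(r"\{([^{}|]*)\|([^{}|]*)\}")


def render(text, variant):
    """Keep the ekavian (first) or iyekavian (second) alternative."""
    index = {"ekavian": 1, "iyekavian": 2}[variant]
    return VARIANT.sub(lambda match: match.group(index), text)


print(render("ml{e|ije}ko", "ekavian"))    # mleko
print(render("ml{e|ije}ko", "iyekavian"))  # mlijeko
print(render("h{l|lj}eb", "iyekavian"))    # hljeb
```

Text without markup passes through unchanged, so nobody would be forced to use it; running the result through a script transliteration step would then give the four combinations of variant and alphabet.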

In other words, Chinese Wikipedia has three variants and Serbian should have four (ekavian Cyrillic and Latin, iyekavian Cyrillic and Latin).

So, please tell me what I need to send you for Serbian transliteration. I'll try to write something about Serbian transliteration at Meta. Also, we should make a general page about transliteration/conversion.

On 4/12/05, zhengzhu zhengzhu@gmail.com wrote:

...

I have started to write some documentation about the Chinese conversion system at meta:

http://meta.wikimedia.org/wiki/Chinese_conversion

People interested in implementing conversion systems for other languages should take a look at it. Note that most features implemented are not Chinese specific.

-- zhengzhu

_______________________________________________
Wikitech-l mailing list
Wikitech-l@wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l

Milos Rancic

8:45 a.m.

On 4/13/05, Milos Rancic millosh@gmail.com wrote:

...

The Serbian word for "milk" is "mleko" in ekavian but "mlijeko" in iyekavian, so with some markup it could be written as "ml{e|ije}ko"; "bread" is "hleb" in ekavian and "hljeb" in iyekavian, so it could be written as "h{l|lj}eb". No one should be forced to use markup, but there are some people at sr: who would do that.

My English is bad... The sentence "No one should be forced to use markup, but there are some people at sr: who would do that." should be "No one should be forced to use markup, but there are some people at sr: who would mark up texts." :)

Ray Saintonge

1:02 p.m.

Milos Rancic wrote:

...

On 4/13/05, Milos Rancic millosh@gmail.com wrote:

...
The Serbian word for "milk" is "mleko" in ekavian but "mlijeko" in iyekavian, so with some markup it could be written as "ml{e|ije}ko"; "bread" is "hleb" in ekavian and "hljeb" in iyekavian, so it could be written as "h{l|lj}eb". No one should be forced to use markup, but there are some people at sr: who would do that.

My English is bad... The sentence "No one should be forced to use markup, but there are some people at sr: who would do that." should be "No one should be forced to use markup, but there are some people at sr: who would mark up texts." :)

I am sure that there are some people who would both use the markup and try to force others to do it too. :-)

While I recognize that there are long established traditions favouring the two script approach to Serbian, I think that the ekavian/iyekavian distinction is going too far. It makes the idea of a single Serbian language a joke, and shows the language as incapable of establishing standards. If the variants in Republika Srpska and Montenegro insist on their own varieties it turns the whole idea of Serbian nationalism into a Gilbert and Sullivan operetta.

For the English Wikipedia to work it had to come to terms with the differences between British and American English as well as many other varieties of English. Canadian English, for example, uses a combination of the two main forms of the language. The same is true of the other European languages whose native form has varied considerably from what has become the standard in geographically distant former colonies. The places that use these other varieties of Serbian are at least contiguous with the Serbian heartland.

The rest of the world doesn't give a damn about this petty whining between Serbian nationalist factions. Maybe they should just learn to get along with each other.

Milos Rancic

1:55 p.m.

On 4/13/05, Ray Saintonge saintonge@telus.net wrote:

...

While I recognize that there are long established traditions favouring the two script approach to Serbian, I think that the ekavian/iyekavian distinction is going too far. It makes the idea of a single Serbian language a joke, and shows the language as incapable of establishing standards. If the variants in Republika Srpska and Montenegro insist on their own varieties it turns the whole idea of Serbian nationalism into a Gilbert and Sullivan operetta.

Hehehe. I understand that the situation with the Serbian standard variants looks funny to others, but let me explain :)

The father of the modern Serbian standard language is Vuk Stefanovic Karadzic. He was iyekavian by origin: he was born in an iyekavian part of western Serbia. (Today that part of Serbia is ekavian because of the Belgrade-oriented centralization of Serbia.)

But the center of Serbian culture was not in Serbia (which became independent in the first half of the 19th century). It was in Vojvodina, which was part of the Austro-Hungarian Empire. And people from Vojvodina, and from Belgrade too, spoke (and speak) ekavian.

One more but: Serbs from Montenegro, eastern Herzegovina and Bosnia were (and are) iyekavian.

So, before Djuro Danicic (Vuk's student) introduced a new Latin alphabet (at first completely adopted by Croats; today only the letter đ/Đ remains from Danicic's original alphabet), Vuk and the Croat Ljudevit Gaj introduced "two variants of the same language". That language went by different names: "Croatian or Serbian", "Serbian or Croatian", "Serbo-Croatian" and "Croato-Serbian". As the centralist communist system didn't want to divide the language into "Croatian" and "Serbian", it divided the language just into iyekavian and ekavian. So Serbs from Bosnia, Herzegovina and Montenegro spoke the same variant of the language as Croats, while Serbs from Serbia spoke another variant. Today the standards are Serbian, Croatian and Bosnian.

When the former Yugoslavia was destroyed, along with the standard Serbo-Croatian language, Croats and Bosniaks had a clear situation: the great majority of Croats and Bosniaks are iyekavian. Serbs, however, did not: some Serbs are ekavian and some are iyekavian.

During the war in Bosnia there was one linguistic experiment (backed by law, military and police), introduced by the Radovan Karadzic regime: the standard language became ekavian. All radio, television, newspapers and state authorities were forced to write in ekavian. (In very polite words, I think it was not a good idea.)

Today the organization that takes care of the Serbian standard is the "Council for Standardization of Serbian Language", which has delegates from academic and government institutions in Serbia, Republic of Srpska and Montenegro. The only part of Serbian language politics with almost a consensus is that the Serbian language has two standards: ekavian and iyekavian. Alphabets are the problem in politics; standards are not.

Children, teenagers and students from Republic of Srpska and Montenegro learn the iyekavian variant, and the differences in texts are bigger than between the two main variants of English. You can have a lot of articles in English without differences like "kilometer-kilometre" or such. But you will probably be able to tell whether a text is iyekavian or ekavian from the first sentence of an article. However, there are no misunderstandings between iyekavian and ekavian speakers.

As I see it, implementing ekavian/iyekavian is easier than implementing transliteration between the Cyrillic and Latin alphabets. And I am sure that you can keep the variants as one text inside the database.

Also, I don't care about Serbian nationalism. I care about the living attributes of Serbian culture. And Wikipedia is a place that can implement them, if people here have good will.

Pablo Saratxaga

5:19 p.m.

Kaixo!

(I lost the thread, so I'm replying here)

On Wed, Apr 13, 2005 at 10:55:14PM +0200, Milos Rancic wrote:

...

...
While I recognize that there are long established traditions favouring the two script approach to Serbian, I think that the ekavian/iyekavian distinction is going too far. It makes the idea of a single Serbian

Is that difference only in the letter "e"? If so, why can't it be decided that some people pronounce it as "je" (as in Russian, where they just write "е" and not "йе") and other people pronounce it as "e", while keeping a single way of writing?

Or the other way around: always write the "j", but some people don't pronounce it.

Having two different writing systems just for that (which means four different writing systems in total) seems unjustified.

Ki ça vos våye bén, Pablo Saratxaga

http://chanae.walon.org/pablo/ PGP Key available, key ID: 0xD9B85466 [you can write me in Walloon, Spanish, French, English, Catalan or Esperanto] [min povas skribi en valona, esperanta, angla aux latinidaj lingvoj]

Nikola Smolenski

11:18 p.m.

On Thursday 14 April 2005 01:19, Pablo Saratxaga wrote:

...

On Wed, Apr 13, 2005 at 10:55:14PM +0200, Milos Rancic wrote:

...
...
While I recognize that there are long established traditions favouring the two script approach to Serbian, I think that the ekavian/iyekavian distinction is going too far. It makes the idea of a single Serbian

Is that difference only in the letter "e"? If so, why can't it be decided that some people pronounce it as "je" (as in Russian, where they just write "е" and not "йе") and other people pronounce it as "e", while keeping a single way of writing?

Or the other way around: always write the "j", but some people don't pronounce it.

No, unfortunately, there are numerous e's and je's which are the same in both dialects. This is explained nicely in [[Serbo-Croatian language]] (at least, it was the last time I checked).

...

Having two different writing systems just for that (which means four different writing systems in total) seems unjustified.

Actually, there are only two writing systems, but as they are phonetic, the same words are written differently because people pronounce them differently.

Milos Rancic

14 Apr

7:01 a.m.

...

Is that difference only in the letter "e"? If so, why can't it be decided that some people pronounce it as "je" (as in Russian, where they just write "е" and not "йе") and other people pronounce it as "e", while keeping a single way of writing?

There are some more differences...

In the 12th century the so-called "new dialects of the Serbian/Croatian language" divided, according to the different reflexes of the old vowel "jat" (capital letter "Ѣ", small letter "ѣ" in Cyrillic, or Ě / ě in Latin), into the ekavian, iyekavian and ikavian variants. The ekavian reflex is almost always "e" (sometimes "i"), the ikavian reflex is always "i", and the iyekavian reflexes are in general "ye" or "iye", but often "e" and rarely "i".

After the 12th century some consonant changes happened. Where the iyekavian reflex "ye" stood after the consonants "l" or "n", the string "lje" (lye) or "nje" (nye) became two sounds, not three, and "lj" and "nj" were no longer "l+j" or "n+j", but "lj" and "nj" as soft "l" and "n".

So the differences became bigger, but not so big. Ekavians and iyekavians are closer to each other than to speakers of any other language. For example, when an iyekavian Croat from Zagreb or an iyekavian Serb from Banja Luka talks with an ekavian Serb from Belgrade, they will not have any problem communicating.

However, in the writing systems there are some differences. Not only "mleko-mlijeko": for example, the word for "notepad" is "beležnica" in the ekavian variant but "bilježnica" in iyekavian. The old form was "bělěžnica" (before the "jat" transformation, after other changes in the language). Ekavians changed "ě" into "e", but iyekavians changed the first "ě" into "i" and the second into "ye". As /j/ made the "l" soft, the "lj" combination doesn't represent two sounds, but one. Around 20,000 lexemes (and around 200,000 forms/words) are affected by the "jat" transformation.

Vuk Karadzic and Ljudevit Gaj sanctioned only ekavian and iyekavian, but not ikavian. So today Croatian and Bosnian have only the iyekavian variant (although a lot of Croats and some Bosniaks are ikavian), and Serbian has iyekavian and ekavian.

...

Or the other way around: always write the "j", but some people don't pronounce it.

Having two different writing systems just for that (which means four different writing systems in total) seems unjustified.

There are a lot of possibilities for solving that problem. However, you and I are not the people who make those decisions. Also, I think that a lot of Serbian linguists don't think much about the complicated situation with the standards. A lot of them would use Cyrillic or Latin (or both, if they don't care about the alphabet) and one dialect. The Orthography of the Serbian Language is written only in Cyrillic, in two variants (ekavian and iyekavian)...

The other question is that the Serbian (Croatian and Bosnian) language has (along with Finnish and, I think, Estonian) one of the most phonetic alphabets. This is a two-century tradition of the Serbian language, and I would be the first to oppose a morphological writing system (as would other linguists). The morphological solution is the introduction of "jat" into the orthography (Ěě in Latin, Ѣѣ in Cyrillic), but a lot of Ekavians would not know where to put it (not every "e" came from "jat"); actually, only Ekavians with a high level of linguistic education know where to put "jat" and where not to.

As for the other way around, I already said that Karadzic's nazi regime tried to do that.

I can say that cultures are complicated. They are not based on bits and formal logic. Hindi and Urdu are linguistically the same language, as are Serbian, Croatian and Bosnian. However, those are different cultures, and each of those cultures has its own set of rules. 10 million Serbs have four writing systems. But 500,000 Lusatian Serbs have two writing systems :) Or, the almost 5 million inhabitants of Papua New Guinea have around 800 different (!) languages!

Heli Retzek

5 May

6:06 a.m.

New subject: Word 2 Wikipedia Macros

I have created a simple WinWord macro that can convert a Word file into Wikipedia format.

http://www.homeopathy.at/wiki/index.php/Word2Wiki

Works rather well and serves most of my needs.

The following format conversions are supported:

- Bold, italic, underline (any combination)
- Dotted lists
- Numbered lists
- Paragraph to <br>
- Headers
- Colored text
- Simple tables

Not yet supported:

- Font size
- Super/subscript
- Background color
- Footnotes
- Pictures
- Sophisticated table formats
- ?????
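For readers who have not seen the target syntax, here is a small sketch of the kind of mapping such a converter performs for a few of the supported items. It is written in Python rather than as a Word macro, and the `Run` class and both function names are invented for this example; the actual macro is on the page linked above and may work quite differently.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A stretch of text with uniform character formatting (hypothetical)."""
    text: str
    bold: bool = False
    italic: bool = False
    underline: bool = False


def run_to_wiki(run):
    """Wrap the text in wiki markup for each formatting flag that is set."""
    out = run.text
    if run.underline:
        out = f"<u>{out}</u>"
    if run.italic:
        out = f"''{out}''"
    if run.bold:
        out = f"'''{out}'''"
    return out


def paragraph_to_wiki(runs, kind="body", level=2):
    """Render one paragraph as a heading, list item or plain line."""
    body = "".join(run_to_wiki(run) for run in runs)
    if kind == "heading":
        marks = "=" * level
        return f"{marks} {body} {marks}"
    if kind == "bullet":
        return "* " + body
    if kind == "numbered":
        return "# " + body
    return body + "<br>"


print(paragraph_to_wiki([Run("Title")], kind="heading"))
# == Title ==
print(paragraph_to_wiki([Run("plain and "), Run("bold", bold=True)]))
# plain and '''bold'''<br>
```

Bold and italic together come out as five apostrophes on each side, which is how wiki markup expresses that combination.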

Kind regards, Heli Retzek

===================== Dr.med. Helmut B Retzek Oberbleichfleck 2 A-4840 Vöcklabruck / AUT +43-7672-23700 (priv -11, Fax -12) http://www.homeopathy.at

Thomas Gries

1:51 p.m.

New subject: Word 2 Wikipedia Macros

Thanks Heli, it looks nice and very useful ! Tom


Thomas Gries

26 May

12:16 p.m.

New subject: Word 2 Wikipedia Macros

Hello Heli,

perhaps you would like to add a link to your page on _this_ one:

http://meta.wikipedia.org/wiki/Word_macros

Kind regards

Thomas Gries Berlin 12159 Berlin Tel.: + 49 30 859 28 73 Fax: + 49 30 85 07 55 92 mobil: + 49 179 - 29 00 691 mailto:mail@tgries.de


Nikola Smolenski

13 Apr

4:56 p.m.

On Wednesday 13 April 2005 15:47, Milos Rancic wrote:

...

"And now, something completely different!" :) Serbian language has two standard alphabets and two standard variants: "ekavian" (standard in Serbia) and "iyekavian" (standard in Republic of Srpska and Montenegro). Is it possible to make some markup extension at the character (and group of characters) level?

The Serbian word for "milk" is "mleko" in ekavian but "mlijeko" in iyekavian, so with some markup it could be written as "ml{e|ije}ko"; "bread" is "hleb" in ekavian and "hljeb" in iyekavian, so it could be written as "h{l|lj}eb". No one should be forced to use markup, but there are some people at sr: who would do that.

I propose the reintroduction of Yat :)
