Chinese, traditional and simplified


Gerard Meijssen

13 Sep 2004

5:37 p.m.

There is a big discussion on wikipedia-l about writing Chinese. One thing I gleaned from it is that zh-tw and zh-cn are used to indicate traditional and simplified Chinese respectively. As it is relevant to Wiktionary to have both correct spellings, I propose to use these codes, as well as the zh code, to indicate Chinese words.

I hope someone has a good suggestion for Serbian, Cyrillic and alphabetic. There are more languages that are written in different character sets. I am looking forward to suggestions.

Thanks, GerardM


Andrew Dunbar

14 Sep

1:13 a.m.

--- Gerard Meijssen gerardm@myrealbox.com wrote:

> There is a big discussion on wikipedia-l about writing Chinese. One thing I gleaned from it is that zh-tw and zh-cn are used to indicate traditional and simplified Chinese respectively. As it is relevant to Wiktionary to have both correct spellings, I propose to use these codes, as well as the zh code, to indicate Chinese words.

I don't have references, but I'm sure I've read that this is not a good idea: all of these countries do use both scripts at times, and sometimes the countries use entirely different words for the same thing, so it doesn't always come down to a choice of character. I think we'd be using a country code as a script code when what we really need is a script code, which would not be ambiguous. Something like zh-trad and zh-simp, or zh-trd and zh-smp.
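To make the region-versus-script distinction concrete, here is a minimal Python sketch. The function name `classify` and the two lookup tables are hypothetical and cover only the codes floated in this thread; zh-trad and zh-simp are the suggestions above, not established codes.

```python
# Hypothetical subtag tables: only the codes mentioned in this thread.
REGION_SUBTAGS = {"tw", "cn"}      # country codes (Taiwan, China)
SCRIPT_SUBTAGS = {"trad", "simp"}  # proposed script codes, not a standard


def classify(tag):
    """Split a tag such as 'zh-tw' and say what kind of subtag it carries."""
    language, _, subtag = tag.lower().partition("-")
    if not subtag:
        kind = "language only"
    elif subtag in REGION_SUBTAGS:
        kind = "region"
    elif subtag in SCRIPT_SUBTAGS:
        kind = "script"
    else:
        kind = "unknown"
    return language, subtag, kind


for tag in ("zh", "zh-tw", "zh-cn", "zh-trad", "zh-simp"):
    print(tag, classify(tag))
```

Under this split, zh-tw says where a word is used and zh-trad says how it is written, so a single entry could carry both pieces of information without one standing in for the other.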

I think I've seen a page somewhere on Microsoft's site that does it this way, but I doubt they really use it. They do have language numbers that take the different scripts into account, though, including the Serbian ones below.

> I hope someone has a good suggestion for Serbian, cyrillic and alphabetic.

Umm Cyrillic is still alphabetic. I think you meant Cyrillic and Latin. How about sr-cyr and sr-lat?

> There are more languages that are written in different character sets. I am looking forward to suggestions.

The least obscure I can think of is Punjabi, which is written in Gurmukhi, its own Indic script; in Shahmukhi, derived from the Urdu script, which is itself derived from the Arabic script; and finally in Devanagari, the most common script in India. But this is already quite obscure. Many former Soviet republics have two or three scripts as well.

Andrew Dunbar (hippietrail)

> Thanks, GerardM

_______________________________________________
Wiktionary-l mailing list
Wiktionary-l@Wikipedia.org

http://mail.wikipedia.org/mailman/listinfo/wiktionary-l

...

===== http://linguaphile.sf.net/cgi-bin/translator.pl http://www.abisource.com


Gerard Meijssen

5:21 a.m.

Andrew Dunbar wrote:

> I don't have references, but I'm sure I've read that this is not a good idea: all of these countries do use both scripts at times, and sometimes the countries use entirely different words for the same thing, so it doesn't always come down to a choice of character. I think we'd be using a country code as a script code when what we really need is a script code, which would not be ambiguous. Something like zh-trad and zh-simp, or zh-trd and zh-smp.
>
> I think I've seen a page somewhere on Microsoft's site that does it this way, but I doubt they really use it. They do have language numbers that take the different scripts into account, though, including the Serbian ones below.

In a Wikipedia context it is not a good idea, as you want to bring people together. In a Wiktionary context things are, imho, different, as we do provide all (correct) words in all languages. In English, many words are spelled differently depending on whether the variety is en-us, en-uk, en-aus, etc. The meaning of a word may be subtly different as well, so it is not always synonyms that we are talking about. With different scripts in one language, you have words in different scripts that are synonymous.

The pronunciation also often differs depending on where you come from: potato, router, etc.

In a Wiktionary you want to define the words and make it plain where each word comes from. Speakers of all English variants can understand each other; it is up to Wiktionary to allow for these differences. Consequently, I realise, it is not only about script. As Wikipedia already has zh-tw and zh-cn as codes, using them within Wiktionary as well is reasonable. For Serbian, sr-cyr and sr-lat make sense to me.

What I am not sure about is how we indicate the word as such; {{-xx-}} indicates a word in a language, and I really want to keep it that way. Would following it up with {{xx-xx}} to indicate the relevant subset (character set or region) be reasonable?
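As a rough illustration of how such a two-level marking could be read by software, here is a minimal Python sketch. It assumes each marker sits on its own line, that language codes are two or three lowercase letters, and that subset codes are two to four; the function name `parse_markers` and those limits are hypothetical, not a description of how the wiki software treats templates.

```python
import re

# Assumed shapes: {{-xx-}} opens a language section, {{xx-yyy}} names a subset.
LANG_HEADER = re.compile(r"\{\{-([a-z]{2,3})-\}\}")
SUBSET = re.compile(r"\{\{([a-z]{2,3})-([a-z]{2,4})\}\}")


def parse_markers(wikitext):
    """Collect language sections and the subset markers that follow them."""
    entries = []
    for line in wikitext.splitlines():
        line = line.strip()
        header = LANG_HEADER.fullmatch(line)
        if header:
            entries.append({"language": header.group(1), "subsets": []})
            continue
        subset = SUBSET.fullmatch(line)
        # Only attach a subset to the section it belongs to.
        if subset and entries and entries[-1]["language"] == subset.group(1):
            entries[-1]["subsets"].append(subset.group(2))
    return entries


sample = "{{-sr-}}\n{{sr-cyr}}\n{{sr-lat}}\n{{-en-}}\n{{en-us}}"
print(parse_markers(sample))
```

Keeping {{-xx-}} as the section opener means existing entries stay valid, and a subset marker is simply ignored when it does not match the language section it appears under.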

Thanks, GerardM


Sabine Cretella

6:03 a.m.

> What I am not sure about is how we indicate the word as such; {{-xx-}} indicates a word in a language, and I really want to keep it that way. Would following it up with {{xx-xx}} to indicate the relevant subset (character set or region) be reasonable?

Hi,

As far as translations are concerned, CAT software distinguishes different versions in the following way:

EN-GB, EN-US, EN-AU (I don't remember the code for New Zealand English; I would need to look it up)

DE-DE, DE-AT, DE-CH for the different German versions.

There is indeed also a differentiation of four different Chinese writings, always indicated with ZH-... I don't know these, but I'll look them up.

When localising websites, simplified Chinese is often chosen, as it is read by Chinese people as well as by Taiwanese people. In the case of Wiktionary, I feel that simplified Chinese and traditional Chinese should maybe have two different Wiktionaries, as they are really very different.

So yes: the way you'd like to identify "sub-languages" is widely used and therefore makes sense.

Ciao, Sabine

...

Muke Tever

4:12 a.m.

On Mon, 13 Sep 2004 19:37:56 +0200, Gerard Meijssen gerardm@myrealbox.com wrote:

> I hope someone has a good suggestion for Serbian, cyrillic and alphabetic. There are more languages that are written in different character sets. I am looking forward to suggestions.

The solution used so far is to link both the word in Cyrillic and its Latin transliteration. This is done with Belarusian (Cyrillic and Lacinka), and presumably for other languages that have had different official scripts in the past. I don't know how extensible such a solution would be for languages like Azeri which have changed scripts many times (it has had Arabic, Cyrillic, and two different Latin orthographies _officially_ over the past hundred years).
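The linking approach can be sketched in a few lines of Python. This is only an illustration: the function name `link_forms` is hypothetical, and the one real example pair is the Belarusian word мова with its Latin spelling mova. Each spelling points at every other spelling of the same word, so a third or fourth orthography is just one more item in the group.

```python
from collections import defaultdict


def link_forms(groups):
    """Map every spelling to the other spellings of the same word."""
    links = defaultdict(set)
    for forms in groups:
        for form in forms:
            links[form].update(other for other in forms if other != form)
    # Sorted lists make the result easy to print and compare.
    return {form: sorted(others) for form, others in links.items()}


# One group per word; each group lists that word in every script it has.
links = link_forms([("мова", "mova")])
print(links["мова"])  # ['mova']
print(links["mova"])  # ['мова']
```

The number of links per word grows with the square of the number of scripts, which is manageable for two scripts but may be part of why the approach feels less comfortable for a language with four orthographies.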

*Muke!

--
website: http://frath.net/
LiveJournal: http://kohath.livejournal.com/
deviantArt: http://kohath.deviantart.com/
FrathWiki, a conlang and conculture wiki: http://wiki.frath.net/
