Speaking of the software, I have some questions:
Yes, Traditional Chinese and Simplified Chinese originated from the same language, but they have now grown into two different languages. The problem of translating Simplified Chinese to Traditional Chinese is not only about characters (although converting Simplified characters to Traditional characters is already a headache). Many of the phrases and idioms are different.
I wonder how software can solve this kind of problem. The only way I can think of is to redirect them one by one. Since both SC and TC are living languages, phrases like these keep growing without end, making redirecting hard work.
Therefore, I think splitting into zh-tw isn't really too bad an idea. Traditional Chinese users can build up their own database, while some of the items can redirect to zh wp.
Well, having read through a lot of the discussion here, I must say the complexities of this issue fascinate me - most aspects of language do. And although I didn't even know of the existence of the two writing systems, I am now going to wade into the discussion from a technical and just generally interested point of view.
One of the key issues seems to be to what degree the two writing systems differ: are they just different representations of the same language, of different dialects, or do they represent fundamentally different languages?
For instance, 簡睦旼 mugua_q0_0p@yahoo.com.tw wrote:
Yes, Traditional Chinese and Simplified Chinese originated from the same language, but they have now grown into two different languages. The problem of translating Simplified Chinese to Traditional Chinese is not only about characters (although converting Simplified characters to Traditional characters is already a headache). Many of the phrases and idioms are different.
Others have suggested that "most" words and phrases are the same, just written differently, with exceptions in specific areas. Henry H. Tan-Tenn gives a specific example:
... Will PRC users accept that a computer is known as an "electronic brain" in Taiwanese Mandarin, and not as a "calculator"? Will a Taiwanese Mandarin (or Japanese) user accept that what looks like the term for "calculator" often refers to a "computer" in PRC Mandarin? Clearly such differences can not always be avoided, and one or another or both have to be used without affecting intelligibility for one of the user groups. The evolution toward mutual textual intelligibility (through learning) and mutual acceptability (through tolerance and/or compromise) might *eventually* contribute to some kind of International Mandarin. Probably there are already transnational business incentives to do so.
Of course, this may not be a representative example of the general kind of difference, but it suggests two things to me:
1) The differences (beyond character mapping) are not particularly major. The comparison that springs to mind is US vs UK English: most of the differences amount to spelling conventions, such as color/colour etc, but there are some terms that are quite different, like sidewalk/pavement. A good example is "jelly": in the UK, "jelly" means a kind of pudding that in the US is generally called "Jell-O"; in the US, "jelly" means a fruit spread that in the UK is generally called "jam". Things like this do cause occasional confusion, and even friction, in international English projects, but en.wikipedia: doesn't seem to suffer a great deal from trying to appeal to both audiences. Now, I'm sure this comparison isn't perfect, but from what people have said so far, the Chinese issue is far more similar to en-gb vs en-us than it is to, say, en vs jp (an example somebody gave where it would be very hard to meaningfully merge the projects).
2) There is a tendency to equate the different dialects of Chinese with the different writing systems, when the two issues may in fact be more or less orthogonal. That is, the Taiwanese use of "electronic brain" for "computer" is a *dialect* issue, which probably exists in speech as well as writing, and there are probably far more dialects with such differences than there are writing systems. Such differences occur in all languages that have diverse populations of speakers: in English, we have not only US and UK variants, but Australian, Irish, Scottish, 'Black English' ('Ebonics'), etc etc. If the differences between dialects/variants of Chinese are genuinely more complex than these, then this needs to be discussed *independent* of the discussion of writing systems.
I understand that the issues necessarily overlap, but if there are multiple dialects/variants that [can] use the same writing system, and multiple writing systems that [can] be used for the same dialect/variant, then splitting the project based on one variable won't solve the problems of the other: you will either have a single-charset project where multiple dialects still have to coexist, or a single-dialect project where multiple charsets still have to coexist.
Well, that's added a lot of words and probably not many ideas to the debate, but I thought I'd throw my thoughts into the mix...
Thanks to Rowan for a good summary! I think most Chinese speakers would agree that the difference between Simplified and Traditional Chinese is a minor one. On the other hand, a pure automatic translation between the two isn't likely to work, either, no matter how sophisticated the technique is. This is generally true for any machine translation, I think.
However, I think the wikipedia project provides a unique platform to tackle this problem, because there are lots of PEOPLE working together to provide quality content. The solution has been suggested several times here before: we use a program to do a simple automatic translation, using some sort of mapping tables. Since there are only minor differences between the variants, we can expect that in general there will not be a lot of errors in the translation. We can then add a special wiki tag in the wikitext to specify how to correct those errors. Since most (if not all) articles in wikipedia are going to be edited again and again anyway, by many different people, to add content, correct errors, etc., why not put in the little bit of extra effort to correct these translation errors as well?
Thus I have started to implement this idea. I am running a test site at http://s87257573.onlinehome.us/wiki, to see how far this can go. It is not going to be easy to make the initial transition--lots of editing involved. But who says writing an encyclopedia should be easy? I welcome all to visit the test site, to try editing a few articles, and see the results. (There is a short English description there as well.)
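To make the mapping-table idea concrete, here is a minimal sketch in Python. It is not the code running on the test site; the three-entry table and the function name auto_convert are assumptions made purely for illustration. It shows the simple automatic pass and one of the errors it leaves behind for a human editor to mark up.

```python
# Illustrative fragment of a Simplified -> Traditional table.
# A real table would have thousands of entries; these three are examples only.
S2T = str.maketrans({"国": "國", "语": "語", "后": "後"})

def auto_convert(text: str) -> str:
    """Character-by-character Simplified -> Traditional conversion."""
    return text.translate(S2T)

print(auto_convert("国语"))  # 國語 (correct)
print(auto_convert("以后"))  # 以後 (correct: "afterwards")
print(auto_convert("皇后"))  # 皇後 (wrong: Traditional writes "empress" as 皇后)
```

The last line is the kind of case the special wiki tag would be for: Simplified 后 covers both Traditional 後 and 后, so a single table entry cannot be right everywhere.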
zhengzhu wrote:
However, I think the wikipedia project provides a unique platform to tackle this problem, because there are lots of PEOPLE working together to provide quality content. The solution has been suggested several times here before: we use a program to do a simple automatic translation, using some sort of mapping tables. Since there are only minor differences between the variants, we can expect that in general there will not be a lot of errors in the translation. We can then add a special wiki tag in the wikitext to specify how to correct those errors. Since most (if not all) articles in wikipedia are going to be edited again and again anyway, by many different people, to add content, correct errors, etc., why not put in the little bit of extra effort to correct these translation errors as well?
Just a question out of curiosity about how you handle this: what's the base language, or is there one? Is the primary version of a document in Simplified, and then there are annotations for how to correctly translate it to Traditional (i.e. [simplified character|proper traditional character]), or is it the other way around, or are both Simplified and Traditional equally base languages?
I'm asking because I was trying to think of a way to do this earlier with two other similar languages, and couldn't think of a good one due to things not being the same in both directions. To take a made-up example, say that Simplified 'a' can be either 'b' or 'c' in Traditional, while Traditional 'b' can be either 'a' or 'd' in Simplified.
Then you could have a "simplified-base" form, that is:
  [a|b] [a|c] d
Or an equivalent "traditional-base" form, that is:
  [b|a] [c|a] [b|d]
The problem with using only one is that if a traditional user edited the translated "simplified-base" form, there would be no annotation on the 'b', because in the simplified->traditional direction it wasn't needed.
Or does this never happen in Chinese? I realize the example is somewhat confusing, but I couldn't come up with a clear one. The plain-English version of the question is: what do you do when there are one-to-many mappings in both directions? Or are the one-to-many mappings in Chinese only from Simplified to Traditional, and never in the other direction? (Even if so, that would still leave it as an interesting question for other language pairs.)
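The made-up a/b/c/d case can be checked mechanically. The following Python sketch uses only the hypothetical mappings from the example (the names S2T, T2S and needs_annotation are invented here) and lists which positions would need an annotation in each direction.

```python
# Made-up mappings from the example: each token maps to its possible translations.
S2T = {"a": ["b", "c"], "d": ["b"]}   # Simplified -> Traditional candidates
T2S = {"b": ["a", "d"], "c": ["a"]}   # Traditional -> Simplified candidates

def needs_annotation(tokens, mapping):
    """Return the positions whose token has more than one possible translation."""
    return [i for i, tok in enumerate(tokens) if len(mapping.get(tok, [tok])) > 1]

simplified = ["a", "a", "d"]    # the "simplified-base" text
traditional = ["b", "c", "b"]   # the same text in the other writing system

print(needs_annotation(simplified, S2T))   # [0, 1]
print(needs_annotation(traditional, T2S))  # [0, 2]
```

Position 2 is ambiguous only when going from Traditional to Simplified, which is exactly the 'b' that carries no annotation in the "simplified-base" form.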
-Mark
Just a question out of curiosity about how you handle this: what's the base language, or is there one? Is the primary version of a document in Simplified, and then there are annotations for how to correctly translate it to Traditional (i.e. [simplified character|proper traditional character]), or is it the other way around, or are both Simplified and Traditional equally base languages?
Glancing at the current test implementation, I gather that neither has 'precedence': to put a manual translation, you say {-zh-cn one version zh-tw the other-}. Whether the mappings are one-to-many in one direction, the other, or both, is not a problem: you simply define what you want displayed, in that particular case, in both versions.
What I'm not clear on, having not looked in any depth, is how the article is actually *stored*. The explanation seems to imply that the characters are simply recognised as being either:
a) in the desired writing system; no action needed
b) in the non-desired writing system; automatic translation required
or c) marked up as a special case; version chosen to match preference as per special syntax
I may be wrong, but if I'm right this is obviously of no use for the more general case of languages/dialects. For that, you'd probably need to store in the database which language the 'original' was in, and then convert based on that. Although then you'd have the problem of changes that weren't easily translatable back to that base, wouldn't you? i.e. base is LangA, a LangB user makes a change, but that change is ambiguous in LangA; how is that change recorded? Similarly, if a naive LangB user "corrects" the automated translation, they may end up creating an error in the LangA document, because they overwrote the original rather than adding special syntax. Ouch. It's more complicated than I expected, unless that's just cos I'm hungry... ;)
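As a rough illustration of the three cases, here is a Python sketch of a renderer for the {-zh-cn ... zh-tw ...-} syntax. It is a guess at the behaviour, not the test site's code: the exact delimiter rules, the two-entry tables and the name render are all assumptions.

```python
import re

# Tiny illustrative tables; characters missing from a table pass through unchanged.
S2T = str.maketrans({"国": "國", "语": "語"})
T2S = str.maketrans({"國": "国", "語": "语"})

# Assumed shape of the manual markup: {-zh-cn <text> zh-tw <text>-}
MARKUP = re.compile(r"\{-zh-cn (.*?) zh-tw (.*?)-\}")

def render(wikitext: str, variant: str) -> str:
    """Render mixed wikitext for variant 'zh-cn' or 'zh-tw'."""
    table = S2T if variant == "zh-tw" else T2S
    out, pos = [], 0
    for m in MARKUP.finditer(wikitext):
        # Cases a) and b): plain text goes through the automatic table.
        out.append(wikitext[pos:m.start()].translate(table))
        # Case c): marked-up text is picked by preference, never auto-converted.
        out.append(m.group(1) if variant == "zh-cn" else m.group(2))
        pos = m.end()
    out.append(wikitext[pos:].translate(table))
    return "".join(out)

text = "国語 {-zh-cn 计算机 zh-tw 電腦-}"   # deliberately mixed scripts
print(render(text, "zh-tw"))  # 國語 電腦
print(render(text, "zh-cn"))  # 国语 计算机
```

Note that the stored text needs no 'base' here: a character already in the desired writing system is simply absent from the table for that direction and is left alone.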
The "dialect" question is a very difficult one to answer and the creation of zh-min-nan: has already made ripples in the zh: community. The difference between Minnan and other "dialects" though is that, as far as I'm aware, none of the other Chinese dialects/Sinitic languages has a large movement to switch to a different writing system.
First and foremost this problem could be looked at in terms of Cantonese.
Modern Cantonese actually has two different written versions: one is just text written for Mandarin speakers but read with Cantonese pronunciations; the other uses Cantonese grammar and vocabulary that Cantonese has but Mandarin doesn't.
Until very recently the latter had the higher status in Hong Kong and Macau; upon reunification, however, the former gained the higher status.
Most Cantonese speakers, even if they don't know Mandarin, can read texts written by a Mandarin speaker with little difficulty, but many of the sentences are not how they would say them in everyday speech.
Then there is also an issue with Classical Chinese which is very different from modern Mandarin. Until very recently any sort of reference work like an encyclopedia would've been written in Classical Chinese which was the literary language.
There may be some movement to start a Classical Chinese Wikipedia but if there is it must be very small.
However Classical Chinese sentences often seem more natural in Cantonese or Hakka or other Southern dialects than do the equivalents in written Mandarin.
Also, if you were to convert zh-min-nan: into Chinese characters it would become apparent very quickly that it wasn't Mandarin, especially because Mandarin uses such words as 的 (de), which many people say is "bastardized classical Chinese" because 的 was originally created exclusively for writing Mandarin. The character properly used for Taiwanese and Classical Chinese is 之 (as you can see, 之 is a basic character, but 的 has two different parts).
--金俊書/Mark
On Tue, 14 Sep 2004 20:43:29 +0100, Rowan Collins rowan.collins@gmail.com wrote:
Just a question out of curiosity about how you handle this: what's the base language, or is there one? Is the primary version of a document in Simplified, and then there are annotations for how to correctly translate it to Traditional (i.e. [simplified character|proper traditional character]), or is it the other way around, or are both Simplified and Traditional equally base languages?
Glancing at the current test implementation, I gather that neither has 'precedence': to put a manual translation, you say {-zh-cn one version zh-tw the other-}. Whether the mappings are one-to-many in one direction, the other, or both, is not a problem: you simply define what you want displayed, in that particular case, in both versions.
What I'm not clear on, having not looked in any depth, is how the article is actually *stored*. The explanation seems to imply that the characters are simply recognised as being either:
a) in the desired writing system; no action needed
b) in the non-desired writing system; automatic translation required
or c) marked up as a special case; version chosen to match preference as per special syntax
I may be wrong, but if I'm right this is obviously of no use for the more general case of languages/dialects. For that, you'd probably need to store in the database which language the 'original' was in, and then convert based on that. Although then you'd have the problem of changes that weren't easily translatable back to that base, wouldn't you? i.e. base is LangA, a LangB user makes a change, but that change is ambiguous in LangA; how is that change recorded? Similarly, if a naive LangB user "corrects" the automated translation, they may end up creating an error in the LangA document, because they overwrote the original rather than adding special syntax. Ouch. It's more complicated than I expected, unless that's just cos I'm hungry... ;)
-- Rowan Collins BSc [IMSoP]
Wikipedia-l mailing list Wikipedia-l@Wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikipedia-l
OK, I have a proposal here which I think may solve the problem (I'm not 100% sure about software implementation though).
It is a mixed solution, but I think it shares the good parts of both solutions.
There would be two separate subdomains, http://zh-tw.wikipedia.org/ and http://zh-cn.wikipedia.org/. If you visit the first, you will see the UI in Traditional, and if you visit the latter the UI will be in Simplified.
Both of these can share the same database, with automatic conversion occurring.
I propose to store all text in Traditional but convert it to Simplified (perhaps with some sort of caching so articles do not have to be re-generated each time) because TC>SC conversion is less ambiguous than SC>TC conversion. If somebody adds text to an article while typing in SC, it will be converted to TC when it is added to the database. In the edit window, though, text will appear in the script of whichever domain you are at. Titles of articles should be converted too. If a mistake is made in conversion when Simplified text is added to the database, eventually somebody browsing at http://zh-tw.wikipedia.org/ will notice this error and hopefully fix it. In the meantime this error won't cause any problems on zh-cn because it will convert back the same way.
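Here is a small Python sketch of that round trip, only to show why a wrong SC>TC guess stays invisible on zh-cn. The tables, the function names and the use of functools.lru_cache as the "caching" are all assumptions for illustration, not a statement about how MediaWiki would do it.

```python
from functools import lru_cache

# Illustrative tables only. Simplified 发 corresponds to both 發 and 髮 in
# Traditional, so the SC -> TC table has to pick one and will sometimes be wrong.
SC_TO_TC = str.maketrans({"头": "頭", "发": "發"})
TC_TO_SC = str.maketrans({"頭": "头", "發": "发", "髮": "发"})

def to_storage(text: str) -> str:
    """Convert submitted text to Traditional before it is saved."""
    return text.translate(SC_TO_TC)

@lru_cache(maxsize=1024)
def to_simplified(stored: str) -> str:
    """Convert stored Traditional text for display on zh-cn (results are cached)."""
    return stored.translate(TC_TO_SC)

stored = to_storage("头发")      # a zh-cn user types "hair"
print(stored)                    # 頭發 (wrong in Traditional; should be 頭髮)
print(to_simplified(stored))     # 头发 (zh-cn readers see exactly what was typed)
```

The error sits in the database until a zh-tw reader changes 頭發 to 頭髮; after that fix the zh-cn rendering is still 头发, because both Traditional characters map back to 发.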
Would it be extremely difficult to have two separate Wikipedias use the same database but use conversion before displaying on the client-side and before new text is added to the database?
Of course, there would be a link on the sidebar that always gave you the option to switch to the other variety, except it would be displayed separately from normal Interwiki links.
--金俊書/Mark
On Tue, 14 Sep 2004 20:31:11 -0700, Mark Williamson node.ue@gmail.com wrote:
There would be two separate subdomains, http://zh-tw.wikipedia.org/ and http://zh-cn.wikipedia.org/. If you visit the first, you will see the UI in Traditional, and if you visit the latter the UI will be in Simplified.
I think you'll have widespread agreement this would be a good idea. Not sure if the localization support in MediaWiki supports this configuration right now, but perhaps the developers could make it happen. (In fact, it could support the complete matrix of browsing Wikipedia of language X using the interface of language Y.)
I propose to store all text in Traditional but convert it to Simplified (perhaps with some sort of caching so articles do not have to be re-generated each time) because TC>SC conversion is less ambiguous than SC>TC conversion.
This is a tougher proposition, since most of the content and contributors use the Simplified form. However, I'd encourage you to broach these subjects on the wikizh-l@wikimedia.org mailing list as well, or at least Cc: them on this.
-Andrew (User:Fuzheado)
On Tue, 14 Sep 2004 20:31:11 -0700, Mark Williamson node.ue@gmail.com wrote: <snip>
I propose to store all text in Traditional but convert it to Simplified (perhaps with some sort of caching so articles do not have to be re-generated each time) because TC>SC conversion is less ambiguous than SC>TC conversion. If somebody adds text to an article while typing in SC, it will be converted to TC when it is added to the database. In the edit window, though, text will appear in the script of whichever domain you are at. Titles of articles should be converted too. If a mistake is made in conversion when Simplified text is added to the database, eventually somebody browsing at http://zh-tw.wikipedia.org/ will notice this error and hopefully fix it. In the meantime this error won't cause any problems on zh-cn because it will convert back the same way.
This is more or less the concept I was mulling over as a very general solution, but I realised that it does have a big disadvantage: naive users 'correcting' the translation may simply shift the error into the opposite version. Or, more specifically, there is no way of distinguishing a translational correction from a factual one. For example:
Say you have a database in English, but with automated conversion to a dialect, which we'll call Blinglish. The English database contains the text "...while eating an apple...", and this is viewed by a Blinglish user. They replace the word 'apple' (in the Blinglish version) with 'orange'. The software now has no way of knowing whether the user is saying that 'orange' is the Blinglish word for 'apple', or whether the Blinglish user is correcting a fact, and the English version should be updated to say 'orange'.
Obviously, the translation corrections *should* be labelled using special markup, but the majority of users find special markup very hard to learn, and huge numbers of users pass through who have no idea how to use such things. In order to encourage them to return and contribute more, we need to not only make the system work *despite* them, but to actively fit them into it.
If, to continue my example, we translate 'orange' back to English, when it is in fact supposed to be an idiomatic translation, another user may come along on the English site and correct it back to 'apple'. The Blinglish version will then be in its original state, and the cycle will continue until a more experienced user spots the ambiguity and marks it up appropriately. A waste of everyone's time, and a definite turn-off for the casual users whose changes keep disappearing.
If we can rely on a majority of the users understanding more than one of the languages involved, we could more-or-less avoid this by providing some obvious mechanism for saying "this change is because of a translation issue", that even technophobes can use. But anyone that only understands one version will not know themselves whether it is a translation issue - only that it is, within the version they are looking at, a mistake...
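To pin down what such a mechanism would have to record, here is a hypothetical Python sketch of the apple/orange case. Every name in it (EditKind, Edit, apply_edit, the overrides dictionary) is invented for illustration, and it assumes the automatic conversion leaves these particular words unchanged.

```python
from dataclasses import dataclass
from enum import Enum

class EditKind(Enum):
    FACTUAL = "factual"          # the base text itself should change
    TRANSLATION = "translation"  # only this variant's rendering should change

@dataclass
class Edit:
    variant: str   # variant the user was viewing, e.g. "blinglish"
    old: str
    new: str
    kind: EditKind

def apply_edit(base_text, overrides, edit):
    """Return (new_base_text, new_overrides) without mutating the inputs."""
    if edit.kind is EditKind.TRANSLATION:
        # Base stays as it is; remember a per-variant rendering instead.
        updated = dict(overrides)
        updated[(edit.variant, edit.old)] = edit.new
        return base_text, updated
    # Factual correction: the change flows into the base for every variant.
    return base_text.replace(edit.old, edit.new), overrides

base = "while eating an apple"
print(apply_edit(base, {}, Edit("blinglish", "apple", "orange", EditKind.TRANSLATION)))
print(apply_edit(base, {}, Edit("blinglish", "apple", "orange", EditKind.FACTUAL)))
```

The two calls take the same visible change and store it in different places, which is the whole difficulty: the software cannot fill in the kind field by itself, and a user who reads only one version may not be able to either.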
This is more or less the concept I was mulling over as a very general solution, but I realised that it does have a big disadvantage: naive users 'correcting' the translation may simply shift the error into the opposite version. Or, more specifically, there is no way of distinguishing a translational correction from a factual one.
Completely agree. That's why I think an explicit markup that says Simplified "apple" IS Traditional "orange" is necessary.
On the UI issue, it seems feasible to have two UIs coexisting without running two separate sites: we can store two versions of every message key in the cache or the database. For example, for "mainpage", we will have "mainpage_cn" that maps to the Simplified title of the mainpage, and "mainpage_tw" that maps to the Traditional title of the mainpage. One mainpage can then be the redirect of the other. When rendering the UI, we will look for things with _cn in the message cache if the language preference is zh_cn, and _tw for zh_tw. I hacked up some code to do this and it is now running on the test site, http://s87257573.onlinehome.us/wiki/. I am not sure if I got the message caching part right, because I currently don't have a way to test it. I think right now the messages are coming from the database.
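The suffixed-key lookup can be sketched in a few lines of Python. This is not the hacked-up code on the test site; the dictionary standing in for the message cache, the fallback behaviour and the name get_message are assumptions.

```python
# Stand-in for the message cache or database; the two values are illustrative.
MESSAGES = {
    "mainpage_cn": "首页",
    "mainpage_tw": "首頁",
}

# Language preference -> key suffix, following the _cn/_tw convention above.
SUFFIX = {"zh_cn": "_cn", "zh_tw": "_tw"}

def get_message(key: str, lang: str) -> str:
    """Look up the variant-specific message, falling back to the bare key."""
    suffixed = key + SUFFIX.get(lang, "")
    # Assumed fallback: bare key if present, else a visible placeholder.
    return MESSAGES.get(suffixed, MESSAGES.get(key, f"<{key}>"))

print(get_message("mainpage", "zh_cn"))  # 首页
print(get_message("mainpage", "zh_tw"))  # 首頁
```

With this shape the same wiki can serve both interfaces, and only the lookup depends on the reader's language preference.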
--------------------- gmail.com at zhengzhu
What is wrong with *my* solution?
--Jin Junshu/Mark
On Thu, 16 Sep 2004 00:27:48 -0400, zhengzhu zhengzhu@gmail.com wrote:
This is more or less the concept I was mulling over as a very general solution, but I realised that it does have a big disadvantage: naive users 'correcting' the translation may simply shift the error into the opposite version. Or, more specifically, there is no way of distinguishing a translational correction from a factual one.
Completely agree. That's why I think an explicit markup that says Simplified "apple" IS Traditional "orange" is necessary.
On the UI issue, it seems feasible to have two UIs coexisting without running two separate sites: we can store two versions of every message key in the cache or the database. For example, for "mainpage", we will have "mainpage_cn" that maps to the Simplified title of the mainpage, and "mainpage_tw" that maps to the Traditional title of the mainpage. One mainpage can then be the redirect of the other. When rendering the UI, we will look for things with _cn in the message cache if the language preference is zh_cn, and _tw for zh_tw. I hacked up some code to do this and it is now running on the test site, http://s87257573.onlinehome.us/wiki/. I am not sure if I got the message caching part right, because I currently don't have a way to test it. I think right now the messages are coming from the database.
gmail.com at zhengzhu
I have no opinion on zh-min-nan but I have a few things to say about your message.
On Tue, 14 Sep 2004 17:09:43 -0700, Mark Williamson node.ue@gmail.com wrote:
First and foremost this problem could be looked at in terms of Cantonese.
Modern Cantonese actually has two different written versions: one is just text written for Mandarin speakers but read with Cantonese pronunciations; the other uses Cantonese grammar and vocabulary that Cantonese has but Mandarin doesn't.
Until very recently the latter had the higher status in Hong Kong and Macau; upon reunification, however, the former gained the higher status.
Most Cantonese speakers, even if they don't know Mandarin, can read texts written by a Mandarin speaker with little difficulty, but many of the sentences are not how they would say them in everyday speech.
I am a Hongkonger and I speak Cantonese, and I think what you're describing is really mixed up. We write "written Chinese" (書面語), not what you called "reading text written for Mandarin speakers". And this "written Chinese" is the same "written Chinese" in Beijing or in Shanghai or in any part of China. Chinese people write Chinese in its common form: before the New Written Language Movement (AD 192x - 193x) this common form was "wenyan" (or Classical Chinese), and after the movement this common form is "baihua" (Modern Chinese).
It is true that what we speak is not what we write. But no one actually writes "Cantonese" except on very vulgar or very casual occasions, like when I'm IM-ing with my friends. We don't teach how to write Cantonese in school, and I think that not many people really know how to write correct Cantonese.
Then there is also an issue with Classical Chinese which is very different from modern Mandarin. Until very recently any sort of reference work like an encyclopedia would've been written in Classical Chinese which was the literary language.
Perhaps you're right if "until very recently" means several decades ago.
<snip>
On Tue, 14 Sep 2004 15:10:10 -0400, Delirium delirium@hackish.org wrote:
Just a question out of curiosity about how you handle this: what's the base language, or is there one? Is the primary version of a document in Simplified, and then there are annotations for how to correctly translate it to Traditional (i.e. [simplified character|proper traditional character]), or is it the other way around, or are both Simplified and Traditional equally base languages?
This is a problem in general, but I think a minor one for the case of Simplified/Traditional Chinese. In short, there is no "base" language, the wikitext can in fact be mixed, using both Simplified and Traditional characters. Here is the long explanation:
Out of about 5000 to 6000 commonly used Chinese characters, about half of them (~2600) have different Simplified/Traditional forms. However, the difference is very regular; there are pretty clear rules on how one maps to another. You can think of this in English as, for example, always changing a Simplified character beginning with "sh" so that it begins with "ch" to get the Traditional form (i.e. ship -> chip, sheep -> cheep, etc.). There are a few exceptions, but most of the time these rules work. As a result, there would be little difficulty for a Chinese editor to recognize both Simplified and Traditional characters, regardless of his or her native language. Plus, the editor must have read the (automatically) translated article first, which should contain far fewer unfamiliar characters. I imagine it would be no more difficult to locate the place one wants to correct than, say, to make changes to a table.
wikipedia-l@lists.wikimedia.org