That's the thing with the ambiguity. That's the reason I'm suggesting articles be stored in Traditional and converted into Simplified on-the-run.
Traditional-to-Simplified conversion produces far fewer ambiguous mappings than Simplified-to-Traditional. If a Simplified user writes additional content in Simplified, but parts of it are converted incorrectly before being added to the database, then a special process will take effect:
Since the wrong character and the right character in Traditional are the same character in Simplified, it won't have any effect on Simplified users, who will continue browsing it as it is. However, the error *will* show up to Traditional users, who will then correct it. This correction will have no effect on the appearance of the text to a Simplified user, but it will ensure that the correct character is used for a Traditional user.
This eliminates the need for special semantic markup.
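The round-trip property described above can be sketched in a few lines of Python. The tables below are tiny, hypothetical excerpts (built around the real pair 發/髮, which both simplify to 发), not the actual conversion tables; the point is only that a many-to-one Traditional-to-Simplified map makes a wrong Traditional character invisible to Simplified readers.

```python
# Two distinct Traditional characters collapse to one Simplified character.
T2S = {"發": "发", "髮": "发", "頭": "头"}
# A naive reverse map can keep only one candidate per Simplified character.
S2T_NAIVE = {"发": "發", "头": "頭"}

def to_simplified(text: str) -> str:
    return "".join(T2S.get(ch, ch) for ch in text)

def to_traditional_naive(text: str) -> str:
    return "".join(S2T_NAIVE.get(ch, ch) for ch in text)

simplified_input = "头发"              # "hair", as typed by a Simplified user
stored = to_traditional_naive(simplified_input)
assert stored == "頭發"                # wrong: the correct Traditional is 頭髮

# A Simplified reader still sees exactly the right text...
assert to_simplified(stored) == "头发"
# ...and after a Traditional user corrects the stored text, the
# Simplified reader's view does not change at all.
corrected = "頭髮"
assert to_simplified(corrected) == "头发"
```

The design choice this illustrates: because every ambiguous pair converges in the Simplified direction, corrections made by Traditional readers are guaranteed not to disturb what Simplified readers see.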
--Jin Junshu/Mark
On Wed, 15 Sep 2004 13:51:54 +0100, Rowan Collins rowan.collins@gmail.com wrote:
On Tue, 14 Sep 2004 20:31:11 -0700, Mark Williamson node.ue@gmail.com wrote:
<snip>
> I propose to store all text in Traditional but convert it to Simplified (perhaps with some sort of caching so articles do not have to be re-generated each time) because TC>SC conversion is less ambiguous than SC>TC conversion. If somebody adds text to an article but they are typing in SC, it will be converted to TC when it is added to the database. In the edit window, though, text will appear as whichever domain you are at. Titles of articles should be converted too. If a mistake is made in conversion when a Simplified text is added to the database, eventually somebody browsing at http://zh-tw.wikipedia.org/ will notice this error and hopefully fix it. In the meantime this error won't cause any problems on zh-cn because it will convert back the same way.
This is more or less the concept I was mulling over as a very general solution, but I realised that it does have a big disadvantage: naive users 'correcting' the translation may simply shift the error into the opposite version. Or, more specifically, there is no way of distinguishing a translational correction from a factual one. For example:
Say you have a database in English, but with automated conversion to a dialect, we'll call it Blinglish. The English database contains the text "...while eating an apple...", and this is viewed by a Blinglish user. They replace the word 'apple' (in the Blinglish version) with 'orange'. The software now has no way of knowing whether the user is saying that 'orange' is the Blinglish word for 'apple', or whether the Blinglish user is correcting a fact and the English version should be updated to say 'orange'.
Obviously, the translation corrections *should* be labelled using special markup, but the majority of users find special markup very hard to learn, and huge numbers of users pass through who have no idea how to use such things. In order to encourage them to return and contribute more, we need to not only make the system work *despite* them, but to actively fit them into it.
If, to continue my example, we translate 'orange' back to English, when it is in fact supposed to be an idiomatic translation, another user may come along on the English site and correct it back to 'apple'. The Blinglish version will then be in its original state, and the cycle will continue until a more experienced user spots the ambiguity and marks it up appropriately. A waste of everyone's time, and a definite turn-off for the casual users whose changes keep disappearing.
If we can rely on a majority of the users understanding more than one of the languages involved, we could more-or-less avoid this by providing some obvious mechanism for saying "this change is because of a translation issue", that even technophobes can use. But anyone that only understands one version will not know themselves whether it is a translation issue - only that it is, within the version they are looking at, a mistake...
-- Rowan Collins BSc [IMSoP]
On Wed, 15 Sep 2004 22:58:48 -0700, Mark Williamson node.ue@gmail.com wrote:
Since the wrong character and the right character in Traditional are the same character in Simplified, it won't have any effect on Simplified users, who will continue browsing it as it is. However, the error *will* show up to Traditional users, who will then correct it. This correction will have no effect on the appearance of the text to a Simplified user, but it will ensure that the correct character is used for a Traditional user.
This eliminates the need for special semantic markup.
--Jin Junshu/Mark
As someone pointed out before, the problem is not just in the character-to-character mapping. Some concepts are expressed entirely differently, for example, 电脑 (electronic brain) vs. 计算机 (calculator or computer). A second example is translations of foreign names: for example, Croatia is translated in Mainland China as 克罗地亚, but as 克罗埃西亚 in Taiwan. This kind of difference can be arbitrary, and will likely evolve over time. It is mainly this kind of difference that requires special markup.
In fact, I think the character ambiguity is less of an issue, because most ambiguous characters (although not all) can be disambiguated by looking at the phrase they appear in. Including phrases that contain ambiguous characters in the conversion table should eliminate most of the character ambiguity.
However, I don't have enough data to prove this. I have implemented this idea on the test site, but since I am myself a Simplified user, I can't really tell whether most of the ambiguity at the character level has been eliminated. I urge the Traditional Chinese users on this list to visit the test site and proofread a few converted articles to see how well or badly the current implementation works. The test site is at http://s87257573.onlinehome.us/wiki/
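The phrase-table idea can be sketched as a greedy longest-match converter. The three-entry table below is hypothetical, not the test site's actual data; the point is that a phrase entry takes precedence over the ambiguous single-character fallback.

```python
# Hypothetical Simplified -> Traditional table; longer keys win, so an
# ambiguous character inside a known phrase converts correctly.
S2T_TABLE = {
    "头发": "頭髮",   # phrase entry resolves the 发 ambiguity (hair)
    "发": "發",       # single-character fallback (develop/emit)
    "头": "頭",
}

def convert(text: str, table: dict) -> str:
    max_len = max(len(k) for k in table)
    out, i = [], 0
    while i < len(text):
        # Try the longest possible chunk first, shrinking to one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            out.append(text[i])  # pass unknown characters through unchanged
            i += 1
    return "".join(out)

assert convert("头发", S2T_TABLE) == "頭髮"  # phrase match beats fallback
assert convert("发", S2T_TABLE) == "發"      # lone character uses fallback
```

The fallback case is exactly where residual errors would surface for Traditional readers, which is why proofreading by Traditional users is the useful test.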
On Thu, 16 Sep 2004 09:08:50 -0400, zhengzhu zhengzhu@gmail.com wrote:
As someone pointed out before, the problem is not just in the character-to-character mapping. Some concepts are expressed entirely differently, for example, 电脑 (electronic brain) vs. 计算机 (calculator or computer). A second example is translations of foreign names: for example, Croatia is translated in Mainland China as 克罗地亚, but as 克罗埃西亚 in Taiwan. This kind of difference can be arbitrary, and will likely evolve over time. It is mainly this kind of difference that requires special markup.
This does seem to be a different issue: there are bound to be multiple variations in the language over and above the character conversion, but these are *cultural*, and there's no guarantee [or I doubt there is] that there won't be people who write in Traditional but use the vocabulary normally associated with Simplified. You can also be pretty sure that there are plenty of cases where there are more than two ways of saying something, or more than one way within mainland China alone (it's a big place, after all). As I said before, this is true of en:, but given that we can all basically understand each other, most people seem to consider trying to find a technical solution a waste of time.
If:
  a) there are genuinely *exactly two* dialects of Chinese [written] vocabulary, and users of each of these map *exactly* to users of each of the character systems
or:
  b) we want to create a system that can store any number of related languages/dialects in one database: e.g. all the Scandinavian languages [and are prepared to support *more than two* versions of Chinese]
then:
  we need to worry about differences in vocabulary/usage
else:
  we can use Mark's simplified system as a special case for converting between Traditional and Simplified Chinese.
I could of course be entirely wrong, but that's how I read the situation.
wikipedia-l@lists.wikimedia.org