In most cases, titles are converted by the same conversion system, i.e. by looking up the conversion table and calling strtr(). For weird situations, there is also support for manually specifying the title conversion inside the article body, using this syntax: -{T|zh-cn:foo; zh-tw: bar}-
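To make the two mechanisms above concrete, here is a rough Python sketch. It is not the MediaWiki code: the function names, the two-entry table, and the way the -{T|...}- rule is parsed are all assumptions made for illustration. The strtr() stand-in mimics the longest-match, no-rescan behaviour of PHP's array form.

```python
import re

# Hypothetical conversion table (zh-cn -> zh-tw); real tables are much larger.
CN_TO_TW = {"foo": "bar", "foobaz": "qux"}

def strtr(text, table):
    """Rough analogue of PHP strtr($text, $table): longest keys win,
    and already-replaced text is not scanned again."""
    keys = sorted((k for k in table if k), key=len, reverse=True)
    if not keys:
        return text
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], text)

# Assumed shape of the manual title rule: -{T|variant:title; variant:title}-
TITLE_RULE = re.compile(r"-\{T\|(.*?)\}-", re.S)

def manual_title(wikitext, variant):
    """Return the title given for `variant` in a -{T|...}- rule, or None."""
    match = TITLE_RULE.search(wikitext)
    if match is None:
        return None
    for part in match.group(1).split(";"):
        code, sep, value = part.partition(":")
        if sep and code.strip() == variant:
            return value.strip()
    return None

def display_title(title, wikitext, variant, table):
    """Prefer the manual rule; otherwise fall back to table conversion."""
    override = manual_title(wikitext, variant)
    return override if override is not None else strtr(title, table)
```

With this sketch, display_title("foo", "", "zh-tw", CN_TO_TW) falls back to the table, while an article body containing -{T|zh-cn:foo; zh-tw: bar}- takes the title from the rule instead.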
I am thinking about a more general solution: a database table of exceptions. And, more generally, some kind of interaction with Wiktionary.
Here is an example: Let's say that in the conversion table, "foo" in zh-cn is converted to "bar" in zh-tw and vice versa. Now someone writing in zh-cn creates an article titled "foo". When someone whose preferred variant is zh-tw views the article, "bar" will be shown as the article title. Further, say someone using zh-tw edits some article which has a link [[bar]]. The system will identify that the article "foo" should be used for the link, if "bar" has not already been created as a redirect.
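The link lookup in this example could be sketched roughly as follows. This is only an illustration under assumptions: resolve_link, the set of existing titles, and the one-entry table are hypothetical, and the real system works against the database rather than an in-memory set.

```python
import re

def convert(text, table):
    """Table-driven conversion: longest match first, no rescanning."""
    keys = sorted((k for k in table if k), key=len, reverse=True)
    if not keys:
        return text
    pattern = re.compile("|".join(map(re.escape, keys)))
    return pattern.sub(lambda m: table[m.group(0)], text)

def resolve_link(target, existing_titles, tables):
    """Pick the page a link should point to.

    An existing page with the exact title (for example a redirect) wins;
    otherwise each variant conversion of the title is tried in turn.
    Returns None when no candidate exists.
    """
    if target in existing_titles:
        return target
    for table in tables:
        candidate = convert(target, table)
        if candidate != target and candidate in existing_titles:
            return candidate
    return None

# Hypothetical data matching the example in the text.
TW_TO_CN = {"bar": "foo"}
```

Here resolve_link("bar", {"foo"}, [TW_TO_CN]) picks "foo", but once a page named "bar" exists, the link goes to "bar" itself.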
What do you keep in the database? Simplified, traditional, or both?
btw, you should be able to change the interface at zh after you register an account;)
I remember staring for a few minutes at the upper-left corner of MS Excel when I was trying to find "File" in Hebrew MS Office :) (it is in the upper-right corner). The situation with the Chinese interface is similar :)
I saw a couple of days ago that if I click on Traditional Chinese, I get an "ugly" link with the parameter "variant=zh-tw". Would it be possible for Simplified Chinese to have its URLs in Simplified Chinese and Traditional Chinese to have them in Traditional Chinese? Or to use a mod_rewrite redirection, so that:
http://zh.wikipedia.org/wiki/<something in Traditional Chinese> is shown, but http://zh.wikipedia.org/wiki/<something in Simplified Chinese>...?variant=zh-tw is what is actually read?
Indeed. I have always anticipated that the Chinese system could be generalized to other languages. Most of the code for the Chinese system is not specifically tied to the Chinese language, and some code refactoring can be done to provide better support for different languages. Please watch CVS HEAD over the next couple of weeks for this to happen.
I knew that Chinese has two scripts, but it had not occurred to me that the problems are similar to the Serbian ones :) Of course, I found that out a couple of months ago...
- Also, we should try to make the system clever: some formal and some statistical methods can help it recognize whether we should transliterate something or not (e.g. if the system finds non-Serbian Cyrillic letters, it should not transliterate the text into Latin, and vice versa).
That's certainly doable within the current system framework, but will require more specialized algorithms.
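The formal check from the suggestion (skip transliteration when non-Serbian Cyrillic letters appear) might look roughly like this. It is a minimal sketch, not code from the existing system: the function names are made up, and only the Cyrillic-to-Latin direction is covered.

```python
# The 30 letters of the Serbian Cyrillic alphabet (lower case).
SERBIAN_CYRILLIC = set("абвгдђежзијклљмнњопрстћуфхцчџш")

def is_cyrillic(ch):
    """True for characters in the basic Unicode Cyrillic block."""
    return "\u0400" <= ch <= "\u04ff"

def foreign_cyrillic_letters(text):
    """Cyrillic characters in `text` that Serbian does not use."""
    return {ch for ch in text
            if is_cyrillic(ch) and ch.lower() not in SERBIAN_CYRILLIC}

def should_transliterate(text):
    """Transliterate to Latin only if every Cyrillic letter is Serbian."""
    return not foreign_cyrillic_letters(text)
```

For example, "Россия" is rejected because of "я", while "Београд" passes. A word like "Москва" also passes, since all of its letters exist in Serbian, which is where the statistical methods would have to take over.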
Inside my extension to the pywikipedia bot (http://millosh.org/software/ltafos/pos/), I have a statistical guesser: the algorithm guesses the distance between two texts (something like the so-called edit distance, but stochastic, so it can compare whole texts in real time, not only words and phrases). I am using it to guess whether a page is in Serbian or not. However, it can be used (in future versions) for other kinds of stochastic guessing.
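The guesser itself is not described in detail here, so the following is not that algorithm. As a stand-in, it shows one common way to compare whole texts cheaply: cosine similarity between character-trigram counts. All names and the 0.3 threshold are assumptions chosen for illustration only.

```python
from collections import Counter
from math import sqrt

def trigram_profile(text):
    """Count overlapping character trigrams of the normalized text."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def similarity(a, b):
    """Cosine similarity of two trigram profiles, from 0.0 to 1.0."""
    pa, pb = trigram_profile(a), trigram_profile(b)
    dot = sum(count * pb[gram] for gram, count in pa.items())
    norm = (sqrt(sum(c * c for c in pa.values()))
            * sqrt(sum(c * c for c in pb.values())))
    return dot / norm if norm else 0.0

def looks_like(text, reference, threshold=0.3):
    """Guess whether `text` resembles `reference` (arbitrary threshold)."""
    return similarity(text, reference) >= threshold
```

In practice the reference would be a large sample of known Serbian text, and the threshold would have to be tuned on real pages.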