Steve Summit wrote:
Shinjiman and Happy Rabbit, I don't know about everybody else here, but I'd find this issue easier to think about if I understood a bit more about what you're trying to accomplish. Apologies for not knowing as much as I should about these issues.
My main question is this. Why do we need the extra mapping of the HTML lang tag? Why is it not adequate to set the value of $wgContLanguageCode appropriately for each wiki, so that it matches the language that the wiki's pages are written in?
The base problem is that various characters may appear differently when rendered in a Simplified Chinese font vs a Traditional Chinese font.
See http://en.wikipedia.org/wiki/Han_unification for some background on this issue.
Characters that differ semantically between the variants generally are assigned separate Unicode code points, but for characters that simply have different customary appearances the difference is left to the fonts.
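To make that distinction concrete, here is a small Python sketch (not anything from MediaWiki itself). The example characters are my own picks: the Simplified and Traditional forms of "country" are separately encoded, while the "bone" character is commonly cited as a single unified code point whose regional glyph shapes are left to the font.

```python
import unicodedata

# Simplified vs Traditional "country": two separately encoded characters.
simplified_guo = "\u56fd"
traditional_guo = "\u570b"

# "Bone": a single unified code point. Any regional difference in its
# appearance comes from the font, not from the encoded text.
unified_bone = "\u9aa8"

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for ch in (simplified_guo, traditional_guo, unified_bone):
    print(describe(ch))
```

The first two lines of output show distinct code points, so a converter can swap one for the other in the text. The third has nothing to swap; only the font choice (and hence the declared language) can change how it looks.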
Browsers will (or at least, may) pick appropriate default fonts based on the language specified in the web page.
Thus, for example, a page which is marked as "zh" (which usually defaults to Simplified Chinese) but which contains text written in Traditional forms may display oddly.
We have both Traditional and Simplified forms mixed on the Chinese-language Wikipedia. The wiki software is extended with two key abilities relevant to this:
1) When logged in, a user can select which language the user-interface text appears in. As well as completely different languages, they can choose a specific variant of Chinese (e.g., Traditional instead of Simplified).
2) A visitor can choose to have the text of a page automatically converted to their preferred variant (Simplified or Traditional) for more comfortable reading.
For a quick sample, here's the main page in default: http://zh.wikipedia.org/wiki/%E9%A6%96%E9%A1%B5
and here with all text converted to Simplified script forms: http://zh.wikipedia.org/w/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh-...
and here with all text converted to Traditional script forms: http://zh.wikipedia.org/w/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh-...
I don't know Chinese well enough to tell what's displaying right and what's displaying wrong, but depending on your browser and fonts situation, you might be able to see that some of the characters appear in different fonts.
At the moment, no matter which variant you pick, the HTML still declares its language to be "zh", just plain Chinese. Shinjiman indicates that this defaults to Simplified Chinese in browsers that pick fonts depending on the declared language code; thus the page may display incorrectly when the text is actually Traditional Chinese.
Now, it would appear to make reasonable sense to pick a more detailed language code when variant conversion is in use, so that browsers' font selections pick the matching variant. On this I concur completely.
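As a rough sketch of what I mean, in Python rather than the wiki's actual code: the variant keys, function names, and the choice of the BCP 47 script tags "zh-Hans" and "zh-Hant" are all assumptions for illustration, not what the software does today.

```python
from html import escape

# Hypothetical mapping from the selected conversion variant to the language
# tag declared in the page. Keys and values are illustrative assumptions.
VARIANT_TO_HTML_LANG = {
    "simplified": "zh-Hans",
    "traditional": "zh-Hant",
}

def html_lang_for(content_lang, variant=None):
    """Pick the language tag to declare for a rendered page.

    With no variant conversion in use, or an unrecognised variant,
    fall back to the wiki's plain content language code.
    """
    if variant is None:
        return content_lang
    return VARIANT_TO_HTML_LANG.get(variant, content_lang)

def open_html_tag(content_lang, variant=None):
    """Build the opening <html> tag with the chosen lang attribute."""
    lang = escape(html_lang_for(content_lang, variant), quote=True)
    return f'<html lang="{lang}">'

print(open_html_tag("zh"))
print(open_html_tag("zh", "traditional"))
```

The point of the fallback is that a page with no conversion applied keeps declaring plain "zh", exactly as now; only converted output gets the more specific code.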
Where we appear to have got bogged down is in how this relates to two things:
a) The user interface language selected in the user's wiki preferences.
b) The Accept-Language header sent from the browser.
As for a), it may make sense for a lot of user-interface text to appear with its own overriding lang attributes. This isn't necessarily easy in all cases, but if it's appropriate it's something we can talk about.
As for b), by itself it doesn't seem to make a lot of sense to me. First, picking a language code from the client would simply cause the declared language to fail to match the content language. Second, the more general issue of picking user-interface language based on the header is something we've provisionally rejected for a long time; non-default UI languages are often not fully customized and would often produce a sub-par or confusing user experience. Additionally, supporting it would require changes to the caching infrastructure and would increase server load by an unknown and potentially large amount.
b) appears to be entirely separate from the question at hand, so I'd prefer to leave it for another discussion.
-- brion vibber (brion @ pobox.com)