muyuubyou wrote:
Mr. Vibber, pissed off or not, it would be wise to reply to users' questions with more diplomacy, regardless of the tone of the question in the first place. Granted, you were told you were "speaking nonsense" when you were right, but instead of "you have no idea what you're talking about, é and Japanese code are both the same for UTF8" you could have said "actually, both of them have to be encoded properly in UTF8" and nothing would have happened. Please don't take this personally.
I'm sorry if I was a bit snappy.
Is this the right list for suggestions? If so, please take my previous comment about UTF8 and UTF16 as a suggestion; please don't snip it out. Just by having both UTF8 AND UTF16, things would improve. Sure, it's a lot of work, but it's something to consider for the future.
If you're using MySQL 4.1 or 5.0 and MediaWiki's experimental MySQL 5 mode, and you know for sure that you aren't going to use compressed text storage, you might be able to get away with changing text.old_text to a TEXT field type and assigning it the ucs2 charset. This will store its data as UCS-2 instead of UTF-8.
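Roughly, the change would look something like the snippet below. This is an untested sketch: it assumes a stock table layout with no $wgDBprefix set, and note that plain TEXT tops out at 64KB, so MEDIUMTEXT may be safer for long pages. Back up first.

    <?php
    // Untested sketch: switch the bulk text column to UCS-2 storage.
    // Assumes a default MediaWiki schema with no $wgDBprefix.
    $db = mysql_connect( 'localhost', 'wikiuser', 'secret' );
    mysql_select_db( 'wikidb', $db );
    // Existing UTF-8 data sitting in a binary blob won't be converted
    // automatically; you may need an intermediate pass through a utf8
    // TEXT type so MySQL actually reinterprets the bytes.
    mysql_query( "ALTER TABLE `text`
                  MODIFY old_text TEXT CHARACTER SET ucs2", $db );
    ?>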
You can do the same for any of the various name, comment, etc. fields.
Unfortunately, MySQL doesn't support UTF-16 at this time, and its UTF-8 storage is limited such that characters outside the Basic Multilingual Plane (the classic 16-bit range) can't be stored at all. Attempting to insert such characters will truncate the field (in UTF-8) or corrupt the character (in UCS-2).
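You can see both the size trade-off and the BMP ceiling from PHP. Here's a little sketch using iconv; the escapes are the UTF-8 bytes of é, 日, and U+20B9F, and iconv returns false on the last one because UCS-2 has no way to represent it:

    <?php
    // UTF-8 byte counts vs. UCS-2 for a Latin character, a BMP CJK
    // character, and a supplementary-plane character. iconv returns
    // false on the last one: UCS-2 has no code unit for it at all.
    $samples = array(
        'U+00E9 (e-acute)' => "\xC3\xA9",
        'U+65E5 (BMP CJK)' => "\xE6\x97\xA5",
        'U+20B9F (astral)' => "\xF0\xA0\xAE\x9F",
    );
    foreach ( $samples as $name => $utf8 ) {
        $ucs2 = @iconv( 'UTF-8', 'UCS-2BE', $utf8 );
        printf( "%s: UTF-8 %d bytes, UCS-2 %s\n",
            $name, strlen( $utf8 ),
            $ucs2 === false ? 'unrepresentable' : strlen( $ucs2 ) . ' bytes' );
    }
    ?>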
If MySQL supported it, my preference would be to use UTF-16 with 16-bit collation for the non-bulk-text fields; that is, allow clean translation to and from compliant UTF-8 but keep the indexes at 2 bytes per code unit. This would keep the size of the indexes down compared to MySQL's UTF-8 support (which currently needs 3 bytes per character, and would need 4 if they made it actually support full UTF-8).
Index size directly relates to key caching and index scanning performance, so on a large-scale setup that can be relevant. (Bulk text storage is much less significant in this respect; individual records are picked out cheaply based on an integer index lookup.)
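To put rough numbers on it: a fully-indexed 255-character field reserves 255 * 3 = 765 bytes per key in MySQL's current utf8 (and would reserve 255 * 4 = 1020 bytes with real full-range UTF-8), versus 255 * 2 = 510 bytes in ucs2 or a hypothetical UTF-16 type; that's a third less data in every key block that has to be cached or scanned.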
Alternatively, you could potentially whip up some kind of text storage handler for MediaWiki that would convert the internal UTF-8 data into UTF-16 for storage in the blob. I doubt this would be particularly pleasant, though. :)
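In outline it might be no more than a pair of conversion wrappers around whatever reads and writes the blob. The function names here are made up, not an existing MediaWiki interface:

    <?php
    // Hypothetical conversion wrappers; these names are made up and not
    // part of any MediaWiki interface. Everything above this layer keeps
    // speaking UTF-8; only the bytes that hit the blob change.
    function utf16BlobEncode( $utf8Text ) {
        return mb_convert_encoding( $utf8Text, 'UTF-16BE', 'UTF-8' );
    }
    function utf16BlobDecode( $blob ) {
        return mb_convert_encoding( $blob, 'UTF-8', 'UTF-16BE' );
    }
    ?>

Note that you'd only win on bulk CJK text anyway: UTF-16 doubles the size of ASCII, and wiki markup is largely ASCII.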
Using UTF-16 internally or for output isn't really possible: PHP's string handling is byte-oriented, and UTF-16 text is riddled with null bytes that byte-oriented code mishandles.
My issue with Firefox happens in my installation but not on Wikipedia. I'm not sure what it is, but I'll try to find out when I have more time.
The following only occurs with Chinese and Japanese text in page titles:
Basically, when I pass the script an existing page, it opens no problem in all browsers; but when I pass it a nonexistent one, the title gets mangled only in Firefox (I don't have other Mozilla-based browsers installed at home at the moment; I must check it with Mozilla, SeaMonkey, Netscape...). Opera works just fine, and IE and IE-based browsers too. It's probably some strange behavior from the browser... but then again it doesn't happen with Wikipedia. Just in case someone has any pointers.
Do you have an example? How is it mangled, exactly?
In what way are you passing the data?
* Typing in the URL bar
* From an <a href> link on a web page
* From a <form> on a web page
If in a URL or link, is the title:
* percent-encoded UTF-8 bytes, per RFC 3987
* percent-encoded bytes in some other encoding, such as EUC-JP or Shift-JIS
* raw typed text
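For comparison, here's how the same two-character title (日本) comes out in each of the percent-encoded forms. A quick sketch; rawurlencode operates on raw bytes, so the source encoding determines the output:

    <?php
    $utf8 = "\xE6\x97\xA5\xE6\x9C\xAC";  // the title as UTF-8 bytes
    echo rawurlencode( $utf8 ), "\n";
    // -> %E6%97%A5%E6%9C%AC
    echo rawurlencode( mb_convert_encoding( $utf8, 'EUC-JP', 'UTF-8' ) ), "\n";
    // -> %C6%FC%CB%DC
    echo rawurlencode( mb_convert_encoding( $utf8, 'SJIS', 'UTF-8' ) ), "\n";
    // -> %93%FA%96%7B
    ?>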
Current versions of IE are, I think, set to send unencoded characters in URLs as percent-encoded UTF-8. Mozilla for some reason has left this option off, so sometimes it'll send unencoded characters in <a href> links in the source page's character set. I'm not sure what it'll do in the URL bar (locale encoding?) but it seems to be happy to send UTF-8 from the URL bar on my Windows XP box if I paste in some random Chinese text.
LanguageJa and LanguageZh don't set fallback encoding checks, so non-Unicode encodings of Japanese and Chinese won't be detected or automatically converted. (There are multiple character sets in use for these languages, making it extra difficult compared with most European languages.)
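A fallback check could look something like the sketch below. This is a rough illustration using mbstring, not the actual Language class interface:

    <?php
    // Rough illustration only; not the actual Language class interface.
    // The third argument (strict) makes mb_detect_encoding reject byte
    // strings that aren't valid in a candidate encoding.
    function guessAndConvertTitle( $s ) {
        $enc = mb_detect_encoding( $s, array( 'UTF-8', 'EUC-JP', 'SJIS' ), true );
        if ( $enc === false ) {
            return false;  // none of the candidates matched
        }
        return mb_convert_encoding( $s, 'UTF-8', $enc );
    }
    ?>

The catch is the one noted above: many byte strings are valid in more than one of UTF-8, EUC-JP, and Shift-JIS, so a guess like this can silently pick the wrong one.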
-- brion vibber (brion @ pobox.com)