[Mediawiki-l] Re: Re: accents not appearing correctly

Brion Vibber brion at pobox.com
Tue Feb 7 19:19:10 UTC 2006


muyuubyou wrote:
> Mr Vibber, pissed or not, it would be wise to reply to questions from
> users with more diplomacy, regardless of the tone used in the question
> in the first place. Granted you were told to be "speaking nonsense"
> when you were right, but instead of "you have no idea what you're
> talking about, é and japanese code are both the same for UTF8 " you
> could have said "actually, both of them have to be encoded properly in
> UTF8" and nothing would have happened. Please don't take this
> personal.

I'm sorry if I was a bit snappy.

> Is this the right list for suggestions? if so, please take my previous
> comment about UTF8 and UTF16 as a suggestion, please don't snip it
> out. Just by having UTF8 AND UTF16 things would improve. Sure it's a
> lot of work, but it's just a thing to consider for the future.

If you're using MySQL 4.1 or 5.0 and MediaWiki's experimental MySQL 5 mode, and
you know for sure that you aren't going to use compressed text storage, you
might be able to get away with changing text.old_text to a TEXT field type and
assigning it the ucs2 charset. This will store its data as UCS-2 instead of UTF-8.

You can do the same for any of the various name, comment, etc fields.

Unfortunately MySQL doesn't support UTF-16 at this time, and its UTF-8 storage
is also limited so that characters outside the basic multilingual plane (the
classic 16-bit range) can't be stored at all. Attempting to insert these
characters will cause the field to become truncated (in UTF-8) or just corrupt
the character (in UCS-2).
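
As a rough illustration of where that limit falls, here is a small Python sketch. The helper name `outside_bmp` is made up for this example and is not part of MySQL or MediaWiki. It lists the characters in a string that sit above U+FFFF, which are the ones that need 4 bytes in UTF-8 and a surrogate pair in UTF-16:

```python
def outside_bmp(text):
    """Return the characters in text whose code point is above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# U+20000 is a CJK Extension B ideograph, outside the basic multilingual plane.
sample = "é日\U00020000"

for ch in sample:
    # Code point, UTF-8 byte length, UTF-16 byte length (no BOM).
    print(f"U+{ord(ch):04X}", len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))

print(outside_bmp(sample))
```

A check like this could be run over incoming text before it is saved, to find out whether a given wiki's content would be affected at all.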

If MySQL supported it, my preference would be to use UTF-16 with 16-bit
collation for the non-bulk-text fields; that is, allow clean translation to/from
compliant UTF-8 but keep the indexes at 2 bytes per 16-bit code unit. This would
keep the size of the indexes down compared to their UTF-8 support (which currently
needs 3 bytes per character and would need 4 if they made it actually support
full UTF-8).

Index size directly relates to key caching and index scanning performance, so on
a large-scale setup that can be relevant. (Bulk text storage is much less
significant in this respect; individual records are picked out cheaply based on
an integer index lookup.)
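
To put numbers on the per-character sizes mentioned above, here is a back-of-the-envelope Python sketch. The 255-character column width and the function name `max_key_bytes` are assumptions chosen for illustration; this is plain arithmetic, not a measurement of any real index:

```python
# Worst-case bytes reserved per character under each storage scheme.
BYTES_PER_CHAR = {
    "utf8, BMP only (3-byte)": 3,
    "utf8, full range (4-byte)": 4,
    "16-bit units (UCS-2, or UTF-16 within the BMP)": 2,
}

TITLE_CHARS = 255  # assumed column width, for illustration only


def max_key_bytes(chars, bytes_per_char):
    """Worst-case size in bytes of one index key for a field of `chars` characters."""
    return chars * bytes_per_char


for name, width in BYTES_PER_CHAR.items():
    print(f"{name}: {max_key_bytes(TITLE_CHARS, width)} bytes per key")
```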

Alternatively, you could potentially whip up some kind of text storage handler
for MediaWiki that would convert the internal UTF-8 data into UTF-16 for storage
in the blob. I doubt this would be significantly pleasant though. :)
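
The conversion step itself is simple; a minimal Python sketch of the round trip is below. MediaWiki is written in PHP and has no such handler as far as this message says, so the function names `to_storage` and `from_storage` are hypothetical and only show the idea:

```python
def to_storage(utf8_bytes: bytes) -> bytes:
    """Convert internal UTF-8 bytes to UTF-16 (big-endian, no BOM) for the blob."""
    return utf8_bytes.decode("utf-8").encode("utf-16-be")


def from_storage(blob: bytes) -> bytes:
    """Convert a stored UTF-16 blob back to the UTF-8 used internally."""
    return blob.decode("utf-16-be").encode("utf-8")


for text in ("日本語", "plain ASCII wikitext"):
    internal = text.encode("utf-8")
    stored = to_storage(internal)
    # The round trip must give back exactly the bytes we started with.
    assert from_storage(stored) == internal
    print(text, len(internal), len(stored))
```

Note the trade-off this exposes: CJK text shrinks from 3 bytes to 2 per character, but ASCII text, including wiki markup, doubles in size.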

Using UTF-16 internally or for output isn't really possible.

> My issue with Firefox is happening in my installation but not in
> wikipedia. Not sure what it is, but I'll try to find out when I have
> more time.
> 
> The following only occurs with Chinese and Japanese text in page titles:
> 
> Basically when I pass the script an existing page, it opens it no
> problem in all browsers; but when I pass the script an unexisting one,
> it mangles the title only on firefox (don't have other mozilla
> browsers installed at the moment at home, must check it out with
> Mozilla, Seamonkey, Netscape...). Opera works just fine. IE and IE
> based ones too. It's probably some strange behavior from the
> browser... but then again it doesn't happen with Wikipedia. Just in
> case someone has any pointers.

Do you have an example? How is it mangled, exactly?

In what way are you passing the data?
* Typing on the URL bar
* From an <a href> link on a web page
* From a <form> on a web page

If in a URL or link, is the title:
* percent-encoded UTF-8 bytes, per RFC 3987
* percent-encoded bytes in some other encoding, such as EUC-JP or Shift-JIS
* raw typed text
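
To make the difference between those cases concrete, here is a short Python sketch using the standard library's `urllib.parse.quote`. The sample title is an arbitrary choice for illustration. The same three characters produce a different percent-encoded string under each encoding:

```python
from urllib.parse import quote

title = "日本語"  # arbitrary sample title

encoded = {enc: quote(title, encoding=enc) for enc in ("utf-8", "euc_jp", "shift_jis")}

for enc, value in encoded.items():
    print(f"{enc:10} {value}")
```

A server that expects UTF-8 and receives one of the other two forms will see bytes that are not valid UTF-8, which is one way a title can end up mangled.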

Current versions of IE are, I think, set to send unencoded characters in URLs as
percent-encoded UTF-8. Mozilla for some reason has left this option off, so
sometimes it'll send unencoded characters in <a href> links in the source page's
character set. I'm not sure what it'll do in the URL bar (locale encoding?) but
it seems to be happy to send UTF-8 from the URL bar on my Windows XP box if I
paste in some random Chinese text.

LanguageJa and LanguageZh don't set fallback encoding checks, so non-Unicode
encodings of Japanese and Chinese won't be detected or automatically converted.
(There are multiple character sets in use for these, making it extra difficult
compared with most European languages.)
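
For reference, the general shape of a fallback encoding check can be sketched in Python as below. This is not MediaWiki's code; the function name `decode_title` and the choice of EUC-JP as the single fallback are assumptions for illustration. With several legacy encodings in use, picking one fallback is exactly the hard part:

```python
from urllib.parse import quote, unquote_to_bytes


def decode_title(raw: str, fallback: str = "euc_jp") -> str:
    """Decode a percent-encoded title, trying UTF-8 first and then one fallback."""
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the single configured legacy encoding.
        return data.decode(fallback, errors="replace")


print(decode_title(quote("日本語", encoding="utf-8")))
print(decode_title(quote("日本語", encoding="euc_jp")))
# Shift-JIS input is misread here, because the fallback guesses EUC-JP.
print(decode_title(quote("日本語", encoding="shift_jis")))
```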

-- brion vibber (brion @ pobox.com)
