[Mediawiki-l] Re: Re: accents not appearing correctly

muyuubyou muyuubyou at gmail.com
Wed Feb 8 20:44:51 UTC 2006


That was a very interesting update about Unicode support in MySQL. Thanks!

Well, about my little issue with Firefox: I just type in the URL bar, and no
cookie. Following links works (for instance, links to pages created with
Opera also work under Firefox).

For instance:

http://someIPhere/wiki/index.php/中文
I press enter, then the URL bar turns to:
http://someIPhere/wiki/index.php/%C3%96%C3%90%C3%8E%C3%84

, which is page "ÖÐÎÄ"

Ö is UTF16 => 00D6, UTF8 => C3 96

Those are 8 bytes there in the URL, 4 for each letter... that shouldn't be.

If I Google 中文, Google returns me this page:
http://www.google.com/search?q=%E4%B8%AD%E6%96%87
which looks more UTF8 to me. And works, too.


Been browsing the Unicode chart and the first character is 4e2d:
http://unicode.org/cgi-bin/GetUnihanData.pl?codepoint=4e2d
UTF8 => E4 B8 AD

The second one is 6587 (God, unicode.org is hell to browse this stuff)
which in turn is UTF8 => E6 96 87
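
As a quick cross-check of those two lookups, here is a minimal Python sketch
(the helper name utf8_hex is made up for illustration) that prints the UTF-8
bytes for each code point and the percent-encoded form of the whole title:

```python
from urllib.parse import quote


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of one character as space-separated hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


title = "\u4e2d\u6587"  # the two characters from the URL above

for ch in title:
    print(f"U+{ord(ch):04X} => {utf8_hex(ch)}")

# quote() percent-encodes non-ASCII text as UTF-8 by default
percent_encoded = quote(title)
print(percent_encoded)
```

The last line should match the query string in the Google URL above.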

I wonder what the browser is doing, because 中 has nothing to do with anything
starting with C3 96 in any encoding.
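
One guess, and it is only a guess: read as Latin-1, the four characters ÖÐÎÄ
are the bytes D6 D0 CE C4, and those happen to be the GB2312 bytes for 中文.
So maybe the title leaves the browser in a legacy Chinese encoding, and
something along the way treats those bytes as Latin-1 and converts them to
UTF-8. A minimal Python sketch of that assumed chain (I have not confirmed
that this is what Firefox or MediaWiki actually does):

```python
from urllib.parse import quote

title = "\u4e2d\u6587"  # the page title typed in the URL bar

# Assumption: the browser sends the title in a legacy encoding (GB2312 here).
legacy_bytes = title.encode("gb2312")

# Assumption: something then misreads those bytes as Latin-1 text...
misread = legacy_bytes.decode("latin-1")

# ...and the misread text ends up percent-encoded as UTF-8.
mangled = quote(misread)

print(" ".join(f"{b:02X}" for b in legacy_bytes))
print(misread)
print(mangled)
```

If the guess is right, the last line is the mangled URL path from above: 4
legacy bytes turned into 4 two-byte UTF-8 sequences, which would account for
the 8 bytes.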

Hope that helps somehow.

On 2/7/06, Brion Vibber <brion at pobox.com> wrote:
>
> muyuubyou wrote:
> > Mr Vibber, pissed or not, it would be wise to reply to questions from
> > users with more diplomacy, regardless of the tone used in the question
> > in the first place. Granted you were told to be "speaking nonsense"
> > when you were right, but instead of "you have no idea what you're
> > talking about, é and japanese code are both the same for UTF8 " you
> > could have said "actually, both of them have to be encoded properly in
> > UTF8" and nothing would have happened. Please don't take this
> > personally.
>
> I'm sorry if I was a bit snappy.
>
> > Is this the right list for suggestions? if so, please take my previous
> > comment about UTF8 and UTF16 as a suggestion, please don't snip it
> > out. Just by having UTF8 AND UTF16 things would improve. Sure it's a
> > lot of work, but it's just a thing to consider for the future.
>
> If you're using MySQL 4.1 or 5.0 and MediaWiki's experimental MySQL 5 mode,
> and you know for sure that you aren't going to use compressed text storage,
> you might be able to get away with changing text.old_text to a TEXT field
> type and assigning it the ucs2 charset. This will store its data as UCS-2
> instead of UTF-8.
>
> You can do the same for any of the various name, comment, etc fields.
>
> Unfortunately MySQL doesn't support UTF-16 at this time, and its UTF-8
> storage is also limited so that characters outside the basic multilingual
> plane (the classic 16-bit range) can't be stored at all. Attempting to
> insert these characters will cause the field to become truncated (in UTF-8)
> or just corrupt the character (in UCS-2).
>
> If MySQL supported it, my preference would be to use UTF-16 with 16-bit
> collation for the non-bulk-text fields; that is, allow clean translation
> to/from compliant UTF-8 but keep the indexes at 2 bytes per code point.
> This would keep the size of the indexes down compared to their UTF-8
> support (which currently needs 3 bytes per character and would need 4 if
> they made it actually support full UTF-8).
>
> Index size directly relates to key caching and index scanning performance,
> so on a large-scale setup that can be relevant. (Bulk text storage is much
> less significant in this respect; individual records are picked out cheaply
> based on an integer index lookup.)
>
> Alternatively, you could potentially whip up some kind of text storage
> handler for MediaWiki that would convert the internal UTF-8 data into
> UTF-16 for storage in the blob. I doubt this would be significantly
> pleasant though. :)
>
> Using UTF-16 internally or for output isn't really possible.
>
> > My issue with Firefox is happening in my installation but not in
> > wikipedia. Not sure what it is, but I'll try to find out when I have
> > more time.
> >
> > The following only occurs with Chinese and Japanese text in page titles:
> >
> > Basically when I pass the script an existing page, it opens it no
> > problem in all browsers; but when I pass the script a nonexistent one,
> > it mangles the title only on firefox (don't have other mozilla
> > browsers installed at the moment at home, must check it out with
> > Mozilla, Seamonkey, Netscape...). Opera works just fine. IE and IE
> > based ones too. It's probably some strange behavior from the
> > browser... but then again it doesn't happen with Wikipedia. Just in
> > case someone has any pointers.
>
> Do you have an example? How is it mangled, exactly?
>
> In what way are you passing the data?
> * Typing on the URL bar
> * From an <a href> link on a web page
> * From a <form> on a web page
>
> If in a URL or link, is the title:
> * percent-encoded UTF-8 bytes, per RFC 3987
> * percent-encoded bytes in some other encoding, such as EUC-JP or Shift-JIS
> * raw typed text
>
> Current versions of IE are, I think, set to send unencoded characters in
> URLs as percent-encoded UTF-8. Mozilla for some reason has left this option
> off, so sometimes it'll send unencoded characters in <a href> links in the
> source page's character set. I'm not sure what it'll do in the URL bar
> (locale encoding?) but it seems to be happy to send UTF-8 from the URL bar
> on my Windows XP box if I paste in some random Chinese text.
>
> LanguageJa and LanguageZh don't set fallback encoding checks, so
> non-Unicode encodings of Japanese and Chinese won't be detected or
> automatically converted. (There are multiple character sets in use for
> these, making it extra difficult compared with most European languages.)
>
> -- brion vibber (brion @ pobox.com)

