Neil Harris wrote:
Is it that MySQL 5 cannot support characters outside the BMP at all, or just that it can't collate them properly? If it just handles > BMP UTF-8 sequences as binary data, might it simply sort them in Unicode code point order?
Or does it do something worse, and actively convert the Unicode characters into a 16-bit range, thus nuking characters outside the BMP, rather than storing, and largely processing, them as binary-encoded data for purposes other than collating?
I tested this yesterday, hence my post. To summarize the results:
Using a literal UTF-8 4-byte character in an SQL statement, with the connection in 'SET NAMES utf8' mode:
* utf8 column: string is truncated at the problem character
* ucs2 column: "????" is stored in place of the problem character
* blob column: works just fine (but no collation)
Using pseudo-UTF-8 with UTF-16 surrogate pair halves individually encoded:
* utf8 column: works, but now we have bad encoding
* ucs2 column: works, but now we have bad encoding
* blob column: works, but now we have bad encoding
They won't be properly collated I'm sure, either.
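As a rough illustration of the "pseudo-UTF-8" form described above (each UTF-16 surrogate half encoded as its own 3-byte sequence, a scheme also known as CESU-8), here is a minimal Python sketch. The function names are made up for this example and it is not MediaWiki code; it assumes the input is well-formed Unicode text.

```python
def encode_surrogate_halves(text: str) -> bytes:
    """Encode text as UTF-8, except that each character outside the BMP
    is written as two 3-byte sequences, one per UTF-16 surrogate half."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)    # high (lead) surrogate
            low = 0xDC00 | (v & 0x3FF)   # low (trail) surrogate
            for half in (high, low):
                # "surrogatepass" lets Python emit bytes for a lone surrogate
                out += chr(half).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def decode_surrogate_halves(data: bytes) -> str:
    """Reverse the transformation: decode, then rejoin surrogate pairs
    by round-tripping through UTF-16."""
    halves = data.decode("utf-8", "surrogatepass")
    return halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
proper = clef.encode("utf-8")          # 4 bytes: F0 9D 84 9E
fake = encode_surrogate_halves(clef)   # 6 bytes: ED A0 B4 ED B4 9E
```

The 6-byte result contains only 3-byte sequences, which is why the utf8 column accepts it, but a strict UTF-8 decoder will reject it as invalid.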
In theory we could apply this transformation, but it would add a bunch of unnecessary and unreliable junk to the code. Automatically applying the transformation to all data could badly break binary storage (eg compressed text, the stuff we Really Don't Want To Lose).
If we apply it to page titles only, we might be able to get away with adding the transformation in eg the Title class:
* $title->getText() -> proper UTF-8, with spaces
* $title->getUrl() -> proper UTF-8, with underscores
* $title->getDbKey() -> fake UTF-8, with underscores
This of course means there's a nasssssty database dependency in the database-independent code, and could still break other things.
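To make the three accessors concrete, here is a small Python sketch of the idea. It is only an analogy: the real Title class is PHP, and `TitleSketch`, `to_db_bytes`, and the snake_case method names are hypothetical stand-ins invented for this example.

```python
import struct


def to_db_bytes(text: str) -> bytes:
    """Hypothetical helper: encode each UTF-16 code unit separately,
    so non-BMP characters become two 3-byte sequences (fake UTF-8)."""
    raw = text.encode("utf-16-le")
    units = struct.unpack("<%dH" % (len(raw) // 2), raw)
    return "".join(chr(u) for u in units).encode("utf-8", "surrogatepass")


class TitleSketch:
    """Illustrative stand-in for the Title class; not the actual API."""

    def __init__(self, text: str):
        self.text = text

    def get_text(self) -> bytes:
        # proper UTF-8, with spaces
        return self.text.encode("utf-8")

    def get_url(self) -> bytes:
        # proper UTF-8, with underscores
        return self.text.replace(" ", "_").encode("utf-8")

    def get_db_key(self) -> bytes:
        # fake UTF-8, with underscores: the only form that touches the database
        return to_db_bytes(self.text.replace(" ", "_"))


title = TitleSketch("Clef \U0001D11E")
```

Only `get_db_key()` differs from the proper encoding, and only for titles containing characters outside the BMP, which is exactly where the database dependency leaks into otherwise database-independent code.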
My preference, if possible, would be to get MySQL to fix their Unicode support to allow for either storage of full UTF-8 or proper transformation of UTF-8 to UTF-16. UCS-2 collation with UTF-16 conversion semantics would be "good enough" for us, I think, and avoids the 4-byte-per-character index bloat of extending the UTF-8 support.
-- brion vibber (brion @ pobox.com)