MySQL 5.0 does not reject the surrogate characters between U+D800 and U+DFFF. This means we can store characters above the BMP either by setting the character set to UTF-8 and inserting CESU-8, or by setting the character set to UCS-2 and inserting UTF-16.
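To make the CESU-8 option concrete, here is a minimal sketch in Python (not MediaWiki's PHP, purely for illustration). The function name `utf8_to_cesu8` is hypothetical. It splits each character above the BMP into its UTF-16 surrogate pair and writes each surrogate as an ordinary three-byte sequence, which is the form a surrogate-tolerant UTF-8 column would accept.

```python
def utf8_to_cesu8(data: bytes) -> bytes:
    """Re-encode well-formed UTF-8 bytes as CESU-8 (hypothetical helper)."""
    out = bytearray()
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters are byte-for-byte identical in UTF-8 and CESU-8.
            out += ch.encode("utf-8")
            continue
        # Split the supplementary code point into a UTF-16 surrogate pair.
        cp -= 0x10000
        high = 0xD800 | (cp >> 10)
        low = 0xDC00 | (cp & 0x3FF)
        # Write each surrogate as a three-byte sequence.
        for unit in (high, low):
            out += bytes((
                0xE0 | (unit >> 12),
                0x80 | ((unit >> 6) & 0x3F),
                0x80 | (unit & 0x3F),
            ))
    return bytes(out)
```

For example, U+10400 is F0 90 90 80 in UTF-8 and becomes the six bytes ED A0 81 ED B0 80, i.e. the surrogates U+D801 and U+DC00 encoded separately.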
It might be a hack, but at least it's a standard hack. After all, this is exactly what UTF-16 and CESU-8 were designed for. According to a certain online encyclopedia that we all know and love, the exact same problem is observed in Oracle, and CESU-8 is used to solve it.
Implementing it in MediaWiki would be a similar task to implementing PostgreSQL bytea support, which I was talking about in #mediawiki the other day. We could convert from UTF-8 to CESU-8 in Database::strencode(), and decode by altering the result rows returned by mysql_fetch_object() before they are returned by Database::fetchObject().
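The read side could look roughly like the following sketch, again in Python for illustration only. `cesu8_to_utf8` and `decode_row` are hypothetical names, and a plain dict stands in for the row object that Database::fetchObject() would return.

```python
def cesu8_to_utf8(data: bytes) -> bytes:
    """Convert CESU-8 bytes back to standard UTF-8 (hypothetical helper)."""
    # "surrogatepass" lets the UTF-8 decoder accept three-byte surrogates.
    text = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 merges each surrogate pair into a single
    # supplementary character; an unpaired surrogate raises an error here.
    merged = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return merged.encode("utf-8")


def decode_row(row: dict) -> dict:
    """Apply the conversion to every byte-string field of a result row."""
    return {
        name: cesu8_to_utf8(value) if isinstance(value, bytes) else value
        for name, value in row.items()
    }
```

Text that contains only BMP characters passes through unchanged, so the conversion is harmless for rows that never held surrogates.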
Binary data would have to be flagged both on write and read. On write, we could use a function other than strencode() where there is a need to construct raw SQL, and for the wrapper functions we could pass binary data in a blob object. When reading from the database, we could probably use mysql_field_type() or mysql_field_flags() to detect binary columns and skip the conversion accordingly.
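The flagging idea can be sketched as follows, in Python for illustration only. Everything here is an assumption: `Blob`, `encode_param`, `decode_row`, and the `binary_columns` set are hypothetical, with the set standing in for whatever mysql_field_type() or mysql_field_flags() would report about each column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Blob:
    """Hypothetical wrapper marking a value as binary, not text."""
    data: bytes


def to_cesu8(data: bytes) -> bytes:
    parts = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp > 0xFFFF:
            # Replace the character with its UTF-16 surrogate pair.
            cp -= 0x10000
            ch = chr(0xD800 | (cp >> 10)) + chr(0xDC00 | (cp & 0x3FF))
        parts.append(ch)
    return "".join(parts).encode("utf-8", "surrogatepass")


def from_cesu8(data: bytes) -> bytes:
    text = data.decode("utf-8", "surrogatepass")
    merged = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return merged.encode("utf-8")


def encode_param(value):
    """Write side: convert text, pass flagged binary data through as-is."""
    if isinstance(value, Blob):
        return value.data
    if isinstance(value, bytes):
        return to_cesu8(value)
    return value


def decode_row(row: dict, binary_columns: set) -> dict:
    """Read side: skip the conversion for columns reported as binary."""
    out = {}
    for name, value in row.items():
        if isinstance(value, bytes) and name not in binary_columns:
            value = from_cesu8(value)
        out[name] = value
    return out
```

Without the flag, arbitrary binary data would either fail to decode as UTF-8 or be silently altered, which is why both directions need to know which values are text.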
It'd probably be easiest to do conversion offline, in the process of switching to MySQL 5.0. We wouldn't be converting the bulk text, just the metadata.
-- Tim Starling
Tim Starling wrote:
MySQL 5.0 does not reject the surrogate characters between U+D800 and U+DFFF. This means we can store characters above the BMP either by setting the character set to UTF-8 and inserting CESU-8, or by setting the character set to UCS-2 and inserting UTF-16.
Sorry, I didn't realise that this subject has already been discussed on this list:
http://mail.wikipedia.org/pipermail/wikitech-l/2005-October/031977.html
Except Brion called it pseudo-UTF-8 instead of CESU-8.
-- Tim Starling
Tim Starling wrote:
Tim Starling wrote:
MySQL 5.0 does not reject the surrogate characters between U+D800 and U+DFFF. This means we can store characters above the BMP either by setting the character set to UTF-8 and inserting CESU-8, or by setting the character set to UCS-2 and inserting UTF-16.
Sorry, I didn't realise that this subject has already been discussed on this list:
But it sounds much more plausible when you say it. ;)
In all seriousness, if we do have to go that route, it looks like we already have most of the plumbing in place. For Oracle and PostgreSQL we're already adding special treatment for the binary data fields, which with some tweaking could distinguish 'text - for conversion' from 'data - leave as is' on SQL generation. If result sets come with the proper type information, then doing the conversion back should be easy and transparent.
"In theory." :)
-- brion vibber (brion @ pobox.com)