Surrogate pair hack - Wikitech-l

26 Oct 2005


MySQL 5.0 does not reject the surrogate characters between U+D800 and
U+DFFF. This means we can store characters above the BMP either by setting
the character set to UTF-8 and inserting CESU-8, or by setting the character
set to UCS-2 and inserting UTF-16.
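
To make the CESU-8 option concrete, here is a minimal sketch in Python (not MediaWiki code) of the UTF-8 to CESU-8 conversion. The function name utf8_to_cesu8 is hypothetical. Characters inside the BMP pass through unchanged; each character above the BMP is split into a UTF-16 surrogate pair, and each surrogate is then written as a three-byte sequence.

```python
def utf8_to_cesu8(data: bytes) -> bytes:
    """Convert well-formed UTF-8 to CESU-8 (hypothetical helper name)."""
    out = bytearray()
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters are encoded identically in UTF-8 and CESU-8.
            out += ch.encode("utf-8")
            continue
        # Split the supplementary character into a UTF-16 surrogate pair.
        cp -= 0x10000
        high = 0xD800 | (cp >> 10)
        low = 0xDC00 | (cp & 0x3FF)
        # Write each surrogate as if it were an ordinary 3-byte character.
        for unit in (high, low):
            out += bytes([
                0xE0 | (unit >> 12),
                0x80 | ((unit >> 6) & 0x3F),
                0x80 | (unit & 0x3F),
            ])
    return bytes(out)
```

With this sketch, U+10400 (four bytes F0 90 90 80 in UTF-8) becomes the six bytes ED A0 81 ED B0 80, which a UTF-8 column that does not reject surrogates would store as two three-byte characters.
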
It might be a hack, but at least it's a standard hack. After all, this is
exactly what UTF-16 and CESU-8 were designed for. According to a certain
online encyclopedia that we all know and love, the exact same problem is
observed in Oracle, and CESU-8 is used to solve it.
Implementing it in MediaWiki would be a similar task to implementing
PostgreSQL bytea support, which I was talking about in #mediawiki the other
day. We could convert from UTF-8 to CESU-8 in Database::strencode(), and
decode by altering the result rows returned by mysql_fetch_object() before
they are returned by Database::fetchObject().
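
The read side could look roughly like the following Python sketch, which stands in for the PHP that would sit in front of Database::fetchObject(). The names cesu8_to_utf8 and decode_row are hypothetical, and a row is modelled as a plain dict of column name to bytes (or None for NULL), which is an assumption made for illustration only.

```python
def cesu8_to_utf8(data: bytes) -> bytes:
    """Convert CESU-8 back to UTF-8 (hypothetical helper name)."""
    # "surrogatepass" lets the UTF-8 decoder accept encoded surrogates.
    text = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins each surrogate pair into one
    # supplementary character; an unpaired surrogate raises an error here.
    text = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return text.encode("utf-8")


def decode_row(row: dict) -> dict:
    """Return a copy of the row with every non-NULL value converted."""
    return {
        name: None if value is None else cesu8_to_utf8(value)
        for name, value in row.items()
    }
```

This version treats every column as text; binary columns would need to be excluded from the conversion.
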
Binary data would have to be flagged both on write and read. On write, we
could use a function other than strencode() where there is a need to
construct raw SQL, and for the wrapper functions we could pass binary data
in a blob object. When reading from the database, we could probably use
mysql_field_type() or mysql_field_flags() to detect binary columns and skip
the conversion accordingly.
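
The flagging scheme described above can be sketched as follows, again in Python rather than PHP. Everything here is hypothetical: the Blob wrapper, prepare_value, decode_row, and the binary_columns set, which stands in for whatever mysql_field_type() or mysql_field_flags() would report. The text conversion functions are passed in as parameters so the sketch stays independent of any particular CESU-8 implementation.

```python
class Blob:
    """Marks a value as binary so the write path leaves it untouched."""

    def __init__(self, data: bytes):
        self.data = data


def prepare_value(value, encode_text):
    """Write path: convert text, pass Blob contents through unchanged."""
    if isinstance(value, Blob):
        return value.data
    return encode_text(value)


def decode_row(row: dict, binary_columns: set, decode_text) -> dict:
    """Read path: skip NULLs and any column flagged as binary."""
    result = {}
    for name, value in row.items():
        if value is None or name in binary_columns:
            result[name] = value
        else:
            result[name] = decode_text(value)
    return result
```

A caller would wrap binary data as Blob(data) before handing it to the database wrapper, and would build binary_columns once per result set from the column metadata.
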
It'd probably be easiest to do the conversion offline, in the process of
switching to MySQL 5.0. We wouldn't be converting the bulk text, just the
metadata.
-- Tim Starling