howard chen wrote:
> i would like to know if `wikipedia` is going to use this method in the future?
If MySQL fully supported UTF-8, we'd be happy to make use of it. Using proper character sets gives us warm, fuzzy feelings and makes it easier to work in terminals and other direct-database tools. It could also make it easier to use the database's built-in support for case-insensitive lookups and proper sorting.
BUT... at the moment MySQL only supports a subset of UTF-8 which corresponds to UCS-2, i.e. the lower 16 bits of Unicode's encoding space (code points up to U+FFFF, the Basic Multilingual Plane).
Thus characters from a number of scripts encoded outside of that range cannot be represented without resorting to storing raw UTF-8 in binary fields. Since we already have data outside that range, we don't plan to reduce our functionality to shoehorn it into a broken UTF-8 implementation.
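To make the limitation concrete, here is a minimal Python sketch (not MediaWiki code; the helper names `outside_bmp` and `utf8_len` are made up for illustration). It picks out the characters that fall above U+FFFF and shows that they are exactly the ones needing four bytes in UTF-8, which is what a 16-bit-limited implementation cannot hold:

```python
# Illustrative only: helper names are hypothetical, not part of MediaWiki or MySQL.
BMP_MAX = 0xFFFF  # highest code point representable in UCS-2


def outside_bmp(text):
    """Return the characters in text whose code point lies above U+FFFF."""
    return [ch for ch in text if ord(ch) > BMP_MAX]


def utf8_len(ch):
    """Return the number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Latin letter, accented Latin letter, CJK ideograph, Gothic letter (U+10330)
sample = "a\u00e9\u4e2d\U00010330"

for ch in sample:
    print(f"U+{ord(ch):04X} takes {utf8_len(ch)} byte(s) in UTF-8")

print("not representable in UCS-2:", [f"U+{ord(c):04X}" for c in outside_bmp(sample)])
```

Only the Gothic letter is flagged here; the other three fit in the 16-bit range and would survive a round trip through a UCS-2-limited column.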
(There is an experimental MediaWiki mode for using UTF-8 collation, but since the functionality in MySQL is incomplete it doesn't fully work. It's also not properly integrated with the updaters, so an experimental database in this mode may not properly update on version upgrades.)
We have expressed our interest in full UTF-8 support (or UTF-16 with proper conversion should do fine!) to MySQL, but as far as I know it's still not on the roadmap as of 5.2. Maybe some more lobbying is in order. ;)
Since the foreseeable future does not include full UTF-8 support in MySQL, when we upgrade to 5.0 or 5.1 we expect to continue using binary encoding.
We do plan to 'formalize' that a bit more, though, with proper binary charset/collation labeling on the fields, rather than the ad-hoc method we've used since 4.0. The experimental 'binary' schema in MediaWiki 1.9 or later can be used to try this out, but be warned that it's even more experimental than the UTF-8 one at the moment.
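For anyone wondering what storing raw UTF-8 in binary fields means in practice, here is a rough Python sketch. It only imitates byte-wise comparison on the client side, which is what a binary collation amounts to; it does not talk to MySQL, and `binary_key` and the sample titles are made up for illustration:

```python
# Illustrative only: mimics byte-wise (binary) comparison in plain Python.
def binary_key(s):
    """Return the raw UTF-8 bytes of s, as they would sit in a binary field."""
    return s.encode("utf-8")


# Sample titles, including one character outside the 16-bit range (U+10330).
titles = ["apple", "Zebra", "\u00e9clair", "\U00010330"]

# Any code point survives the round trip, since the bytes are stored untouched.
assert all(binary_key(t).decode("utf-8") == t for t in titles)

# Byte-wise ordering: uppercase ASCII sorts before lowercase, and accented
# letters sort after all of ASCII, so this is not a "proper" linguistic sort.
binary_sorted = sorted(titles, key=binary_key)
print(binary_sorted)

# Byte-wise equality is case-sensitive.
print(binary_key("Apple") == binary_key("apple"))
```

The trade-off is the one described above: every character is stored intact, but the database cannot do case-insensitive lookups or language-aware sorting on those fields for us.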
-- brion vibber (brion @ pobox.com)