Dorthe Luebbert wrote:
I wonder how the UTF-8-Support in Mediawiki works and what valid combinations of database charsets and output charsets are.
As far as I understand in version 1.5 the default character set has changed to UTF-8.
The default has been UTF-8 for a long time. In some older versions (possibly as late as 1.3), a handful of European languages had to be installed in Latin-1, English defaulted to UTF-8 but could optionally be Latin-1, and every other language was UTF-8.
As of 1.4, UTF-8 was the default for all languages.
As of 1.5, Latin-1 is no longer supported.
Therefore I suppose MediaWiki stores HTML entities in the database by default (because MySQL 4.0 does not fully support UTF-8). Right?
MySQL through 4.0 doesn't have native support for Unicode, so we just treat the fields as binary and store UTF-8 data in them directly.
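A minimal sketch of what "treat the fields as binary" means in practice (the column and table names here are illustrative, not MediaWiki's actual schema): the application encodes text to UTF-8 bytes before writing and decodes after reading, so the database server never interprets or converts the data.

```python
# Sketch: storing UTF-8 text in a binary column, as described above.
# The server sees only opaque bytes; encoding and decoding happen
# entirely on the application side.
title = "Köln 東京"              # arbitrary Unicode text
raw = title.encode("utf-8")      # bytes to store in a BINARY/BLOB column
# ... INSERT raw into the table, later SELECT it back unchanged ...
roundtrip = raw.decode("utf-8")  # application-side decode on retrieval
assert roundtrip == title
```

Because the bytes pass through untouched, the round trip is lossless no matter what character set the server believes the connection or table uses.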
MySQL 4.1 and later have somewhat fancier character set options including some broken Unicode support. By default, MediaWiki continues to treat it as on 4.0 and earlier; data is chucked in and retrieved as raw UTF-8 without worrying about the server's character set configuration.
Generally this works fine, though sometimes you'll get surprises if you let MySQL do implicit character conversion based on what it _thinks_ your tables contain.
In current 1.5 releases you may optionally have the tables created with the UTF-8 character set declared explicitly, and UTF-8 set explicitly on the database connection.
This may or may not be helpful for some people for some reason; but mostly it will:
* Make indexes larger (3 bytes per character)
* Cause failures if you use characters outside the BMP (Basic Multilingual Plane) in page titles, usernames, etc.
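The BMP failure mode comes down to byte lengths: MySQL's UTF-8 support in this era only accepts sequences up to 3 bytes per character, which covers the BMP but not the supplementary planes. A quick check of the encodings:

```python
# Characters inside the BMP encode to at most 3 bytes in UTF-8;
# supplementary-plane characters need 4 bytes, which MySQL's 3-byte
# "utf8" charset rejects when the column is declared as UTF-8.
bmp_char = "€"       # U+20AC, inside the BMP
astral_char = "𝄞"    # U+1D11E, outside the BMP (musical G clef)
assert len(bmp_char.encode("utf-8")) == 3
assert len(astral_char.encode("utf-8")) == 4
```

With the binary-field approach, both round-trip fine, since the server never inspects the bytes; only the explicit UTF-8 declaration exposes the 3-byte limit.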
-- brion vibber (brion @ pobox.com)