Hi,
I wonder how the UTF-8-Support in Mediawiki works and what valid combinations of database charsets and output charsets are.
As far as I understand in version 1.5 the default character set has changed to UTF-8. Therefore I suppose Mediawiki stores HTML-entities in the database per default (because Mysql 4.0 does not fully support UTF-8). Right?
Yesterday we tried to move a 1.5x MediaWiki to MySQL 4.1 (the server was upgraded and the wiki was unfortunately affected). We found a character set mess in the latin1 database, which we cleaned up with find/replace in the dump file. Now the database contains UTF-8 content, the table character set is UTF-8, and UTF-8 is used as the output charset. We also enabled the experimental MySQL 5 flag. Some parts of the pages work all right, some do not (e.g. page titles); the changelog file mentions this as a to-do.
Now it's broken and I would like to know which combination is supposed to work. Is this one a possible combination?
Database: MySQL 4.1
PHP: 5.1
Database charset: Latin1 (all content in the database is latin1)
Output charset: UTF-8
Thanks for any hint.
Regards
Dorthe
Dorthe Luebbert wrote:
I wonder how the UTF-8-Support in Mediawiki works and what valid combinations of database charsets and output charsets are.
As far as I understand in version 1.5 the default character set has changed to UTF-8.
The default has been UTF-8 for a long, long time. In some older versions (possibly as late as 1.3), a handful of European languages had to be installed in Latin-1, English defaulted to UTF-8 but could optionally be Latin-1, and every other language was UTF-8.
As of 1.4, UTF-8 was the default for all languages.
As of 1.5, Latin-1 is no longer supported.
Therefore I suppose Mediawiki stores HTML-entities in the database per default (because Mysql 4.0 does not fully support UTF-8). Right?
MySQL through 4.0 doesn't have native support for Unicode, so we just treat the fields as binary and store UTF-8 data in them directly.
MySQL 4.1 and later have somewhat fancier character set options, including some broken Unicode support. By default, MediaWiki continues to treat these versions the same way as 4.0 and earlier; data is chucked in and retrieved as raw UTF-8 without worrying about the server's character set configuration.
Generally this works fine, though sometimes you'll get surprises if you let MySQL do implicit character conversion based on what it _thinks_ your tables contain.
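To illustrate the kind of surprise meant here, the following is a minimal Python sketch, not MediaWiki or MySQL code. It assumes raw UTF-8 bytes sit in a column the server believes is latin1, and that the server then converts that column to UTF-8. Python's "latin-1" codec stands in for MySQL's latin1 (which may treat some bytes in the 0x80-0x9F range differently), and all function names are hypothetical.

```python
# Sketch: raw UTF-8 stored in a column labelled latin1, then "converted".
# Function names are made up for illustration; no database is involved.

def store_raw_utf8(text):
    """Bytes as the application writes them: plain UTF-8, no conversion."""
    return text.encode("utf-8")

def implicit_latin1_to_utf8(stored):
    """Simulate a conversion that trusts the latin1 label:
    every stored byte is read as one Latin-1 character and re-encoded."""
    return stored.decode("latin-1").encode("utf-8")

def undo_double_encoding(mangled):
    """Reverse the step above, assuming it was applied exactly once."""
    return mangled.decode("utf-8").encode("latin-1")

title = "M\u00fcller"                      # "Müller"
stored = store_raw_utf8(title)             # 7 bytes: the u-umlaut takes 2
mangled = implicit_latin1_to_utf8(stored)  # 9 bytes: each of those 2 is doubled

print(stored)
print(mangled.decode("utf-8"))             # what a reader of the page would see
print(undo_double_encoding(mangled) == stored)
```

As long as nothing converts the data, the bytes come back exactly as they went in, which is why the binary treatment generally works; the trouble starts only when a conversion step trusts the column's label.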
In current 1.5 releases you may optionally have the tables created with the UTF-8 character set explicitly set, and UTF-8 explicitly set on the db connection.
This may or may not be helpful for some people for some reason; but mostly it will:
* Make indexes larger (3 bytes per character)
* Cause failures if you use characters outside the BOM in page titles, usernames, etc.
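Both points can be checked with a small Python sketch. It assumes, as stated above, that the utf8 character set in these MySQL versions stores at most 3 bytes per character, which covers exactly the Basic Multilingual Plane (code points up to U+FFFF; "BOM" above is corrected to BMP in the follow-up message). The helper names and the 255-character column length are hypothetical, chosen only for illustration.

```python
# Sketch: which characters fit into a 3-byte-per-character UTF-8 column.
# Helper names and the column length are assumptions for illustration.

BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane

def utf8_length(ch):
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

def chars_outside_bmp(text):
    """Characters that need 4 bytes in UTF-8 and so exceed a 3-byte limit."""
    return [ch for ch in text if ord(ch) > BMP_MAX]

def fits_three_byte_utf8(text):
    """True if every character of text lies inside the BMP."""
    return not chars_outside_bmp(text)

ok_title = "Z\u00fcrich \u65e5\u672c"    # Latin and CJK, all inside the BMP
bad_title = "Clef \U0001D11E"            # U+1D11E lies outside the BMP

print(fits_three_byte_utf8(ok_title))    # True
print(fits_three_byte_utf8(bad_title))   # False
print(utf8_length("\U0001D11E"))         # 4

# Index size: space is reserved for the worst case of 3 bytes per character.
TITLE_CHARS = 255                        # hypothetical column length
print(TITLE_CHARS * 3)                   # 765
```

With the binary treatment described earlier, the same column only needs as many bytes as the data actually uses, and 4-byte characters pass through untouched.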
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
- Cause failures if you use characters outside the BOM in page titles,
usernames, etc.
That of course should be BMP, not BOM.
*needs more sleep*
-- brion vibber (brion @ pobox.com)
mediawiki-l@lists.wikimedia.org