On 03/05/17 03:10, Mark Clements (HappyDog) wrote:
> Can anyone confirm that MediaWiki used to behave in this manner, and if so why?
MySQL 4.0 didn't really have character sets; it only had collations. Text was stored as 8-bit clean binary, and was only interpreted as a character sequence when compared to other text fields for collation purposes. There was no UTF-8 collation, so we stored UTF-8 text in text fields with the default (latin1) collation.
> If it was due to MySQL bugs, does anyone know in what version these were fixed?
IIRC it was fixed in MySQL 4.1 with the introduction of proper character sets.
To migrate such a database, you need to do an ALTER TABLE to switch the relevant fields from latin1 to the "binary" character set. If you ALTER TABLE directly to utf8, you'll end up with "mojibake", since the text will be incorrectly interpreted as latin1 and converted to Unicode. This is unrecoverable: if it happens, you have to restore from a backup.
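To illustrate the difference at the byte level, here is a rough Python simulation. It does not talk to MySQL at all. The function names are made up for this example, and Python's "latin-1" codec is only an approximation of MySQL's latin1.

```python
# Simulate the two conversion paths on bytes that are really UTF-8 but
# sit in a column labelled latin1. Both function names are hypothetical.

def direct_convert(stored: bytes) -> bytes:
    """Direct latin1 -> utf8: misread the bytes as latin1, re-encode as UTF-8."""
    return stored.decode("latin-1").encode("utf-8")


def via_binary(stored: bytes) -> bytes:
    """latin1 -> binary: the bytes are treated as opaque and left unchanged."""
    return stored


if __name__ == "__main__":
    stored = "café".encode("utf-8")                 # what MediaWiki wrote
    print(direct_convert(stored).decode("utf-8"))   # cafÃ©
    print(via_binary(stored).decode("utf-8"))       # café
```

In the direct path the two bytes of "é" are each treated as a separate latin1 character and encoded again, which is where the mojibake comes from.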
I think it is possible to then do an ALTER TABLE to switch from binary to utf8, but it's been a while since I tested that.
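As an untested sketch, the two-step sequence could be generated like this. The helper, the table and column names, and the BLOB/TEXT types are all placeholders; use the binary and text types that match your actual schema. Note that MODIFY replaces the whole column definition, so attributes such as NOT NULL or a default would need to be repeated.

```python
from typing import List


def migration_statements(table: str, column: str) -> List[str]:
    """Build the two ALTER TABLE statements for one column (hypothetical helper)."""
    return [
        # Step 1: latin1 -> binary, so the stored bytes are not transcoded.
        f"ALTER TABLE `{table}` MODIFY `{column}` BLOB;",
        # Step 2: binary -> utf8, so the same bytes are relabelled as UTF-8.
        f"ALTER TABLE `{table}` MODIFY `{column}` TEXT CHARACTER SET utf8;",
    ]


if __name__ == "__main__":
    # "my_table" and "my_text" are placeholder names.
    for statement in migration_statements("my_table", "my_text"):
        print(statement)
```

Try it on a copy of the database first, and keep the backup until you have checked the non-ASCII text.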
-- Tim Starling