Hi all,
I seem to recall that a long, long time ago MediaWiki was using UTF-8 internally but storing the data in 'latin1' fields in MySQL.
I notice that there is now the option to use either 'utf8' or 'binary' columns (via the $wgDBmysql5 setting), and the default appears to be 'binary'.[1]
I've come across an old project which followed MediaWiki's lead (literally - it cites MediaWiki as the reason) and stores its UTF-8 data in latin1 tables. I need to upgrade it to a more modern data infrastructure, but I'm hesitant to simply switch to 'utf8' without understanding the reasons for this initial implementation decision.
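For concreteness, here is a small Python sketch (my own illustration, not MediaWiki or project code) of why that arrangement is lossless at the byte level: latin1 maps every byte value 0-255 to a codepoint, so UTF-8 bytes pass through a latin1 decode/encode cycle unchanged, which is effectively what happens when UTF-8 data is written to and read back from a latin1 column over a latin1 connection with no server-side conversion.

```python
# Why UTF-8 data survives storage in latin1 columns: latin1 is a
# total byte-to-codepoint mapping, so any byte sequence round-trips
# through a latin1 decode/encode unchanged.
text = "naïve – résumé"
utf8_bytes = text.encode("utf-8")

# Roughly what happens on the way into and out of a latin1 column
# when the connection charset is also latin1 (no conversion applied):
stored = utf8_bytes.decode("latin-1")    # mojibake if displayed as-is...
retrieved = stored.encode("latin-1")     # ...but the bytes are intact

assert retrieved == utf8_bytes
assert retrieved.decode("utf-8") == text
```

The fragility, of course, is that the database believes the column holds latin1, so any operation that makes it convert (e.g. changing the connection charset) will corrupt the data.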
Can anyone confirm that MediaWiki used to behave in this manner, and if so why?
If it was due to MySQL bugs, does anyone know in what version these were fixed?
Finally, is current best practice to use 'binary' or 'utf8' for this? Why does MediaWiki make this configurable?
I have a very good understanding of character encodings and have no problem performing whatever migrations are necessary - and the code itself is fully UTF-8 compliant except for the database layer - but I'm trying to understand the design choices (or technical limitations) that led to MediaWiki handling character encodings in this manner.
- Mark Clements (HappyDog)