On Tue, May 2, 2017 at 7:10 PM, Mark Clements (HappyDog) <gmane@kennel17.co.uk> wrote:
I seem to recall that a long, long time ago MediaWiki was using UTF-8 internally but storing the data in 'latin1' fields in MySQL.
Indeed. See $wgLegacyEncoding <https://www.mediawiki.org/wiki/Manual:$wgLegacyEncoding> (and T128149 <https://phabricator.wikimedia.org/T128149>, T155529 <https://phabricator.wikimedia.org/T155529>).
I notice that there is now the option to use either 'utf8' or 'binary' columns (via the $wgDBmysql5 setting), and the default appears to be 'binary'.[1]
I've come across an old project which followed MediaWiki's lead (literally - it cites MediaWiki as the reason) and stores its UTF-8 data in latin1 tables. I need to upgrade it to a more modern data infrastructure, but I'm hesitant to simply switch to 'utf8' without understanding the reasons for this initial implementation decision.
MySQL's utf8 uses at most three bytes per character (i.e. BMP only), so it's not a good idea to use it. utf8mb4 should work in theory. I think the only reason we don't use it is inertia (compatibility problems with old MySQL versions, lack of testing with MediaWiki, difficulty of migrating huge Wikimedia datasets).
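A quick way to see why the three-byte limit matters (a minimal sketch, not MediaWiki code): any character outside the Basic Multilingual Plane, e.g. most emoji, encodes to four bytes in UTF-8, which MySQL's legacy 'utf8' charset cannot store:

```python
# Characters inside the BMP (code points up to U+FFFF) encode to 1-3 bytes
# in UTF-8; characters outside it always need 4 bytes, which MySQL's
# three-byte 'utf8' (a.k.a. utf8mb3) rejects but utf8mb4 and binary accept.
bmp_char = "\u00e9"        # 'é', inside the BMP
astral_char = "\U0001F600" # grinning-face emoji, outside the BMP

print(len(bmp_char.encode("utf-8")))     # 2 bytes: fits in MySQL utf8
print(len(astral_char.encode("utf-8")))  # 4 bytes: needs utf8mb4 or binary
```

This is also why 'binary' columns sidestep the problem entirely: MySQL never inspects the bytes, so any valid UTF-8, including four-byte sequences, round-trips unchanged.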