Mark,
On Tue, May 2, 2017 at 7:10 PM, Mark Clements (HappyDog) < gmane@kennel17.co.uk> wrote:
Hi all,
I seem to recall that a long, long time ago MediaWiki was using UTF-8 internally but storing the data in 'latin1' fields in MySQL.
I notice that there is now the option to use either 'utf8' or 'binary' columns (via the $wgDBmysql5 setting), and the default appears to be 'binary'.[1]
I can give you some general information about the MySQL side of things.
'utf8' in MySQL is a 3-byte subset of UTF-8. "Real" UTF-8 is called utf8mb4 in MySQL. While this may sound silly, keep in mind that emoji and other characters beyond the Basic Multilingual Plane were probably more theoretical than practical 10-15 years ago, and variable-length string performance was not good in those early MySQL versions.
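To make the 3-byte limit concrete, here is a small Python sketch (not MediaWiki or MySQL code; the helper names are made up for illustration). It shows how many bytes a character needs in UTF-8 and whether a string stays within the Basic Multilingual Plane, which is what MySQL's 'utf8' can hold:

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8mb3(text: str) -> bool:
    """True if every code point is inside the Basic Multilingual Plane.

    Hypothetical helper: MySQL's 'utf8' stores at most 3 bytes per
    character, which corresponds to code points up to U+FFFF.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    for sample in ["a", "é", "€", "😀"]:
        print(repr(sample), utf8_len(sample), fits_mysql_utf8mb3(sample))
```

The emoji is the only sample that needs 4 bytes, so it is the only one that would require utf8mb4 (or a binary column).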
I know there was some conversion pain in the past, but right now, in order to be as compatible as possible, binary collation is used almost everywhere on WMF servers (there may be some old text that has not been converted, but this holds for most of the live data/metadata databases I have seen). MediaWiki only requires MySQL 5.0, and using binary strings allows it to support collations and charsets that are only available in the latest MySQL/MariaDB versions.
In the latest discussions, there are proposals to raise the minimum MediaWiki requirement to MySQL/MariaDB 5.5 and to allow binary or utf8mb4 (not utf8, the 3-byte variant): https://phabricator.wikimedia.org/T161232. utf8mb4 should be enough for most uses (utf8 will not store emoji, for example), although I am not up to date with the latest Unicode standard changes and the MySQL features supporting them.
I've come across an old project which followed MediaWiki's lead (literally - it cites MediaWiki as the reason) and stores its UTF-8 data in latin1 tables. I need to upgrade it to a more modern data infrastructure, but I'm hesitant to simply switch to 'utf8' without understanding the reasons for this initial implementation decision.
I strongly suggest going for utf8mb4 if you are on MySQL >= 5.5, and binary only if you have special needs that utf8mb4 doesn't cover. InnoDB variable-length performance has been "fixed" in the newest InnoDB versions, and it is the recommended default nowadays.
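One thing to watch when migrating a project that keeps UTF-8 bytes in latin1 columns: the bytes are fine, only the label is wrong, so anything that transcodes "from latin1" will double-encode them. A rough Python sketch of the effect follows (function names are hypothetical, and Python's 'latin-1' codec is used as a stand-in for MySQL's latin1, which is really closer to cp1252, so treat that as an assumption):

```python
def as_seen_through_latin1(text: str) -> str:
    """Simulate reading UTF-8 bytes from a column that is labelled latin1."""
    return text.encode("utf-8").decode("latin-1")


def recover(mislabelled: str) -> str:
    """Undo the mislabelling: take the raw bytes back and decode them as UTF-8."""
    return mislabelled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    original = "Zürich café"
    garbled = as_seen_through_latin1(original)
    print(garbled)           # ZÃ¼rich cafÃ©
    print(recover(garbled))  # Zürich café
```

The round trip is lossless because the underlying bytes never change, which is why such data is usually migrated by relabelling the bytes rather than by converting the text.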
Cheers,