Hi all,
I seem to recall that a long, long time ago MediaWiki was using UTF-8 internally but storing the data in 'latin1' fields in MySQL.
I notice that there is now the option to use either 'utf8' or 'binary' columns (via the $wgDBmysql5 setting), and the default appears to be 'binary'.[1]
I've come across an old project which followed MediaWiki's lead (literally - it cites MediaWiki as the reason) and stores its UTF-8 data in latin1 tables. I need to upgrade it to a more modern data infrastructure, but I'm hesitant to simply switch to 'utf8' without understanding the reasons for this initial implementation decision.
Can anyone confirm that MediaWiki used to behave in this manner, and if so why?
If it was due to MySQL bugs, does anyone know in what version these were fixed?
Finally, is current best-practice to use 'binary' or 'utf-8' for this? Why does MediaWiki make this configurable?
I have a very good understanding of character encodings and have no problems with performing whatever migrations are necessary - and the code itself is fully utf-8 compliant except for the database layer - but I'm just trying to understand the design choices (or technical limitations) that resulted in MediaWiki handling character encodings in this manner.
- Mark Clements (HappyDog)
I thought MediaWiki, by default, stored data as binary blobs, rather than something of a particular encoding?
On May 2, 2017 at 10:11:38 AM, Mark Clements (HappyDog) ( gmane@kennel17.co.uk) wrote:
[...]
[1] https://www.mediawiki.org/wiki/Manual:$wgDBmysql5
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Tue, May 2, 2017 at 7:10 PM, Mark Clements (HappyDog) < gmane@kennel17.co.uk> wrote:
I seem to recall that a long, long time ago MediaWiki was using UTF-8 internally but storing the data in 'latin1' fields in MySQL.
Indeed. See $wgLegacyEncoding (https://www.mediawiki.org/wiki/Manual:$wgLegacyEncoding), and T128149 (https://phabricator.wikimedia.org/T128149) / T155529 (https://phabricator.wikimedia.org/T155529).
I notice that there is now the option to use either 'utf8' or 'binary' columns (via the $wgDBmysql5 setting), and the default appears to be 'binary'.[1]
I've come across an old project which followed MediaWiki's lead (literally - it cites MediaWiki as the reason) and stores its UTF-8 data in latin1 tables. I need to upgrade it to a more modern data infrastructure, but I'm hesitant to simply switch to 'utf8' without understanding the reasons for this initial implementation decision.
MySQL's utf8 uses at most three bytes per character (i.e. BMP only), so it's not a good idea to use it. utf8mb4 should work in theory. I think the only reason we don't use it is inertia (compatibility problems with old MySQL versions, lack of testing with MediaWiki, and the difficulty of migrating huge Wikimedia datasets).
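[Editorial note: the BMP limitation described above can be checked with a few lines of standalone Python; this illustration is not part of the original thread.]

```python
# UTF-8 byte lengths: BMP characters fit in 1-3 bytes, but astral
# characters (code points above U+FFFF, e.g. most emoji) need 4 bytes,
# which MySQL's 3-byte 'utf8' charset cannot store.
samples = {
    "A": "U+0041 (ASCII)",
    "é": "U+00E9 (Latin-1 range)",
    "中": "U+4E2D (BMP, CJK)",
    "💩": "U+1F4A9 (astral plane)",
}
for ch, desc in samples.items():
    encoded = ch.encode("utf-8")
    fits = "fits" if len(encoded) <= 3 else "does NOT fit"
    print(f"{desc}: {len(encoded)} bytes, {fits} in MySQL utf8")
```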
Mark,
On Tue, May 2, 2017 at 7:10 PM, Mark Clements (HappyDog) < gmane@kennel17.co.uk> wrote:
Hi all,
I seem to recall that a long, long time ago MediaWiki was using UTF-8 internally but storing the data in 'latin1' fields in MySQL.
I notice that there is now the option to use either 'utf8' or 'binary' columns (via the $wgDBmysql5 setting), and the default appears to be 'binary'.[1]
I can provide you with some general information about the MySQL side of things.
'utf8' in MySQL is 3-byte UTF-8; "real" UTF-8 is called utf8mb4 in MySQL. While this may sound silly, consider that emoji and characters beyond the Basic Multilingual Plane were probably more theoretical than practical 10-15 years ago, and variable-length string performance was not good in MySQL in those early versions.
I know there was some conversion pain in the past, but right now, in order to be as compatible as possible, binary collation is used almost everywhere on WMF servers (there may be some old text not yet converted, but this is true for most live data/metadata databases that I have seen). MediaWiki only requires MySQL 5.0, and using binary strings allows it to support collations and charsets that are only available in the latest MySQL/MariaDB versions.
In the latest discussions, there are proposals to raise the minimum MediaWiki requirements to MySQL/MariaDB 5.5 and to allow binary or utf8mb4 (not utf8, the 3-byte variant): https://phabricator.wikimedia.org/T161232. utf8mb4 should be enough for most uses (utf8 will not allow emoji, for example), although I am not up to date with the latest Unicode standard changes and the MySQL features supporting them.
I've come across an old project which followed MediaWiki's lead (literally - it cites MediaWiki as the reason) and stores its UTF-8 data in latin1 tables. I need to upgrade it to a more modern data infrastructure, but I'm hesitant to simply switch to 'utf8' without understanding the reasons for this initial implementation decision.
I strongly suggest going with utf8mb4 if MySQL >= 5.5, and binary only if you have special needs that utf8mb4 doesn't cover. InnoDB's variable-length performance has been "fixed" in the newest InnoDB versions, and utf8mb4 is the recommended default nowadays.
Cheers,
In the latest discussions, there are proposals to raise the minimum MediaWiki requirements to MySQL/MariaDB 5.5 and to allow binary or utf8mb4 (not utf8, the 3-byte variant): https://phabricator.wikimedia.org/T161232. utf8mb4 should be enough for most uses (utf8 will not allow emoji, for example), although I am not up to date with the latest Unicode standard changes and the MySQL features supporting them.
I don't know about MySQL, but in Unicode emoji are like any other astral characters, and UTF-8 can encode them in 4 bytes*.
* There are some technicalities in that some emoji are actually multiple code points that join together to render one glyph (for example, country flags take 2 code points to make a flag, for a total of 8 bytes, and there are some modifier characters that simply change the skin colour of an emoji, etc.). However, this should not affect anything at the DB layer, as they are just treated as multiple characters.
-- Brian
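[Editorial note: the footnote about multi-code-point emoji can be verified with a few lines of Python; this illustration is not part of the original thread.]

```python
# A country flag is two "regional indicator" code points that renderers
# join into one glyph; at the encoding layer it is just two characters,
# each taking 4 bytes in UTF-8 (8 bytes total).
flag = "\U0001F1FA\U0001F1F8"  # U+1F1FA + U+1F1F8 renders as one US flag
print(len(flag))                  # 2 code points
print(len(flag.encode("utf-8")))  # 8 bytes
for cp in flag:
    print(f"U+{ord(cp):04X} -> {cp.encode('utf-8').hex().upper()}")
```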
On Tue, May 2, 2017 at 9:24 PM, Brian Wolff bawolff@gmail.com wrote:
In the latest discussions, there are proposals to raise the minimum MediaWiki requirements to MySQL/MariaDB 5.5 and to allow binary or utf8mb4 (not utf8, the 3-byte variant): https://phabricator.wikimedia.org/T161232. utf8mb4 should be enough for most uses (utf8 will not allow emoji, for example), although I am not up to date with the latest Unicode standard changes and the MySQL features supporting them.
I don't know about MySQL, but in Unicode emoji are like any other astral characters, and UTF-8 can encode them in 4 bytes*.
I am sorry I wasn't clear before: MySQL's utf8 IS NOT the international standard generally known as UTF-8; it is a bastardization capped at 3 bytes per character. MySQL's utf8mb4 is real UTF-8:
Proof:
```
mysql> use test
Database changed
mysql> CREATE TABLE test (a char(1) CHARSET utf8, b char(1) CHARSET utf8mb4, c binary(4));
Query OK, 0 rows affected (0.02 sec)

mysql> SET NAMES utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into test VALUES ('\U+1F4A9', '\U+1F4A9', '\U+1F4A9');
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> SHOW WARNINGS;
+---------+------+--------------------------------------------------------------------+
| Level   | Code | Message                                                            |
+---------+------+--------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xF0\x9F\x92\xA9' for column 'a' at row 1 |
+---------+------+--------------------------------------------------------------------+
1 row in set (0.01 sec)

mysql> SELECT * FROM test;
+------+------+------+
| a    | b    | c    |
+------+------+------+
| ?    | 💩   | 💩   |   -- you will need an emoji-compatible font here
+------+------+------+
1 row in set (0.00 sec)

mysql> SELECT hex(a), hex(b), hex(c) FROM test;
+--------+----------+----------+
| hex(a) | hex(b)   | hex(c)   |
+--------+----------+----------+
| 3F     | F09F92A9 | F09F92A9 |
+--------+----------+----------+
1 row in set (0.00 sec)
```
To avoid truncations:
```
mysql> set sql_mode='TRADITIONAL';  -- https://phabricator.wikimedia.org/T108255
Query OK, 0 rows affected (0.00 sec)

mysql> insert into test VALUES ('\U+1F4A9', '\U+1F4A9', '\U+1F4A9');
ERROR 1366 (22007): Incorrect string value: '\xF0\x9F\x92\xA9' for column 'a' at row 1
```
More info at: https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html vs. https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html
On Tue, May 2, 2017 at 8:05 PM, Jaime Crespo jcrespo@wikimedia.org wrote:
[...]
Oh, my bad, I had misread your previous email. I thought you were talking about emoji in utf8mb4.
-- Brian
On 03/05/17 03:10, Mark Clements (HappyDog) wrote:
Can anyone confirm that MediaWiki used to behave in this manner, and if so why?
In MySQL 4.0, MySQL didn't really have character sets, it only had collations. Text was stored as 8-bit clean binary, and was only interpreted as a character sequence when compared to other text fields for collation purposes. There was no UTF-8 collation, so we stored UTF-8 text in text fields with the default (latin1) collation.
If it was due to MySQL bugs, does anyone know in what version these were fixed?
IIRC it was fixed in MySQL 4.1 with the introduction of proper character sets.
To migrate such a database, you need to do an ALTER TABLE to switch the relevant fields from latin1 to the "binary" character set. If you ALTER TABLE directly to utf8, you'll end up with "mojibake", since the text will be incorrectly interpreted as latin1 and converted to Unicode. This is unrecoverable; you have to restore from a backup if it happens.
I think it is possible to then do an ALTER TABLE to switch from binary to utf8, but it's been a while since I tested that.
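[Editorial note: the mojibake failure mode described above can be reproduced outside MySQL with a few lines of Python; this illustration is not part of the original thread.]

```python
# UTF-8 text mistakenly interpreted as latin1 and then "converted to
# Unicode" becomes double-encoded: each multi-byte sequence turns into
# garbage characters, which is what a direct latin1 -> utf8 ALTER does.
original = "café"
utf8_bytes = original.encode("utf-8")     # b'caf\xc3\xa9'
misread = utf8_bytes.decode("latin-1")    # 'cafÃ©' -- wrong interpretation
double_encoded = misread.encode("utf-8")  # b'caf\xc3\x83\xc2\xa9'
print(misread)          # the classic mojibake
print(double_encoded)   # two bytes per byte of the original sequence
```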
-- Tim Starling
On 2 May 2017 at 19:10, Mark Clements (HappyDog) gmane@kennel17.co.uk wrote:
Hi all,
I seem to recall that a long, long time ago MediaWiki was using UTF-8 internally but storing the data in 'latin1' fields in MySQL.
I remember an old thread from 2009:
https://lists.gt.net/wiki/wikitech/160875