On Tue, May 2, 2017 at 8:05 PM, Jaime Crespo jcrespo@wikimedia.org wrote:
On Tue, May 2, 2017 at 9:24 PM, Brian Wolff bawolff@gmail.com wrote:
.
On the latest discussions, there are proposals to increase the minimum mediawiki requirements to MySQL/MariaDB 5.5 and allow binary or utf8mb4 (not utf8, 3 byte utf8), https://phabricator.wikimedia.org/T161232.
Utf8mb4
should be enough for most uses (utf8 will not allow for emojis, for example), although I am not up to date with the latest unicode standard changes and MySQL features supporting them.
I dont know about mysql, but in unicode emojis are like any other astral character, and utf-8 can encode them in 4 bytes*.
I am sorry I wasn't clear before, MySQL's utf8 IS NOT international standard generally known as UTF-8, it is a bastardization of 3-byte max UTF-8. MySQL's utf8mb4 is UTF-8:
Proof:
mysql> use test Database changed mysql> CREATE TABLE test (a char(1) CHARSET utf8, b char(1) CHARSET utf8mb4, c binary(4)); Query OK, 0 rows affected (0.02 sec) mysql> SET NAMES utf8mb4; Query OK, 0 rows affected (0.00 sec) mysql> insert into test VALUES ('\U+1F4A9', '\U+1F4A9', '\U+1F4A9'); Query OK, 1 row affected, 1 warning (0.00 sec) mysql> SHOW WARNINGS; +---------+------+--------------------------------------------------------------------+ | Level | Code | Message | +---------+------+--------------------------------------------------------------------+ | Warning | 1366 | Incorrect string value: '\xF0\x9F\x92\xA9' for column 'a' at row 1 | +---------+------+--------------------------------------------------------------------+ 1 row in set (0.01 sec) mysql> SELECT * FROM test; +------+------+------+ | a | b | c | +------+------+------+ | ? | 💩 | 💩 | -- you will need an emoji-compatible font here +------+------+------+ 1 row in set (0.00 sec) mysql> SELECT hex(a), hex(b), hex(c) FROM test; +--------+----------+----------+ | hex(a) | hex(b) | hex(c) | +--------+----------+----------+ | 3F | F09F92A9 | F09F92A9 | +--------+----------+----------+ 1 row in set (0.00 sec)
To avoid truncations:
mysql> set sql_mode='TRADITIONAL'; -- https://phabricator.wikimedia.org/T108255 Query OK, 0 rows affected (0.00 sec) mysql> insert into test VALUES ('\U+1F4A9', '\U+1F4A9', '\U+1F4A9'); ERROR 1366 (22007): Incorrect string value: '\xF0\x9F\x92\xA9' for column 'a' at row 1
More info at: https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html vs. https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html
-- Jaime Crespo http://wikimedia.org _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Oh my bad, I had misread your previous email. I thought you were talking about emoiji's in utf8mb4.
-- Brian