On Tue, May 2, 2017 at 8:05 PM, Jaime Crespo <jcrespo(a)wikimedia.org> wrote:
On Tue, May 2, 2017 at 9:24 PM, Brian Wolff
<bawolff(a)gmail.com> wrote:
.
On the latest discussions, there are proposals to increase the minimum
mediawiki requirements to MySQL/MariaDB 5.5 and allow binary or utf8mb4
(not utf8, 3 byte utf8),
https://phabricator.wikimedia.org/T161232.
Utf8mb4
should be enough for most uses (utf8 will not
allow for emojis, for
example), although I am not up to date with the latest unicode standard
changes and MySQL features supporting them.
I dont know about mysql, but in unicode emojis are like any other astral
character, and utf-8 can encode them in 4 bytes*.
I am sorry I wasn't clear before, MySQL's utf8 IS NOT international
standard generally known as UTF-8, it is a bastardization of 3-byte max
UTF-8. MySQL's utf8mb4 is UTF-8:
Proof:
```
mysql> use test
Database changed
mysql> CREATE TABLE test (a char(1) CHARSET utf8, b char(1) CHARSET
utf8mb4, c binary(4));
Query OK, 0 rows affected (0.02 sec)
mysql> SET NAMES utf8mb4;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into test VALUES ('\U+1F4A9', '\U+1F4A9',
'\U+1F4A9');
Query OK, 1 row affected, 1 warning (0.00 sec)
mysql> SHOW WARNINGS;
+---------+------+--------------------------------------------------------------------+
| Level | Code | Message
|
+---------+------+--------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xF0\x9F\x92\xA9' for column
'a' at row 1 |
+---------+------+--------------------------------------------------------------------+
1 row in set (0.01 sec)
mysql> SELECT * FROM test;
+------+------+------+
| a | b | c |
+------+------+------+
| ? | 💩 | 💩 | -- you will need an emoji-compatible font here
+------+------+------+
1 row in set (0.00 sec)
mysql> SELECT hex(a), hex(b), hex(c) FROM test;
+--------+----------+----------+
| hex(a) | hex(b) | hex(c) |
+--------+----------+----------+
| 3F | F09F92A9 | F09F92A9 |
+--------+----------+----------+
1 row in set (0.00 sec)
```
To avoid truncations:
```
mysql> set sql_mode='TRADITIONAL'; --
https://phabricator.wikimedia.org/T108255
Query OK, 0 rows affected (0.00 sec)
mysql> insert into test VALUES ('\U+1F4A9', '\U+1F4A9',
'\U+1F4A9');
ERROR 1366 (22007): Incorrect string value: '\xF0\x9F\x92\xA9' for column
'a' at row 1
```
More info at:
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html vs.
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html
--
Jaime Crespo
<http://wikimedia.org>
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Oh my bad, I had misread your previous email. I thought you were
talking about emoiji's in utf8mb4.
--
Brian