On 05/23/2013 11:31 PM, physik@physikerwelt.de wrote:
Hi,
I'm testing a new rendering option for the <math /> element and had problems storing MathML elements in the database field math_mathml, which is of type text. The MathML elements contain a wide range of Unicode characters, like INVISIBLE TIMES, which is encoded as 0xE2 0x81 0xA2 in UTF-8, or even four-byte characters like MATHEMATICAL BOLD CAPITAL A (0xF0 0x9D 0x90 0x80). In some rare cases I had problems retrieving the stored value correctly from MySQL. To fix that problem I'm now using the PHP functions utf8_encode()/utf8_decode(), which is not a very intuitive solution. Do you know a better way to solve this issue without changing the database layout?
Best Physikerwelt
If you use MySQL, when you installed MediaWiki (or created the table), did you choose the "UTF-8" option instead of "binary"? With that option, the underlying MySQL character set is "utf8"[1], which does not support characters above U+FFFF (that is, characters that need four bytes in UTF-8).
This is mentioned in the web installer (message 'config-charset-help'):
"In binary mode, MediaWiki stores UTF-8 text to the database in binary fields. This is more efficient than MySQL's UTF-8 mode, and allows you to use the full range of Unicode characters. In UTF-8 mode, MySQL will know what character set your data is in, and can present and convert it appropriately, but it will not let you store characters above the Basic Multilingual Plane[2]."
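To see which characters in a given MathML string would fall outside the Basic Multilingual Plane, a quick check like the following may help. It is only a sketch in Python (not MediaWiki code); the helper name chars_outside_bmp is made up for illustration, and the two sample characters are the ones from the original question.

```python
from typing import List


def chars_outside_bmp(text: str) -> List[str]:
    """Return the characters above U+FFFF, i.e. those needing four bytes in UTF-8.

    Hypothetical helper for illustration; not part of MediaWiki.
    """
    return [ch for ch in text if ord(ch) > 0xFFFF]


INVISIBLE_TIMES = "\u2062"   # U+2062, inside the BMP
BOLD_A = "\U0001D400"        # U+1D400 MATHEMATICAL BOLD CAPITAL A, outside the BMP

# UTF-8 byte sequences, matching the values quoted in the question.
print(INVISIBLE_TIMES.encode("utf-8").hex(" "))  # e2 81 a2
print(BOLD_A.encode("utf-8").hex(" "))           # f0 9d 90 80

# Only the four-byte character is flagged.
sample = "a" + INVISIBLE_TIMES + BOLD_A
print([hex(ord(ch)) for ch in chars_outside_bmp(sample)])  # ['0x1d400']
```

Anything this reports is a character that a "utf8" column, as described above, would not accept as-is.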
MySQL 5.5 did introduce a new "utf8mb4" character set, which does support four-byte characters; however, MediaWiki does not currently support that option (now filed as bug 48767).
The WMF of course has to use the 'binary' option (actually, UTF-8 stored in latin1 columns, as mentioned in bug 32217) to allow storage of all sorts of obscure characters from different languages.
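utf8_encode()/utf8_decode() work around the problem because utf8_encode() treats its input as ISO-8859-1 and replaces each byte value from 80 to FF with the corresponding two-byte character from U+0080 to U+00FF (encoded as C2 80 to C3 BF), and the 'utf8' option does allow those characters. utf8_decode() reverses that mapping when the value is read back.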
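The effect of that workaround can be reproduced outside PHP. The sketch below approximates the two PHP functions in Python using the latin-1 codec; the function names php_utf8_encode and php_utf8_decode are made up for illustration, and the approximation is strict where PHP would substitute "?" for unrepresentable input.

```python
def php_utf8_encode(raw: bytes) -> bytes:
    """Approximate PHP utf8_encode(): read each byte as an ISO-8859-1
    character and re-encode the result as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


def php_utf8_decode(data: bytes) -> bytes:
    """Approximate PHP utf8_decode() for the round-trip case only.
    Raises on characters above U+00FF instead of emitting '?' as PHP does."""
    return data.decode("utf-8").encode("latin-1")


# UTF-8 bytes of MATHEMATICAL BOLD CAPITAL A (U+1D400).
original = b"\xf0\x9d\x90\x80"

stored = php_utf8_encode(original)
print(stored.hex(" "))  # c3 b0 c2 9d c2 90 c2 80

# Every character in the stored form is at most U+00FF, so it fits in the BMP.
print(max(ord(ch) for ch in stored.decode("utf-8")) <= 0xFF)  # True

# Decoding on retrieval restores the original bytes.
print(php_utf8_decode(stored) == original)  # True
```

Note the cost: each byte from 80 to FF becomes two bytes, so the four-byte character above is stored as eight bytes, and the stored value is no longer readable as the intended text by anything that does not apply the reverse step.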
[1]: https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8.html
[2]: http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes