On 05/23/2013 11:31 PM, physik(a)physikerwelt.de wrote:
Hi,
I'm testing a new rendering option for the <math /> element and had
problems storing MathML elements in the database field
math_mathml, which is of type text.
The MathML elements contain a wide range of Unicode characters, like
INVISIBLE TIMES, which is encoded as 0xE2 0x81 0xA2 in UTF-8, or even 4-byte
characters like MATHEMATICAL BOLD CAPITAL A (0xF0 0x9D 0x90 0x80).
In some rare cases I had problems retrieving the stored value correctly from
MySQL.
To fix that problem I'm now using the PHP functions utf8_encode /
utf8_decode, which is not a very intuitive solution.
Do you know a better method to solve this issue without changing the
database layout?
Best
Physikerwelt
If you use MySQL, when you installed MediaWiki (or created the table),
did you choose the "UTF-8" option instead of "binary"? The underlying
MySQL character set is "utf8"[1], which does not support characters
above U+FFFF (four-byte characters).
This is mentioned in the web installer (message 'config-charset-help'):
"In binary mode, MediaWiki stores UTF-8 text to the database in binary
fields. This is more efficient than MySQL's UTF-8 mode, and allows
you to use the full range of Unicode characters. In UTF-8 mode, MySQL
will know what character set your data is in, and can present and
convert it appropriately, but it will not let you store characters
above the Basic Multilingual Plane[2]."
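To make the limit concrete, here is a minimal Python sketch (not MediaWiki code) that checks whether a string stays within the Basic Multilingual Plane and therefore within what the "utf8" character set can hold. The helper name fits_mysql_utf8 is made up for illustration; the two sample characters are the ones from your message.

```python
def fits_mysql_utf8(text: str) -> bool:
    """Hypothetical helper: True if every character is at or below U+FFFF,
    i.e. needs at most three bytes in UTF-8."""
    return all(ord(ch) <= 0xFFFF for ch in text)


invisible_times = "\u2062"   # INVISIBLE TIMES
bold_a = "\U0001D400"        # MATHEMATICAL BOLD CAPITAL A

for ch in (invisible_times, bold_a):
    encoded = ch.encode("utf-8")
    # Print the UTF-8 bytes, their count, and whether the character is in the BMP.
    print(encoded.hex(" "), len(encoded), fits_mysql_utf8(ch))
```

The first character is three bytes long and passes the check; the second is four bytes long and does not, which matches the cases where retrieval went wrong.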
MySQL 5.5 did introduce a new "utf8mb4" character set, which does
support four-byte characters; however, MediaWiki does not currently
support that option (now filed as bug 48767).
The WMF of course has to use the 'binary' option (actually, UTF-8 stored
in latin1 columns, as mentioned in bug 32217) to allow storage
of all sorts of obscure characters from different languages.
utf8_encode()/utf8_decode() work around the problem because they treat
the input as ISO-8859-1 and replace byte values 80 to FF with two-byte
characters from U+0080 to U+00FF (encoded as C2 80 to C3 BF), and the
'utf8' option does allow those characters.
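A rough Python emulation of that round trip, assuming the PHP functions behave as a plain ISO-8859-1 <-> UTF-8 conversion (the function names below are stand-ins, not the PHP implementations):

```python
def php_utf8_encode(raw: bytes) -> bytes:
    # Stand-in for PHP's utf8_encode(): read each byte as an ISO-8859-1
    # character and re-encode it as UTF-8.
    return raw.decode("latin-1").encode("utf-8")


def php_utf8_decode(data: bytes) -> bytes:
    # Stand-in for PHP's utf8_decode() on input produced by the function
    # above. (PHP substitutes '?' for characters outside ISO-8859-1;
    # this sketch would raise an error instead.)
    return data.decode("utf-8").encode("latin-1")


original = "\U0001D400".encode("utf-8")   # f0 9d 90 80
stored = php_utf8_encode(original)        # what ends up in the text column

print(original.hex(" "))
print(stored.hex(" "))
# Every character in the stored form is at or below U+00FF.
print(max(ord(ch) for ch in stored.decode("utf-8")) <= 0xFF)
print(php_utf8_decode(stored) == original)
```

The stored form is twice as long for these bytes, but it contains only characters the 'utf8' character set accepts, and decoding it gives back the original four bytes.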
[1]: https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8.html
[2]: http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes
--
Wikipedia user PleaseStand
http://en.wikipedia.org/wiki/User:PleaseStand