[Toolserver-l] fixing problems regarding PHP's multi-byte string processing

Fri May 13 14:06:20 UTC 2011

Hi Toolserver users and admins,

We've seen a problem with non-Latin (Unicode) text in PHP [1], and it
is already a long-standing issue.  I'd like to summarize the situation
and discuss how to improve it.

Here is the summary of the problem: Several widely-used string
functions of PHP, including strtoupper() and ucfirst(), are known to
"corrupt" strings when used under a UTF-8 locale [2], which is the
current setting at the Toolserver.  The problem is that those
functions can incorrectly treat part of a multi-byte character
sequence as a single-byte character.  When those bytes are converted
to upper or lower case, the resulting string is corrupted.
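
To make the mechanism concrete, here is a minimal sketch that simulates
the failure on the bytes of "利用者".  The function
bytewise_ucfirst_latin1() is hypothetical: it is a stand-in that assumes
the first byte gets case-mapped as if it were an ISO-8859-1 character.
The real behavior of ucfirst() depends on the PHP version and locale, so
this illustrates the idea rather than reproducing the Toolserver setup.
It also assumes the mbstring extension is available for the validity
check.

```php
<?php
// Hypothetical stand-in for a locale-dependent, byte-wise ucfirst().
// Assumption: the first byte is case-mapped as an ISO-8859-1 character.
function bytewise_ucfirst_latin1($s) {
    if ($s === '') {
        return $s;
    }
    $b = ord($s[0]);
    $is_ascii_lower  = ($b >= 0x61 && $b <= 0x7A);               // a-z
    $is_latin1_lower = ($b >= 0xE0 && $b <= 0xFE && $b !== 0xF7); // à-þ, skipping the division sign
    if ($is_ascii_lower || $is_latin1_lower) {
        $s[0] = chr($b - 0x20); // upper case sits 0x20 below in both ranges
    }
    return $s;
}

$user   = "\xE5\x88\xA9\xE7\x94\xA8\xE8\x80\x85"; // "利用者" encoded as UTF-8
$broken = bytewise_ucfirst_latin1($user);

echo bin2hex($user), "\n";   // e588a9e794a8e88085
echo bin2hex($broken), "\n"; // c588a9e794a8e88085

// The lead byte 0xE5 became 0xC5, so the sequence is no longer valid UTF-8.
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
```

Only one byte changes, but it is the lead byte of a three-byte
character, so the whole character is lost.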

We've seen this problem break a number of major tools on the
Toolserver, including Vvv's sulutils and SoxRed93's edit counter.  For
example, a Chinese/Japanese string "利用者" (meaning "user") doesn't
have a capitalized form.  However, when it's passed to a tool which
(I assume) uses ucfirst, the first character is converted into a
non-existent character [3], and the result doesn't make sense.  An
incomplete list of the affected tools is available at [4].  See also
TS-923 [1] for more details.

River suggested [1] solving it by migrating to multi-byte aware
functions such as mb_strtoupper [5], but I think it's not an ideal
solution.  I'd fully encourage the migration too, but it would take
time for all developers to fix their tools appropriately.  I hope we
can have a more fundamental, immediate solution.
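
For tool authors who do want to migrate, a multi-byte aware replacement
for ucfirst() could look roughly like the sketch below.  The name
utf8_ucfirst() is my own (hypothetical), and I'm assuming the input is
UTF-8 and that the mbstring extension is loaded; only mb_substr(),
mb_strlen() and mb_strtoupper() [5] are real PHP functions here.

```php
<?php
// Hypothetical helper: upper-case the first *character* (not byte)
// of a UTF-8 string.  Assumes the mbstring extension is loaded.
function utf8_ucfirst($s) {
    $len   = mb_strlen($s, 'UTF-8');
    $first = mb_substr($s, 0, 1, 'UTF-8');    // first character, however many bytes it has
    $rest  = mb_substr($s, 1, $len, 'UTF-8'); // everything after the first character
    return mb_strtoupper($first, 'UTF-8') . $rest;
}

$user = "\xE5\x88\xA9\xE7\x94\xA8\xE8\x80\x85"; // "利用者" encoded as UTF-8

// Characters without an upper-case form pass through unchanged.
var_dump(utf8_ucfirst($user) === $user); // bool(true)

// ASCII input behaves like plain ucfirst().
echo utf8_ucfirst("user"), "\n"; // User
```

Passing the encoding explicitly avoids depending on the server-wide
mbstring.internal_encoding setting, which a tool author may not control.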

The timing of the reports of similar problems [4] suggests that
there was an underlying common cause.  The behavior of string
processing seems to have changed in different programs almost
simultaneously, somewhere around October 2010.  The underlying cause
might be a side effect of some change in the PHP platform on the
Toolserver, but I don't have any clue what it really was.  If someone
could point out the original cause, it would be a great help in moving
toward a better solution.

Wikimedians and Toolserver users using multi-byte characters
(including Arabic, Chinese, Korean and Japanese characters) have
apparently been unhappy about this problem for more than half a year.
I hope all the tools can (again) work properly across languages.

Any comments or suggestions?

[1] https://jira.toolserver.org/browse/TS-923
[2] http://www.phpwact.org/php/i18n/utf-8
[3] http://toolserver.org/~vvv/sulutil.php?user=%E5%88%A9%E7%94%A8%E8%80%85
[4] https://jira.toolserver.org/secure/ManageLinks.jspa?id=24486
[5] http://php.net/manual/en/function.mb-strtoupper.php

Cheers,
Whym
--
http://toolserver.org/~whym/