Hi Toolserver users and admins,
PHP tools on the Toolserver have had a long-standing problem with non-Latin (Unicode) text [1]. I'd like to summarize the situation and discuss how we can improve it.
Here is a summary of the problem: several widely used PHP string functions, including strtoupper() and ucfirst(), are known to "corrupt" strings when used under a UTF-8 locale [2], which is the current setting at the Toolserver. These functions work byte by byte, so they can mistake one byte of a multi-byte character sequence for a single-byte character. When such a byte is converted to upper or lower case, the sequence it belonged to is no longer the original character, and the resulting string is corrupted.
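To make the mechanism concrete, here is a minimal sketch in Python (not PHP) that simulates a byte-wise ucfirst. The helper names are hypothetical, and the ISO-8859-1 case table is my assumption about what a single-byte conversion would do; the actual behavior on the Toolserver may differ.

```python
def latin1_toupper(byte: int) -> int:
    """Upper-case one byte as an ISO-8859-1 case table would (assumed table)."""
    if 0x61 <= byte <= 0x7A:                    # ASCII a-z
        return byte - 0x20
    if 0xE0 <= byte <= 0xFE and byte != 0xF7:   # accented lower-case letters, not the division sign
        return byte - 0x20
    return byte


def bytewise_ucfirst(data: bytes) -> bytes:
    """Upper-case only the first byte, with no knowledge of multi-byte sequences."""
    if not data:
        return data
    return bytes([latin1_toupper(data[0])]) + data[1:]


original = "利用者".encode("utf-8")    # e5 88 a9 e7 94 a8 e8 80 85
mangled = bytewise_ucfirst(original)   # the lead byte e5 is rewritten to c5

print(original.hex(" "))
print(mangled.hex(" "))
# The bytes no longer decode to the original first character.
print(mangled.decode("utf-8", errors="replace"))
```

Only one byte changes, but it is the lead byte of a three-byte sequence, so the first character is lost and the bytes that follow it are no longer valid UTF-8.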
This problem has broken a number of major tools on the Toolserver, including Vvv's sulutils and SoxRed93's edit counter. For example, the Chinese/Japanese string "利用者" (meaning "user") has no capitalized form. However, when it is passed to a tool that (I assume) uses ucfirst, the first character is converted into a non-existent character [3], and the result doesn't make sense. An incomplete list of the affected tools is available at [4]. See also TS-923 [1] for more details.
River suggested [1] solving it by migrating to multi-byte aware functions such as mb_strtoupper [5], but I don't think that is an ideal solution. I'd fully encourage the migration too, but it would take time for all developers to fix their tools appropriately. I hope we can find a more fundamental, immediate solution.
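For comparison, here is a rough Python sketch (again not PHP) of what a multi-byte aware conversion does differently: it decodes the bytes into characters first and only then changes case. The function name utf8_ucfirst is hypothetical and is not meant to mirror the exact behavior of the mb_* functions.

```python
def utf8_ucfirst(data: bytes) -> bytes:
    """Upper-case the first character (not the first byte) of UTF-8 text."""
    text = data.decode("utf-8")                  # bytes -> characters
    return (text[:1].upper() + text[1:]).encode("utf-8")


cjk = "利用者".encode("utf-8")
latin = "\u00e9lan".encode("utf-8")              # "élan", starting with a two-byte character

print(utf8_ucfirst(cjk).decode("utf-8"))
print(utf8_ucfirst(latin).decode("utf-8"))
```

A string with no capitalized form passes through unchanged, while an accented Latin letter is still capitalized as a whole character.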
The timing of the similar problem reports [4] suggests that there was an underlying common cause. The behavior of string processing seems to have changed in different programs almost simultaneously, somewhere around October 2010. The cause might be a side effect of some change in the PHP platform on the Toolserver, but I have no clue what it actually was. If someone could point out the original cause, it would be a great help toward a better solution.
Wikimedians and Toolserver users who work with multi-byte characters (including Arabic, Chinese, Korean and Japanese characters) have apparently been unhappy about this problem for more than half a year. I hope all the tools can (again) work multilingually.
Any comments or suggestions?
[1] https://jira.toolserver.org/browse/TS-923
[2] http://www.phpwact.org/php/i18n/utf-8
[3] http://toolserver.org/~vvv/sulutil.php?user=%E5%88%A9%E7%94%A8%E8%80%85
[4] https://jira.toolserver.org/secure/ManageLinks.jspa?id=24486
[5] http://php.net/manual/en/function.mb-strtoupper.php
Cheers, Whym -- http://toolserver.org/~whym/