Hi Toolserver users and admins,
There is a long-standing problem with non-Latin (Unicode) text in PHP tools on the Toolserver [1]. I'd like to summarize the situation and discuss how to improve it.
Here is a summary of the problem: several widely used PHP string functions, including strtoupper() and ucfirst(), are known to "corrupt" strings when used under a UTF-8 locale [2], which is the current setting on the Toolserver. These functions can mistake one byte of a multi-byte character sequence for a single-byte character. When such a byte is converted to upper or lower case, the resulting string is corrupted.
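To make the mechanism concrete, here is a small simulation. It does not call the real strtoupper(); it uses a hypothetical helper, bytewise_upper_latin1(), which uppercases each byte as if it were an ISO-8859-1 character. Whether the Toolserver's C library maps bytes exactly this way is an assumption on my part; the sketch only shows how changing one byte of a multi-byte sequence produces invalid UTF-8. It needs the mbstring extension for the validity check.

```php
<?php
// Simulation only: uppercase each byte as if it were an ISO-8859-1 character.
// This stands in for a byte-wise, locale-dependent case conversion.
function bytewise_upper_latin1(string $s): string
{
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        $isAsciiLower  = ($b >= 0x61 && $b <= 0x7A);                 // a-z
        $isLatin1Lower = ($b >= 0xE0 && $b <= 0xFE && $b !== 0xF7);  // à-þ, except ÷
        $out .= chr(($isAsciiLower || $isLatin1Lower) ? $b - 0x20 : $b);
    }
    return $out;
}

// "利用者" encoded as UTF-8 (written as byte escapes so the file encoding does not matter).
$original = "\xE5\x88\xA9\xE7\x94\xA8\xE8\x80\x85";
$damaged  = bytewise_upper_latin1($original);

echo bin2hex($original), "\n";   // lead bytes are e5, e7, e8
echo bin2hex($damaged), "\n";    // lead bytes have been shifted by 0x20
var_dump(mb_check_encoding($original, 'UTF-8'));
var_dump(mb_check_encoding($damaged, 'UTF-8'));
```

The lead byte 0xE5 of the first character looks like a lowercase letter in a single-byte encoding, so it gets "uppercased" to 0xC5, and the remaining bytes no longer form a valid sequence.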
This problem has been breaking a number of major tools on the Toolserver, including Vvv's sulutils and SoxRed93's edit counter. For example, the Chinese/Japanese string "利用者" (meaning "user") has no capitalized form. However, when it is passed to a tool that (I assume) uses ucfirst, the first character is converted into a non-existent character [3], and the result is meaningless. An incomplete list of the affected tools is available at [4]. See also TS-923 [1] for more details.
River suggested [1] solving it by migrating to multi-byte-aware functions such as mb_strtoupper [5], but I don't think that is an ideal solution on its own. I fully encourage the migration, but it will take time for all developers to fix their tools appropriately. I hope we can find a more fundamental, immediate solution.
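For tool authors who do migrate, a minimal sketch of a multi-byte-aware replacement for ucfirst() might look like the following. The helper name mb_ucfirst_sketch() is hypothetical (it is not a built-in), and the sketch assumes the mbstring extension is loaded and that all input is UTF-8.

```php
<?php
// Hypothetical helper: a multi-byte-aware stand-in for ucfirst().
// Assumes the mbstring extension is available and input is UTF-8 by default.
function mb_ucfirst_sketch(string $s, string $encoding = 'UTF-8'): string
{
    if ($s === '') {
        return $s;
    }
    // Split on character boundaries, not byte boundaries.
    $first = mb_substr($s, 0, 1, $encoding);
    $rest  = mb_substr($s, 1, null, $encoding);
    // Only the first character is case-mapped; characters without an
    // uppercase form (such as CJK ideographs) are left unchanged.
    return mb_strtoupper($first, $encoding) . $rest;
}

// "利用者" as UTF-8 byte escapes, so the file encoding does not matter.
$cjk = "\xE5\x88\xA9\xE7\x94\xA8\xE8\x80\x85";

var_dump(mb_ucfirst_sketch($cjk) === $cjk);
var_dump(mb_ucfirst_sketch('user'));
```

Because the string is split per character before case mapping, no individual byte of a multi-byte sequence is ever modified on its own.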
The timing of the reports of similar problems [4] suggests that there is an underlying common cause. The behavior of string processing seems to have changed in different programs almost simultaneously, somewhere around October 2010. The cause might be a side effect of some change to the PHP platform on the Toolserver, but I have no clue what it actually was. If someone could point out the original cause, it would be a great help toward a better solution.
Wikimedians and Toolserver users who work with multi-byte characters (including Arabic, Chinese, Korean and Japanese characters) have apparently been unhappy about this problem for more than half a year. I hope all the tools can work multilingually again.
Any comments or suggestions?
[1] https://jira.toolserver.org/browse/TS-923
[2] http://www.phpwact.org/php/i18n/utf-8
[3] http://toolserver.org/~vvv/sulutil.php?user=%E5%88%A9%E7%94%A8%E8%80%85
[4] https://jira.toolserver.org/secure/ManageLinks.jspa?id=24486
[5] http://php.net/manual/en/function.mb-strtoupper.php
Cheers, Whym -- http://toolserver.org/~whym/
Hi
Yusuke M wrote:
Here is the summary of the problem: Several widely-used string functions of PHP, including strtoupper() and ucfirst(), are known to "corrupt" strings when used under a UTF-8 locale [2], which is the current setting at the Toolserver. […]
River suggested [1] to solve it by migrating into multi-byte aware functions such as mb_strtoupper [5], but I think it's not an ideal solution. I'd totally encourage the migration too, but it would take time for all developers to fix their tools appropriately.
That would be the right solution. Just send a mail via toolserver-announce.
I hope we can have a more fundamental, instant solution.
The synchronization of reports of similar problems [4] suggests that there was an underlying common reason. The behavior of string processing seems to have changed in different programs almost simultaneously, somewhere around October 2010. The underlying reason might be a side-effect from some changes in the PHP platform on the Toolserver, but I don't have any clue what it really was. If someone could point out the original reason, it would be a great help to step forward to a better solution.
It may be connected with TS-852 [*] which was resolved on 2010-12-08.
Kind regards Giftpflanze (gifti)
On Fri, May 13, 2011 at 11:51 PM, Giftpflanze m.p.roppelt@web.de wrote:
The behavior of string processing seems to have changed in different programs almost simultaneously, somewhere around October 2010.
It may be connected with TS-852 [*] which was resolved on 2010-12-08.
TS-852 was a change made later than October 2010. There was a change to PHP involving setlocale() that could be related, though.
On Sat, May 14, 2011 at 7:28 AM, Platonides platonides@gmail.com wrote:
TS-852 was a change done later than October 2010. There was a change to php involving setlocale() that could be related, though.
Do you know the details of that change?
Is there a particular reason to use "en_US.UTF-8" rather than "C" for LC_CTYPE? Currently, setlocale(LC_CTYPE, "0") in PHP returns "en_US.UTF-8", which seems unreasonable. All the other categories are set to "C", as setlocale(LC_ALL, "0") returns "/en_US.UTF-8/C/C/C/C/C".
I hope we can change it to "C". As I mentioned in TS-923, with "C", ucfirst and strtoupper don't corrupt strings.
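Until the server-wide setting is decided, an individual tool could in principle check and override LC_CTYPE for its own process. This is only a sketch of that idea, a per-script workaround that is my assumption and not something proposed for the server configuration itself; it uses only the standard setlocale() call.

```php
<?php
// "0" asks setlocale() to report the current setting without changing it.
$before = setlocale(LC_CTYPE, "0");

// Switch only LC_CTYPE to "C" for the current process; other categories are untouched.
$after = setlocale(LC_CTYPE, "C");

echo "LC_CTYPE before: ", $before, "\n";
echo "LC_CTYPE after:  ", $after, "\n";

// "利用者" as UTF-8 byte escapes. With LC_CTYPE set to "C", bytes above 0x7F
// are not treated as letters, so ucfirst() leaves the string as it is.
$word = "\xE5\x88\xA9\xE7\x94\xA8\xE8\x80\x85";
var_dump(ucfirst($word) === $word);
```

With "C", case conversion applies to ASCII letters only, so non-ASCII text is passed through untouched rather than capitalized; tools that need real Unicode case mapping would still have to use the mb_* functions.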
- Whym
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/toolserver-l Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette