Chad Perrin wrote:
PHP is supposedly planning to incorporate Python's ICU, which has some reasonable Unicode support for regexen, at some point in the future.
PHP already has Unicode regex support: PCRE has had it for some time, and PHP bundles PCRE. In fact, the simplest way to split a UTF-8 string into characters in PHP 4-5 without mbstring is preg_match_all('/./u',...). (Add the s modifier if the string may contain newlines, since "." does not match them by default.) MediaWiki uses this on occasion.
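As a minimal sketch of that trick: the function name utf8_split_chars is made up for illustration, and the s modifier is my addition so that newlines are kept. It assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters
// using PCRE's /u modifier, with no mbstring required.
function utf8_split_chars($text) {
    // /u makes "." match one UTF-8 character instead of one byte.
    // /s lets "." match newlines too, so none are dropped.
    // preg_match_all() returns false on failure; pass that through.
    if (preg_match_all('/./us', $text, $matches) === false) {
        return false;
    }
    return $matches[0];
}

// "\xC3\xAF" is the UTF-8 encoding of "ï" (U+00EF).
$chars = utf8_split_chars("na\xC3\xAFve");
print_r($chars);
```

Each element of the returned array is one character, which may be more than one byte long.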
In PHP 6, they are moving to a 16-bit character type (I'm not sure whether it's UTF-16 or UCS-2), with a distinct binary string type. If "unicode semantics" are enabled, string literals will be Unicode by default, and all the usual string operations will be character-wise. I dare say this will cause some backwards compatibility problems for applications such as MediaWiki.
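To illustrate the kind of code that would be affected, here is a small sketch of the byte-wise behaviour that PHP 4-5 applications currently rely on. It shows only today's behaviour; what the same calls would return under the planned character-wise semantics is an assumption on my part.

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$s = "caf\xC3\xA9";

// Byte-wise semantics (PHP 4-5 string functions): strlen() counts bytes.
$byteLength = strlen($s);                      // 5

// Character count, emulated with the PCRE /u trick.
$charLength = preg_match_all('/./us', $s, $m); // 4

// Byte-wise substr() can cut a multibyte character in half.
$truncated = substr($s, 0, 4);                 // "caf\xC3", not valid UTF-8

echo $byteLength, " bytes, ", $charLength, " characters\n";
```

Any code that treats string offsets and lengths as byte counts, for example when truncating to fit a database column, would see different numbers if the same functions started counting characters.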
PHP 6 requires ICU for its internal Unicode support, but I'm not sure to what extent they will be providing interfaces to ICU's more complex functions. Note that ICU is not "Python's ICU"; it's a library written by IBM, with native implementations in C, C++ and Java. There is a set of SWIG wrappers that bind the C++ API to Python.
-- Tim Starling