I made a patch [0] for T39665 [1] about 6 months ago. It has been rotting in gerrit since.
The core bug is related to glibc's iconv implementation and PHP (and HHVM as well, I think). To work around the iconv bug I wrote a little helper function that uses mb_convert_encoding() instead when it is available. In review, PleaseStand pointed out that libmbfl, the library behind mb_convert_encoding(), differs from iconv in which character sets it supports and in how it names them [2].
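For readers who want something concrete to poke at, here is a minimal sketch of that kind of fallback. It is not the code from the Gerrit change: the function names findMbEncoding() and convertCharset() and the tiny alias table are hypothetical, made up for illustration. It only assumes the stock mbstring functions mb_list_encodings(), mb_encoding_aliases() and mb_convert_encoding().

```php
<?php
/**
 * Hypothetical helper: map an iconv-style charset name to a name that
 * mbstring recognizes, or return null if there is no match.
 */
function findMbEncoding( $charset ) {
    if ( !function_exists( 'mb_list_encodings' ) ) {
        return null; // mbstring is not loaded
    }
    // Illustrative only: a real table would have to be built by comparing
    // the iconv and libmbfl name lists.
    $extraAliases = array(
        'ISO_8859-1' => 'ISO-8859-1',
        'L1' => 'ISO-8859-1',
    );
    $name = strtoupper( $charset );
    if ( isset( $extraAliases[$name] ) ) {
        $name = $extraAliases[$name];
    }
    foreach ( mb_list_encodings() as $encoding ) {
        if ( strcasecmp( $encoding, $name ) === 0 ) {
            return $encoding;
        }
        // Also check the aliases mbstring itself knows for this encoding.
        $aliases = mb_encoding_aliases( $encoding );
        if ( is_array( $aliases ) ) {
            foreach ( $aliases as $alias ) {
                if ( strcasecmp( $alias, $name ) === 0 ) {
                    return $encoding;
                }
            }
        }
    }
    return null;
}

/**
 * Hypothetical helper: prefer mbstring when it knows both charsets,
 * otherwise fall back to iconv. Returns false if neither can convert.
 */
function convertCharset( $text, $from, $to = 'UTF-8' ) {
    $mbFrom = findMbEncoding( $from );
    $mbTo = findMbEncoding( $to );
    if ( $mbFrom !== null && $mbTo !== null ) {
        return mb_convert_encoding( $text, $mbTo, $mbFrom );
    }
    if ( function_exists( 'iconv' ) ) {
        // iconv() returns false (with a notice or warning) on failure.
        return @iconv( $from, $to, $text );
    }
    return false;
}

var_dump( bin2hex( convertCharset( "caf\xE9", 'latin1' ) ) );
```

The lookup walks every mbstring encoding on each call, so a real version would probably cache the result. The open question from the review stays open here too: anything iconv supports that libmbfl lacks still ends up in the iconv branch.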
I was hoping that someone on this list could step in and either convince me to abandon this patch and pretend I never investigated the problem or help design a solution that will plaster over these differences in a reasonable way.
[0]: https://gerrit.wikimedia.org/r/#/c/172101/
[1]: https://phabricator.wikimedia.org/T39665
[2]: https://php.net/manual/en/mbstring.encodings.php
Bryan
Hi!
Don't have a good solution, but some ideas:
1. There's http://php.net/manual/en/class.uconverter.php which uses the ICU converter. It can recognize tons of charsets/encodings (http://site.icu-project.org/charts/charset) and can filter out bad characters, though the way to achieve it may be a bit tricky. E.g.:
    <?php
    class MyConverter extends UConverter {
        // Reset the error and return nothing, so bad input is skipped.
        function toUCallback( $reason, $source, $codeUnits, &$error ) {
            $error = 0;
            return null;
        }
    }

    $c = new MyConverter( "UTF-8", "utf-8" );
    var_dump( $c->convert( "aa\xC3\xC3\xC3\xB8aa" ) );
(There might be a better way, it's just a quick-and-dirty example.) Con: while it's supported by HHVM, it needs PHP 5.5+. It can probably be backported quite easily.
2. recode - http://php.net/manual/en/ref.recode.php. This:

    var_dump( recode( "UTF-8..UTF-16,UTF16..UTF-8", "aa\xC3\xC3\xC3\xB8aa" ) );

seems to work fine. Not sure how well HHVM supports recode.
3. Patch libmbfl to add more aliases and the missing encodings. Shouldn't be very hard, though I'm not sure what the policy is about patching PHP/HHVM here.
4. Implement ezyang@php.net's suggestion for working around the glibc mess by adopting code from http://code.woboq.org/userspace/glibc/iconv/iconv_prog.c.html. Again, that would require a custom patch for PHP; not sure about HHVM.
wikitech-l@lists.wikimedia.org