Hi!
I don't have a good solution, but here are some ideas:
1. There's http://php.net/manual/en/class.uconverter.php, which uses the
ICU converter. It can recognize tons of charsets/encodings
(http://site.icu-project.org/charts/charset) and can filter out bad
characters, though the way to achieve it may be a bit tricky. E.g.:
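<?php
class MyConverter extends UConverter {
    // Called while converting from the source charset to Unicode,
    // e.g. when illegal bytes are encountered.
    public function toUCallback($reason, $source, $codeUnits, &$error) {
        $error = 0;   // clear the error (U_ZERO_ERROR) so conversion continues
        return null;  // emit nothing, i.e. skip the offending bytes
    }
}
// Constructor arguments are (destination encoding, source encoding).
$c = new MyConverter("UTF-8", "utf-8");
var_dump($c->convert("aa\xC3\xC3\xC3\xB8aa"));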
(There might be a better way; this is just a quick-and-dirty example.)
Con: while HHVM supports it, it requires PHP 5.5+. It can probably be
backported quite easily.
2. recode - http://php.net/manual/en/ref.recode.php. This:
var_dump(recode("UTF-8..UTF-16,UTF16..UTF-8", "aa\xC3\xC3\xC3\xB8aa"));
seems to work fine. I'm not sure how well HHVM supports recode.
3. Patch libmbfl to add more aliases and missing encodings. It shouldn't
be very hard, though I'm not sure what the policy is about patching
PHP/HHVM here.
4. Implement ezyang(a)php.net's suggestion for working around the glibc
mess by adopting code from
http://code.woboq.org/userspace/glibc/iconv/iconv_prog.c.html. Again,
that would require a custom patch for PHP; not sure about HHVM.
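
To make idea 1 a bit more concrete, here is a minimal sketch of the same
UConverter trick wrapped in a reusable helper that takes the source
charset as a parameter. The names SkippingConverter and toCleanUtf8 are
made up for illustration, and the sketch assumes the intl extension is
loaded. It is written in the untyped PHP 5.5 style of the example above,
so newer PHP versions may want typed signatures:

```php
<?php
// Hypothetical helper, not an existing API: converts $text from the
// charset $from to UTF-8 and silently drops bytes that cannot be decoded.
class SkippingConverter extends UConverter {
    public function toUCallback($reason, $source, $codeUnits, &$error) {
        $error = U_ZERO_ERROR; // clear the error so conversion continues
        return null;           // emit nothing for the offending bytes
    }
}

function toCleanUtf8($text, $from = 'UTF-8') {
    // UConverter's constructor takes (destination encoding, source encoding).
    $converter = new SkippingConverter('UTF-8', $from);
    return $converter->convert($text);
}

// Same test string as in idea 1: valid text around some stray 0xC3 bytes.
var_dump(toCleanUtf8("aa\xC3\xC3\xC3\xB8aa"));
```

Passing any other ICU charset name as the second argument (for example
toCleanUtf8($text, 'windows-1251')) should reuse the same skipping
behaviour for other source encodings, which is the main attraction of
going through ICU here.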
--
Stas Malyshev
smalyshev(a)wikimedia.org