Re: [Wikitech-l] iconv/mb_convert expert needed to advise on a patch for T39665

6 May 2015


Hi!
I don't have a good solution, but here are some ideas:
1. There's http://php.net/manual/en/class.uconverter.php, which uses the
ICU converter. It can recognize tons of charsets/encodings
(http://site.icu-project.org/charts/charset) and can filter out bad
characters, though the way to achieve that may be a bit tricky. E.g.:
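<?php
class MyConverter extends UConverter {
    // Invoked by ICU when it hits input it cannot decode.
    public function toUCallback( $reason, $source, $codeUnits, &$error ) {
        // Clear the error (0 is U_ZERO_ERROR) so the conversion continues.
        $error = 0;
        // Returning null emits nothing, i.e. the bad bytes are dropped.
        return null;
    }
}
$c = new MyConverter( "UTF-8", "utf-8" );
var_dump( $c->convert( "aa\xC3\xC3\xC3\xB8aa" ) );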
(There might be a better way; this is just a quick-and-dirty example.)
Con: while it's supported by HHVM, it requires PHP 5.5+. It could
probably be backported quite easily.
2. recode - http://php.net/manual/en/ref.recode.php.
This:
var_dump( recode("UTF-8..UTF-16,UTF16..UTF-8", "aa\xC3\xC3\xC3\xB8aa") );
seems to work fine. I'm not sure how well HHVM supports recode.
3. Patch libmbfl to add more aliases and the missing encodings. That
shouldn't be very hard, though I'm not sure what the policy is about
patching PHP/HHVM here.
4. Implement ezyang@php.net's suggestion for working around the glibc
mess by adopting code from
http://code.woboq.org/userspace/glibc/iconv/iconv_prog.c.html. Again,
that would require a custom patch for PHP; I'm not sure about HHVM.
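Since ideas 1 and 2 both depend on which extensions the runtime ships, a
caller would probably have to probe for them first. A rough sketch of that,
not tested against HHVM: the names scrubUtf8 and DroppingConverter are made
up for illustration, and the final mbstring fallback is an assumption, not
something proposed above.

```php
<?php
// Hypothetical helper: drop invalid byte sequences from a UTF-8 string,
// using whichever of the extensions mentioned above is available.

if ( class_exists( 'UConverter' ) ) {
    // Same trick as in idea 1: clear the error and emit nothing.
    class DroppingConverter extends UConverter {
        #[\ReturnTypeWillChange]
        public function toUCallback( $reason, $source, $codeUnits, &$error ) {
            $error = 0; // U_ZERO_ERROR
            return null;
        }
    }
}

function scrubUtf8( $text ) {
    // Idea 1: ICU via the intl extension (PHP 5.5+).
    if ( class_exists( 'DroppingConverter', false ) ) {
        $c = new DroppingConverter( 'UTF-8', 'UTF-8' );
        return $c->convert( $text );
    }
    // Idea 2: the recode extension, with the request string from above.
    if ( function_exists( 'recode_string' ) ) {
        return recode_string( 'UTF-8..UTF-16,UTF16..UTF-8', $text );
    }
    // Assumed fallback: mbstring, told to drop rather than substitute.
    if ( function_exists( 'mb_convert_encoding' ) ) {
        $old = mb_substitute_character();
        mb_substitute_character( 'none' );
        $out = mb_convert_encoding( $text, 'UTF-8', 'UTF-8' );
        mb_substitute_character( $old );
        return $out;
    }
    throw new RuntimeException( 'No usable charset conversion extension' );
}

var_dump( scrubUtf8( "aa\xC3\xC3\xC3\xB8aa" ) );
```

The probing order is arbitrary here; it just follows the order of the list
above.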
-- 
Stas Malyshev
smalyshev@wikimedia.org
