On Wed, Apr 13, 2011 at 9:42 AM, Daniel Barrett <danb(a)vistaprint.com> wrote:
I'm having a problem with a parser function that reads a string from a
database and displays it on a wiki page. The basic structure is this:
static function myParserFunction($parser, $arg) {
    $value = getFromDatabase();
    return $value;
}
Everything works great unless the value contains a multibyte character. In
this case, after the parser function returns its value, something else goes
wrong internal to MediaWiki and the ENTIRE content area of the article gets
blanked. No PHP errors appear in the logs.
PCRE (the regex library that powers preg_match() etc.) has for some versions
now rejected the entire subject string when it expects valid UTF-8 but
receives invalid byte sequences; functions such as preg_replace() then
return null instead of the processed text. Since there are many layers of
regexes in the parsing, a single bad byte tends to make content disappear in
exactly this way.
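The behaviour is easy to reproduce in isolation. A minimal sketch, assuming a Latin-1 sample string of my own choosing (it is not taken from the original report) and a pattern using the /u modifier, which is what puts PCRE into UTF-8 mode:

```php
<?php
// "caf\xE9" is "café" encoded as Latin-1; the lone 0xE9 byte is not valid UTF-8.
$latin1 = "caf\xE9";

// The /u modifier makes PCRE treat both pattern and subject as UTF-8.
$result = preg_replace('/a/u', 'A', $latin1);

var_dump($result);                                    // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)
```

No warning or error is raised, which would be consistent with nothing showing up in the PHP logs; preg_last_error() is the only place the failure is recorded.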
Make sure that the string you've obtained is correctly encoded as UTF-8. If
you're pulling it from a custom database connection, for instance, the
charset conversion might be wrong and you might be getting Latin-1 back
without realizing it. The string's invalid UTF-8 byte sequences then end up
in the middle of the article and break things later, during parsing.
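One way to check for this before returning the value, as a rough sketch only: ensureUtf8() is a hypothetical helper, not a MediaWiki function, it needs the mbstring extension, and it assumes that anything which is not already valid UTF-8 is Latin-1, which is just a guess about what the custom connection returns.

```php
<?php
// Hypothetical helper (not part of MediaWiki): return $value as valid UTF-8.
// Assumption: input that fails the UTF-8 check is ISO-8859-1 (Latin-1).
function ensureUtf8($value) {
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

$fromDb = "caf\xE9";           // Latin-1 bytes, as a misconfigured connection might return
$fixed  = ensureUtf8($fromDb);

var_dump(bin2hex($fixed));     // string(10) "636166c3a9"
```

In the parser function this would amount to returning ensureUtf8($value) instead of $value. If the check fails on real data, fixing the connection's charset setting is probably the better long-term fix than converting on every read.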
-- brion