[Mediawiki-l] parser function + multibyte characters = blank article?

Brion Vibber brion at pobox.com
Wed Apr 13 16:52:30 UTC 2011


On Wed, Apr 13, 2011 at 9:42 AM, Daniel Barrett <danb at vistaprint.com> wrote:

> I'm having a problem with a parser function that reads a string from a
> database and displays it on a wiki page. The basic structure is this:
>
>       static function myParserFunction($parser, $arg) {
>         $value = getFromDatabase();
>         return $value;
>       }
>
> Everything works great unless the value contains a multibyte character. In
> this case, after the parser function returns its value, something else goes
> wrong internal to MediaWiki and the ENTIRE content area of the article gets
> blanked. No PHP errors appear in the logs.
>

PCRE (the regex library that powers preg_match() etc.) will, in recent
versions, drop the entire string and return null if it's expecting valid
UTF-8 but receives invalid byte sequences. Since there are many layers of
regexes in the parser, a single bad string tends to make content disappear
in exactly this way.
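
A minimal sketch of that failure mode, outside MediaWiki. The sample string
and the pattern are made up for illustration; the only assumption is a regex
using the /u (UTF-8) modifier, which is what makes PCRE validate its input:

```php
<?php
// "caf\xE9" is "café" encoded as Latin-1; the lone 0xE9 byte is not valid UTF-8.
$latin1 = "caf\xE9";

// With the /u modifier, PCRE checks that the subject is valid UTF-8 first.
$result = preg_replace('/\s+/u', ' ', "Intro text $latin1 more text");

var_dump($result);                                   // NULL: the whole string is gone
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

Note that no warning is raised, which would match the empty error logs; the
only trace is the value of preg_last_error().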

Make sure that the string you've obtained is correctly encoded as UTF-8. If
you're pulling from some custom database connection, for instance, the
charset conversion might be wrong and you might be getting Latin-1 back
without realizing it. The breakage then happens later, when the string's
invalid UTF-8 byte sequences appear in the middle of the article during
parsing.
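
One way to check this before returning the value from the parser function is
sketched below. It assumes the mbstring extension is available and that any
non-UTF-8 data really is Latin-1 (ISO-8859-1); ensureUtf8() is a hypothetical
helper name, not something MediaWiki provides:

```php
<?php
// Hypothetical helper (not part of MediaWiki): always returns valid UTF-8.
// Assumption: anything that is not already valid UTF-8 is Latin-1.
function ensureUtf8(string $value, string $assumedSource = 'ISO-8859-1'): string {
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid, leave untouched
    }
    return mb_convert_encoding($value, 'UTF-8', $assumedSource);
}

$fromDb = "caf\xE9";          // stand-in for a Latin-1 value read from the database
$clean  = ensureUtf8($fromDb);

var_dump(mb_check_encoding($clean, 'UTF-8')); // bool(true)
var_dump($clean === "caf\xC3\xA9");           // bool(true): é is now two bytes
```

Converting after the fact is a workaround; if the connection charset is the
real culprit, fixing it at the connection level is probably the cleaner fix.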

-- brion

