I'm having a problem with a parser function that reads a string from a database and displays it on a wiki page. The basic structure is this:
    static function myParserFunction($parser, $arg) {
        $value = getFromDatabase();
        return $value;
    }
Everything works great unless the value contains a multibyte character. In this case, after the parser function returns its value, something else goes wrong internal to MediaWiki and the ENTIRE content area of the article gets blanked. No PHP errors appear in the logs.
Here's another clue. If my parser function echoes the value instead of returning it:
    static function myParserFunction($parser, $arg) {
        $value = getFromDatabase();
        echo $value;
        return 'dummy';
    }
then the value appears without a problem in the browser. So the value is correctly read from the database, and correctly returned by the parser function, but it fouls up something downstream.
Does anyone know what's going on, or have suggestions on how to debug further?
Thanks, DanB
On Wed, Apr 13, 2011 at 9:42 AM, Daniel Barrett <danb@vistaprint.com> wrote:
> I'm having a problem with a parser function that reads a string from a database and displays it on a wiki page. The basic structure is this:
>
>     static function myParserFunction($parser, $arg) {
>         $value = getFromDatabase();
>         return $value;
>     }
>
> Everything works great unless the value contains a multibyte character. In this case, after the parser function returns its value, something else goes wrong internal to MediaWiki and the ENTIRE content area of the article gets blanked. No PHP errors appear in the logs.
PCRE (the regex library that powers preg_match() etc.) has, for some versions now, rejected the entire string when it's expecting valid UTF-8 but receives invalid byte sequences; functions like preg_replace() then return null instead of a result. Since there are many layers of regexes in the parsing, that tends to lead to stuff disappearing in this way.
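A minimal sketch of that failure mode, assuming plain PHP with the standard PCRE extension and outside of MediaWiki; the sample string is made up for illustration:

```php
<?php
// "\xE9" is "é" in Latin-1, but a lone 0xE9 byte is not valid UTF-8.
$latin1 = "caf\xE9";

// The /u modifier tells PCRE to treat pattern and subject as UTF-8.
$result = preg_replace('/f/u', 'F', $latin1);

var_dump($result);                                   // NULL, not a partial string
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

No PHP warning is raised here, which would be consistent with nothing showing up in the logs; the only trace is the value of preg_last_error().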
Make sure that the string you've obtained is correctly encoded as UTF-8. If you're pulling from some custom database connection, for instance, the charset conversions might be wrong, and you might be getting Latin-1 back from it without realizing it. That leads to breakage later, when the string's invalid UTF-8 byte sequences appear in the middle of the article during parsing.
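One way to check for this, sketched under the assumption that the mbstring extension is available and that any invalid input really is Latin-1 (ISO-8859-1); ensureUtf8() is a hypothetical helper name, not a MediaWiki function:

```php
<?php
// Hypothetical helper: return $value as valid UTF-8.
// Assumption: anything that is not valid UTF-8 is Latin-1 (ISO-8859-1).
function ensureUtf8($value) {
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, leave untouched
    }
    // Re-encode the Latin-1 bytes as UTF-8.
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}
```

In the parser function this would be used as `return ensureUtf8(getFromDatabase());`. If the mb_check_encoding() test fails on your data, the better long-term fix is probably to set the right charset on the database connection rather than to convert after the fact.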
-- brion
mediawiki-l@lists.wikimedia.org