Here's a hex dump of the offending section:
So, the wiki interface *is* displayed, but the *article text* displays as empty? That's different from what people usually report as "blank pages" or "empty pages", which is when PHP sends no output because a fatal error occurred and error display was not enabled.
Sorry, I thought I'd mentioned that this was only some pages, and that I'd traced it to preg_replace_callback...
Can you do a couple things?
No sweat.
- Ensure that the error reporting level is set to E_ALL, *including* E_NOTICE and E_WARNING. You don't want to miss such messages when debugging.
Already set. I still don't see any PHP messages.
- Save the string which fails to a file and provide it.
When I save the "text" field from the "text" table to a file using MySQL Query Browser, the offending section looks like this (the dodgy quotes are on either side of "gpedit.msc" on the third line):
53 74 61 72 74 20 4d 65 6e 75 20 67 6f 20 74 6f >Start Menu go to<
20 52 75 6e 20 74 79 70 65 20 c3 a2 e2 82 ac c5 > Run type ......<
93 67 70 65 64 69 74 2e 6d 73 63 c3 a2 e2 82 ac >.gpedit.msc.....<
3f 20 61 6e 64 20 70 72 65 73 73 20 45 6e 74 65 >? and press Ente<
72 0a 2a 20 4e 6f 77 20 61 20 47 72 6f 75 70 20 >r.* Now a Group <
The code sequences "c3 a2 e2 82 ac c5 93" and "c3 a2 e2 82 ac" are both valid UTF-8. (Does MySQL browser do code conversion when it saves to a file? I would hope not.)
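For what it's worth, here is a small Python sketch (not part of the wiki code) that decodes those two sequences. It is a guess consistent with the dump rather than a diagnosis: it assumes the original characters were the curly quotes U+201C and U+201D, and that at some point their UTF-8 bytes were read as Windows-1252 and encoded to UTF-8 a second time. The variable names are mine.

```python
# Bytes copied from the file dump. The 3f that follows the closing
# sequence in the dump is included as well.
opening = bytes.fromhex("c3 a2 e2 82 ac c5 93")
closing = bytes.fromhex("c3 a2 e2 82 ac 3f")

# Both decode cleanly as UTF-8, so the stored text is well formed.
opening_text = opening.decode("utf-8")  # U+00E2 U+20AC U+0153
closing_text = closing.decode("utf-8")  # U+00E2 U+20AC and a literal '?'

# Assumption: the original character was U+201C. Reading its UTF-8 bytes
# as Windows-1252 and encoding the result as UTF-8 again gives exactly
# the opening sequence stored in the file.
left_quote = "\u201c".encode("utf-8")  # e2 80 9c
double_encoded = left_quote.decode("cp1252").encode("utf-8")

# U+201D is e2 80 9d in UTF-8, and 9d has no mapping in Python's
# Windows-1252 codec. A converter that substitutes '?' for unmappable
# bytes would produce the closing sequence seen in the file.
try:
    "\u201d".encode("utf-8").decode("cp1252")
    has_mapping = True
except UnicodeDecodeError:
    has_mapping = False

# Going the other way (stored text encoded as Windows-1252) gives the
# byte strings a Latin-1-style connection would hand to the client.
opening_as_cp1252 = opening_text.encode("cp1252")
closing_as_cp1252 = closing_text.encode("cp1252")

print([f"U+{ord(c):04X}" for c in opening_text])
print([f"U+{ord(c):04X}" for c in closing_text])
print("double-encoded left quote matches file:", double_encoded == opening)
print("9d maps in cp1252:", has_mapping)
print(opening_as_cp1252.hex(" "), "/", closing_as_cp1252.hex(" "))
```

If that reading is right, the file contents are valid UTF-8 but already damaged, and the last two values are worth comparing against the bytes PHP actually receives.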
However, when I put a hex dump routine in matchAndRemove() in MagicWord.php, and dump the text which is being passed in, I get (ASCII printing chars displayed as ASCII, all others as hex):
20 R u n 20 t y p e 20 e2 80 9c g p e d i t . m s c e2 80 ? 20 a n d 20 p
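The routine I used was in PHP; the following is only a Python sketch of the same display convention (printable ASCII other than space shown as characters, everything else as two hex digits), so that anyone can reproduce the format. The function name `mixed_dump` is made up for this example.

```python
def mixed_dump(data: bytes) -> str:
    """Show printable, non-space ASCII as characters and all other bytes as hex."""
    return " ".join(
        chr(b) if 0x21 <= b <= 0x7E else f"{b:02x}"
        for b in data
    )


# The bytes from the dump above, written out as a Python byte string.
received = b" Run type \xe2\x80\x9cgpedit.msc\xe2\x80? and p"
print(mixed_dump(received))
```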
The closing sequence, "e2 80 ?" ("?" being 3f), in binary is: 11100010 10000000 00111111 which is invalid UTF-8: the first byte identifies this as a 3-byte sequence, but the final byte is not a "following" byte. In this case, preg_replace_callback returns an empty string.
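The bit-pattern argument can be checked mechanically. This is a Python sketch, not PHP, and `describe` and `is_valid_utf8` are names I made up; `describe` looks only at the leading bits of each byte and does not check overlong forms or surrogates, while `is_valid_utf8` relies on Python's strict UTF-8 decoder.

```python
def describe(byte: int) -> str:
    """Classify one byte by its leading bits, as UTF-8 defines them."""
    if byte & 0x80 == 0x00:
        return "ASCII (0xxxxxxx)"
    if byte & 0xC0 == 0x80:
        return "continuation byte (10xxxxxx)"
    if byte & 0xE0 == 0xC0:
        return "lead byte of a 2-byte sequence (110xxxxx)"
    if byte & 0xF0 == 0xE0:
        return "lead byte of a 3-byte sequence (1110xxxx)"
    if byte & 0xF8 == 0xF0:
        return "lead byte of a 4-byte sequence (11110xxx)"
    return "never valid in UTF-8"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


opening = bytes.fromhex("e2 80 9c")  # as received before "gpedit.msc"
closing = bytes.fromhex("e2 80 3f")  # as received after "gpedit.msc"

for b in closing:
    print(f"{b:02x}  {b:08b}  {describe(b)}")

print("opening valid:", is_valid_utf8(opening))
print("closing valid:", is_valid_utf8(closing))
```

The third byte of the closing sequence is classified as ASCII where a continuation byte is required, which is why the sequence as a whole is rejected.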
- Exact version of PHP, platform, etc?
* MediaWiki: 1.9.3
* PHP: 5.2.1 (apache2handler)
* MySQL: 5.0.27-community-nt
* Apache/2.2.4 (Win32)
* MS Windows Server 2003, enterprise ed., SP 1
- Details of whether mb_string and iconv modules are available in your PHP
How do I tell? There's nothing in PHP\ext for these, or in php.ini.
So I guess I see two issues:
* How does the text in the database get converted to bad UTF-8?
* Why does preg_replace_callback dump the entire input when it sees a bad UTF-8 character (with the "u" flag supplied in the pattern)? I can't see anything in the standard that would justify this.
Unicode 4.0.0, section 3.9, Unicode Encoding Forms, says:
If, on the other hand, a Unicode string purports to be in a Unicode encoding form, then it must contain only a well-formed code unit sequence. If there is an ill-formed code unit sequence in a source Unicode string, then a conformant process that verifies that the Unicode string is in a Unicode encoding form must reject the ill-formed code unit sequence.
But not the entire input, surely? So this looks like a PHP bug to me.
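To illustrate the distinction I mean, here is how one other decoder handles the same input. This is Python, not PHP or PCRE, so it says nothing about what preg_replace_callback is supposed to do; it only shows that rejecting the ill-formed sequence while keeping the rest of the text is a workable policy. The byte string is the one from my MagicWord.php dump.

```python
# The text as received in matchAndRemove(), per the dump earlier in this mail.
received = b" Run type \xe2\x80\x9cgpedit.msc\xe2\x80? and p"

# Strict decoding refuses the input, but reports where the problem is
# instead of silently returning nothing.
try:
    received.decode("utf-8")
    strict_ok = True
    bad_offset = None
except UnicodeDecodeError as err:
    strict_ok = False
    bad_offset = err.start  # index of the first byte of the bad sequence

# Lenient decoding substitutes U+FFFD for the ill-formed part only and
# keeps everything around it.
lenient = received.decode("utf-8", errors="replace")

print("strict decode succeeded:", strict_ok)
print("first bad byte at offset:", bad_offset)
print(ascii(lenient))
```

Either behaviour (an error that points at the bad bytes, or a replacement character in their place) would have made this much easier to track down than an empty result.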
Ian