I've already got PHP errors on (we have no security worries).
What I reported was that preg_replace_callback() is returning an empty string (or null? Haven't had time to get much detail yet) when passed an input string containing possibly invalid UTF-8 characters. This happens when it is asked to do a replacement in UTF-8 mode (the "u" pattern modifier), even when it doesn't find any matches. A consequence of this in MW is that the parser converts the stored version of an article to an empty string.
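For what it's worth, here's a minimal sketch of what I think is happening (the byte sequence below is made up to stand in for whatever is actually in our article text, and the callback is a dummy):

<?php
// A truncated UTF-8 sequence: e2 80 followed by a literal "?" (0x3f),
// i.e. a three-byte lead byte whose final continuation byte is missing.
$text = "Run type \xe2\x80\x3f and press Enter";

function dummy_cb( $m ) { return $m[0]; }

// Note the pattern can't match anything in $text; only the "u" flag matters.
$result = preg_replace_callback( '/__NOSUCHWORD__/u', 'dummy_cb', $text );

var_dump( $result );                                   // NULL
var_dump( preg_last_error() == PREG_BAD_UTF8_ERROR );  // bool(true)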
I don't think memory is an issue - the text is only 2K, and we handle much larger articles no problem. Besides, as I reported earlier, simply turning off UTF matches in MagicWord stops it happening.
Unfortunately, I haven't had time to get a dump of exactly what's in the database, but I guess that there is an invalid UTF-8 sequence in there.
My question is: where is the bug? The Unicode standard requires the detection and suppression of invalid sequences, but is it legitimate for preg to discard the whole text?
Also, is this an issue with how the upgrade from 1.6 to 1.9 worked, or is it still possible to enter invalid sequences into an article? Seems like we either need better input filtering, or a database cleanup as part of the upgrade. (Or both.)
The problem is highly annoying, and I'm worried about other side-effects of my workaround. Feedback on that from people who understand MW internals would be most welcome.
I guess I should do more investigation, but I thought I would bounce it off the list first, in case there are known issues here. It'll have to wait in any case, as I'm going to be scrubbing out the bilge of my boat all weekend. Help with that would be most welcome too... ;-)
Ian
Ian Smith
Motorola | Good Technology Group
ismith@motorola.com
408-352-7467
4250 Burton Drive, Santa Clara, CA 95054
www.motorola.com/good
Sent from my pocket PC using Good Mobile Messaging
-----Original Message-----
From: Brion Vibber [mailto:brion@pobox.com]
Sent: Saturday, March 17, 2007 06:38 AM Pacific Standard Time
To: MediaWiki announcements and site admin list
Subject: Re: [Mediawiki-l] Possible PHP bug causes page blanking in 1.9.3
Ian Smith wrote:
Think you may have replied to the wrong email here... ;-)
No, but if you're getting blank pages, you probably want to look at the error messages instead of throwing them in /dev/null. :)
- -- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
Ian Smith wrote:
I've already got PHP errors on (we have no security worries).
What I reported was that preg_replace_callback() is returning an empty string (or null? Haven't had time to get much detail yet) when passed an input string containing possibly invalid UTF-8 characters. This happens when it is asked to do a replacement in UTF-8 mode (the "u" pattern modifier), even when it doesn't find any matches. A consequence of this in MW is that the parser converts the stored version of an article to an empty string.
So, the wiki interface *is* displayed, but the *article text* displays as empty? That's different from what people usually report as "blank pages" or "empty pages", which is when PHP sends no output because a fatal error occurred and error display was not enabled.
Can you do a couple things?
1) Ensure that the error reporting level is set to E_ALL, *including* E_NOTICE and E_WARNING. You don't want to miss such messages when debugging. (See the sketch after this list.)
2) Save the string which fails to a file and provide it.
3) Exact version of PHP, platform, etc?
4) Details of whether the mbstring and iconv modules are available in your PHP
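For 1), something like this in LocalSettings.php would do it (just a sketch; adjust to taste):

error_reporting( E_ALL );          // includes E_NOTICE and E_WARNING
ini_set( 'display_errors', 1 );    // show them in the browser while debugging
// or, if you'd rather keep them out of page output:
// ini_set( 'log_errors', 1 );
// ini_set( 'error_log', '/path/to/php-error.log' );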
- -- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
So, the wiki interface *is* displayed, but the *article text* displays as empty? That's different from what people usually report as "blank pages" or "empty pages", which is when PHP sends no output because a fatal error occurred and error display was not enabled.
Sorry, I thought I'd mentioned that this was only some pages, and that I'd traced it to preg_replace_callback...
Can you do a couple things?
No sweat.
- Ensure that the error reporting level is set to E_ALL, *including* E_NOTICE and E_WARNING. You don't want to miss such messages when debugging.
Already set. I still don't see any PHP messages.
- Save the string which fails to a file and provide it.
When I save the "text" field from the "text" table to a file using MySQL Query Browser, the offending section looks like this (the dodgy quotes are on either side of "gpedit.msc" on the third line):
53 74 61 72 74 20 4d 65 6e 75 20 67 6f 20 74 6f  >Start Menu go to<
20 52 75 6e 20 74 79 70 65 20 c3 a2 e2 82 ac c5  > Run type ......<
93 67 70 65 64 69 74 2e 6d 73 63 c3 a2 e2 82 ac  >.gpedit.msc.....<
3f 20 61 6e 64 20 70 72 65 73 73 20 45 6e 74 65  >? and press Ente<
72 0a 2a 20 4e 6f 77 20 61 20 47 72 6f 75 70 20  >r.* Now a Group <
The byte sequences "c3 a2 e2 82 ac c5 93" and "c3 a2 e2 82 ac" are both valid UTF-8; they decode to "â€œ" and "â€", which looks suspiciously like a curly quote that has been UTF-8-encoded twice. (Does MySQL Query Browser do character-set conversion when it saves to a file? I would hope not.)
However, when I put a hex dump routine in matchAndRemove() in MagicWord.php, and dump the text which is being passed in, I get (ASCII printing chars displayed as ASCII, all others as hex):
20 R u n 20 t y p e 20 e2 80 9c g p e d i t . m s c e2 80 ? 20 a n d 20 p
The closing sequence, "e2 80 ?" ("?" being 3f), in binary is: 11100010 10000000 00111111 which is invalid UTF-8: the first byte identifies this as a 3-byte sequence, but the final byte is not a "following" byte. In this case, preg_replace_callback returns an empty string.
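(Incidentally, here's a quick sanity check for strings like this, sketched without relying on mbstring since I'm not sure it's loaded here; it leans on the same UTF-8 validation inside PCRE:)

function is_valid_utf8( $s ) {
    // With the "u" modifier, PCRE validates the whole subject as UTF-8;
    // on bad input preg_match returns false rather than 0 or 1.
    return preg_match( '//u', $s ) !== false;
}

var_dump( is_valid_utf8( "\xe2\x80\x9d" ) );   // bool(true)  - a real close-quote
var_dump( is_valid_utf8( "\xe2\x80\x3f" ) );   // bool(false) - the sequence above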
- Exact version of PHP, platform, etc?
* MediaWiki: 1.9.3
* PHP: 5.2.1 (apache2handler)
* MySQL: 5.0.27-community-nt
* Apache/2.2.4 (Win32)
* MS Windows Server 2003, enterprise ed., SP 1
- Details of whether the mbstring and iconv modules are available in your PHP
How do I tell? There's nothing in PHP\ext for these, or in php.ini.
So I guess I see two issues:
* How does the text in the database get converted to bad UTF-8?
* Why does preg_replace_callback dump the entire input when it sees a bad UTF-8 character (with the "u" flag supplied in the pattern)? I can't see anything in the standard that would justify this.
Unicode 4.0.0, section 3.9, Unicode Encoding Forms, says:
If, on the other hand, a Unicode string purports to be in a Unicode encoding form, then it must contain only a well-formed code unit sequence. If there is an ill-formed code unit sequence in a source Unicode string, then a conformant process that verifies that the Unicode string is in a Unicode encoding form must reject the ill-formed code unit sequence.
But not the entire input, surely? So this looks like a PHP bug to me.
Ian
On 3/19/07, Ian Smith ismith@good.com wrote:
- Details of whether the mbstring and iconv modules are available in your PHP
How do I tell? There's nothing in PHP\ext for these, or in php.ini.
Since I am about to upgrade from 1.7.1 to 1.9.3 *and* my users like to copy-and-paste from Word to the Wiki, I'm very interested in seeing how this turns out...
To that end, I would do the following for this info: php -i | egrep "mb_string|mbstring|iconv"
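Or, from inside PHP itself (worth a cross-check, since the CLI can read a different php.ini than Apache does):

<?php
// Quick probe for the relevant extensions and functions:
var_dump( extension_loaded( 'mbstring' ) );
var_dump( extension_loaded( 'iconv' ) );
var_dump( function_exists( 'mb_check_encoding' ) );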
Dan
Dan Davis wrote:
To that end, I would do the following for this info: php -i | egrep "mb_string|mbstring|iconv"
Which yields:
$ php -i | egrep "mb_string|mbstring|iconv"
Registered Stream Filters => convert.iconv.*, string.rot13, string.toupper, string.tolower, string.strip_tags, convert.*, consumed, zlib.*
iconv
iconv support => enabled
iconv implementation => "libiconv"
iconv library version => 1.9
iconv.input_encoding => ISO-8859-1 => ISO-8859-1
iconv.internal_encoding => ISO-8859-1 => ISO-8859-1
iconv.output_encoding => ISO-8859-1 => ISO-8859-1
Thanks for the tip,
Ian
Well, it definitely seems that the preg_replace_callback issue is a PHP bug, so I've filed a report:
http://bugs.php.net/bug.php?id=40871
I still think there's a MediaWiki issue here, since something is generating the bad UTF-8 in the first place (I can't see it in the database, after all), and that's a violation of the Unicode standard.
So, comments? Should I report an MW bug?
Ian
Ian Smith wrote:
- Save the string which fails to a file and provide it.
When I save the "text" field from the "text" table to a file using MySql query browser, the offending section looks like this (the dodgy quotes are either side of "gpedit.msc" on the third line):
[snip]
Can you pull it from the actual string instead of the database? If you're already there, you can just save the string at that point in the code.
You should also check the revision's actual contents. The raw database may be using the wrong underlying encoding. MediaWiki is by default optimized for MySQL 4.0, and uses UTF-8 encoding for all data without caring what MySQL thinks it is. On MySQL 4.1 or later, this can mean that a raw fetch from MySQL returns unexpected variations on the encoding, depending on how you access it.
You can load the current revision of a particular page and save it to a file like so:
$title = Title::newFromText("My page name");
$rev = Revision::newFromTitle($title);
$text = $rev->getText();
file_put_contents("outfile.txt", $text);
(You can use maintenance/eval.php to run code within the MediaWiki framework from the command line.)
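A session would look roughly like this (the page title here is just a placeholder):

$ php maintenance/eval.php
> $title = Title::newFromText( "Some Page Title" );
> $rev = Revision::newFromTitle( $title );
> file_put_contents( "outfile.txt", $rev->getText() );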
-- brion vibber (brion @ wikimedia.org)
Can you pull it from the actual string instead of the database? If you're already there, you can just save the string at that point in the code.
... when I put a hex dump routine in matchAndRemove() in MagicWord.php, and dump the text which is being passed in, I get (ASCII printing chars displayed as ASCII, all others as hex):
20 R u n 20 t y p e 20 e2 80 9c g p e d i t . m s c e2 80 ? 20 a n d 20 p
The raw database may be using the wrong underlying encoding.
The table spec says:
CREATE TABLE `mywiki_text` ( ... ) ENGINE=MyISAM AUTO_INCREMENT=18452 DEFAULT CHARSET=latin1
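(For the record, the same information straight from the mysql command-line client, plus the server's connection settings, would be something like:)

SHOW CREATE TABLE mywiki_text;
SHOW VARIABLES LIKE 'character_set%';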
You can load the current revision of a particular page and save it to a file [snip] (You can use maintenance/eval.php to run code within the MediaWiki framework from the command line.)
Sweet! I've done that, and this is the offending section:
20 52 75 6e 20 74 79 70 65 20 e2 80 9c 67 70 65  > Run type ...gpe<
64 69 74 2e 6d 73 63 e2 80 3f 20 61 6e 64 20 70  >dit.msc..? and p<
The bad sequence (after "gpedit.msc") is "e2 80 3f": the same as what I got with my hex dump in the code.
If I create a new article with the excellent text
My froo "sploit" bong.
using fancy quotes (by pasting in from Word), it dumps as:
4d 79 20 66 72 6f 6f 20 e2 80 9c 73 70 6c 6f 69  >My froo ...sploi<
74 e2 80 9d 20 62 6f 6e 67 2e                    >t... bong.<
Here, the close quote is correctly represented as "e2 80 9d".
So... extended characters seem to work OK now; so I guess the old pages must have got messed up in the upgrade somehow. Maybe something somewhere along the line thought that 9d was an invalid latin-1 character, and replaced it with "?"? But why not 80, or 9c?
All of this would be mildly annoying (close-quotes being dumped from old articles) if the PHP/PCRE bug didn't significantly aggravate it... but I can't really think of a way that MediaWiki can alleviate or fix this. I suppose I need to do a search-and-replace in the DB to fix the bad codes.
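Something along these lines, maybe; a rough sketch only, since it assumes the only damage is close-quotes that became "?" (which I'd want to verify first), and it works on a copy of the text dumped via eval.php rather than on the database in place:

<?php
// Patch a dumped revision: replace the damaged bytes e2 80 3f with a
// proper close curly quote (e2 80 9d), then write the result back out.
$text = file_get_contents( "outfile.txt" );
echo "occurrences to fix: ", substr_count( $text, "\xe2\x80\x3f" ), "\n";
$fixed = str_replace( "\xe2\x80\x3f", "\xe2\x80\x9d", $text );
file_put_contents( "outfile-fixed.txt", $fixed );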
Ian
Ian Smith wrote: [snip]
The table spec says:
CREATE TABLE `mywiki_text` ( ... ) ENGINE=MyISAM AUTO_INCREMENT=18452 DEFAULT CHARSET=latin1
You can load the current revision of a particular page and save it to a file [snip] (You can use maintenance/eval.php to run code within the MediaWiki framework from the command line.)
Sweet! I've done that, and this is the offending section:
20 52 75 6e 20 74 79 70 65 20 e2 80 9c 67 70 65  > Run type ...gpe<
64 69 74 2e 6d 73 63 e2 80 3f 20 61 6e 64 20 70  >dit.msc..? and p<
The bad sequence (after "gpedit.msc") is "e2 80 3f": the same as what I got with my hex dump in the code.
Ok, can you confirm whether you have dumped this database from another MySQL instance (for instance with mysqldump or phpmyadmin) and loaded it into the current one?
In that case, it's possible that your data was corrupted during this transfer. The corruption comes from a two-way conversion from Windows-1252 ("latin1") to UTF-8 and back. Unlike a simple conversion from ISO 8859-1 to UTF-8 and back, this irrecoverably destroys the byte values in the 0x80-0x9f range which do not have assigned characters in Windows-1252.
To prevent the corruption, use the --default-character-set=latin1 option while dumping the original database with mysqldump. This prevents mysqldump from corrupting your data by applying false encoding conversions to the raw data.
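That is, something along these lines (database names here are placeholders):

mysqldump --default-character-set=latin1 --opt old_wikidb > wikidb.sql
mysql --default-character-set=latin1 new_wikidb < wikidb.sql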
- -- brion vibber (brion @ wikimedia.org)
Ok, can you confirm whether you have dumped this database from another MySQL instance (for instance with mysqldump or phpmyadmin) and loaded it into the current one?
Aha!!! Yes, that is indeed what I did. I was upgrading Apache, PHP and MySQL, so I cloned the old Wiki to the new setup on the same machine, then ran the MediaWiki upgrade.
In that case, it's possible that your data was corrupted during this transfer. The corruption comes from a two-way conversion from Windows-1252 ("latin1") to UTF-8 and back. Unlike a simple conversion from ISO 8859-1 to UTF-8 and back, this irrecoverably destroys the byte values in the 0x80-0x9f range which do not have assigned characters in Windows-1252.
OK, that makes perfect sense! Sorry, I guess I assumed that the mysqldump / restore -- particularly on the same machine -- would preserve the data. Clearly not.
To prevent the corruption, use the --default-character-set=latin1 option while dumping the original database with mysqldump. This prevents mysqldump from corrupting your data by applying false encoding conversions to the raw data.
Sounds like excellent advice for people following in my footsteps... too late for me, but thanks for getting to the bottom of this, anyhow.
Maybe a note about this should be added to the wiki-moving instructions:
http://www.mediawiki.org/wiki/Manual:Moving_a_wiki
Ian
Ian Smith wrote:
Ok, can you confirm whether you have dumped this database from another MySQL instance (for instance with mysqldump or phpmyadmin) and loaded it into the current one?
Aha!!! Yes, that is indeed what I did. I was upgrading Apache, PHP and MySQL, so I cloned the old Wiki to the new setup on the same machine, then ran the MediaWiki upgrade.
Ah good, one mystery solved. :)
Sorry about your data... :(
To prevent the corruption, use the --default-character-set=latin1 option while dumping the original database with mysqldump. This prevents mysqldump from corrupting your data by applying false encoding conversions to the raw data.
Sounds like excellent advice for people following in my footsteps... too late for me, but thanks for getting to the bottom of this, anyhow.
Maybe a note about this should be added to the wiki-moving instructions:
Added.
- -- brion vibber (brion @ wikimedia.org)