Can you pull it from the actual string instead of the database? If you're already there, you can just save the string at that point in the code.
... when I put a hex dump routine in matchAndRemove() in MagicWord.php and dump the text being passed in, I get (printable ASCII chars shown as ASCII, all others as hex):
20 R u n 20 t y p e 20 e2 80 9c g p e d i t . m s c e2 80 ? 20 a n d 20 p
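(A hypothetical sketch of that sort of dump helper, in case anyone wants to reproduce this; the name hexDumpMixed is made up, and the real routine may differ. Printable ASCII is passed through, everything else comes out as two hex digits:)

function hexDumpMixed( $text ) {
	$out = array();
	$len = strlen( $text );
	for ( $i = 0; $i < $len; $i++ ) {
		$ord = ord( $text[$i] );
		// Printable ASCII (excluding space) as-is; space and everything else as hex
		if ( $ord > 0x20 && $ord < 0x7f ) {
			$out[] = $text[$i];
		} else {
			$out[] = sprintf( '%02x', $ord );
		}
	}
	return implode( ' ', $out );
}

// e.g. near the top of MagicWord::matchAndRemove():
// wfDebug( hexDumpMixed( $text ) . "\n" );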
The raw database may be using the wrong underlying encoding.
The table spec says:
CREATE TABLE `mywiki_text` ( ... ) ENGINE=MyISAM AUTO_INCREMENT=18452 DEFAULT CHARSET=latin1
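(If anyone wants to double-check what MySQL itself reports, a hypothetical standalone snippet; the connection details are placeholders:)

$db = new mysqli( 'localhost', 'wikiuser', 'secret', 'wikidb' );
$res = $db->query( "SHOW CREATE TABLE mywiki_text" );
$row = $res->fetch_row();
echo $row[1], "\n";	// includes the DEFAULT CHARSET clause

$res = $db->query( "SHOW VARIABLES LIKE 'character_set%'" );
while ( $row = $res->fetch_row() ) {
	echo $row[0], ' = ', $row[1], "\n";
}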
You can load the current revision of a particular page and save it to a file [snip] (You can use maintenance/eval.php to run code within the MediaWiki framework from the command line.)
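(The snipped code isn't reproduced here, but a rough eval.php sketch along these lines would do it, assuming a MediaWiki old enough to still have Revision::getText(); the page title and output path are just examples:)

// paste into maintenance/eval.php
$title = Title::newFromText( 'Some Page' );
$rev = Revision::newFromTitle( $title );
file_put_contents( '/tmp/page.txt', $rev->getText() );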
Sweet! I've done that, and this is the offending section:
20 52 75 6e 20 74 79 70 65 20 e2 80 9c 67 70 65  > Run type ...gpe<
64 69 74 2e 6d 73 63 e2 80 3f 20 61 6e 64 20 70  >dit.msc..? and p<
The bad sequence (after "gpedit.msc") is "e2 80 3f": the same as what I got with my hex dump in the code.
If I create a new article with the excellent text
My froo “sploit” bong.
using fancy quotes (by pasting in from Word), it dumps as:
4d 79 20 66 72 6f 6f 20 e2 80 9c 73 70 6c 6f 69  >My froo ...sploi<
74 e2 80 9d 20 62 6f 6e 67 2e                     >t... bong.<
Here, the close quote is correctly represented as "e2 80 9d".
So... extended characters seem to work OK now; I guess the old pages must have got messed up in the upgrade somehow. Maybe something somewhere along the line thought that 9d was an invalid latin-1 character and replaced it with "?"? But why not 80, or 9c?
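(For what it's worth, 80 and 9c are assigned in Windows-1252 (the euro sign and the oe ligature) while 9d is not, so a conversion step that treated the data as CP1252 rather than strict latin-1 would choke on exactly that byte. A quick, hypothetical way to check the mappings, assuming the iconv extension is available:)

foreach ( array( 0x80, 0x9c, 0x9d ) as $byte ) {
	// iconv fails (or maps to a C1 control, depending on the implementation)
	// for bytes that Windows-1252 leaves undefined
	$utf8 = @iconv( 'CP1252', 'UTF-8', chr( $byte ) );
	printf( "%02x => %s\n", $byte, $utf8 === false ? 'no mapping' : bin2hex( $utf8 ) );
}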
All of this would be mildly annoying (close-quotes being dumped from old articles) if the PHP/PCRE bug didn't significantly aggravate it... but I can't really think of a way that MediaWiki can alleviate or fix this. I suppose I need to do a search-and-replace in the DB to fix the bad codes.
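(A rough sketch of what that search-and-replace might look like; the connection details are made up, it assumes every remaining e2 80 3f really was a close quote, and it skips gzipped rows since a blind REPLACE() would corrupt them. Back up the text table first:)

$db = new mysqli( 'localhost', 'wikiuser', 'secret', 'wikidb' );
$db->query(
	"UPDATE mywiki_text" .
	" SET old_text = REPLACE(old_text, UNHEX('E2803F'), UNHEX('E2809D'))" .
	" WHERE old_flags NOT LIKE '%gzip%'"
);
echo $db->affected_rows, " rows touched\n";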
Ian