Hi,
This is a message I sent to the Python Wikipedia Bot mailing list today; it should also be of concern for MediaWiki development, as there might be other clients that send invalid UTF-8 (or even the MediaWiki software itself, as seen in [[es:Wikipedia:Registro_de_borrados]]).
---------- Forwarded Mail ----------
Subject: [pyWikipediaBot-users] Bot messed up databases on UTF-8 wikis
Date: Sunday, 18 July 2004 16:12
From: Daniel Herding DHerding@gmx.de
To: pywikipediabot-users@lists.sourceforge.net
Hi,
wikipedia.putpage() always sent its edit summary messages as Latin-1 or something, even if it was editing a UTF-8 wiki which expected UTF-8 summary messages. (This concerns all bot functions which can have non-ASCII characters in their summary messages.)
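For illustration, the fix boils down to encoding the summary with the target wiki's own encoding before sending it. Here is a minimal Python sketch; the function name encode_edit_summary, the site_encoding parameter, and the example summary are made up for the illustration and are not the actual pywikipediabot code:

    # Minimal sketch, not the actual pywikipediabot code.
    def encode_edit_summary(summary, site_encoding):
        """Encode a Unicode edit summary in the encoding the target wiki
        expects: 'utf-8' on fr: or nds:, 'iso-8859-1' on older wikis."""
        return summary.encode(site_encoding)

    summary = u"robot: ajout de [[de:10. M\u00e4rz]]"
    good = encode_edit_summary(summary, "utf-8")   # what a UTF-8 wiki expects
    bad = summary.encode("iso-8859-1")             # the old, broken behaviour
    assert good != bad                             # different bytes, hence the corruption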
This, of course, corrupted the SQL databases, and if you look at
http://fr.wikipedia.org/w/wiki.phtml?title=10_mars&action=history
with Mozilla, it will show highlighted question marks instead of the special characters. The same happened on nds:, where Andre ran the interwiki bot, and probably on many other Wikipedias. I just can't believe that nobody noticed this, and I'm quite angry that nobody reported this bug.
I fixed this bug yesterday, as you can see here:
http://fr.wikipedia.org/w/wiki.phtml?title=Utilisateur:Head&action=histo...
but the databases are already messed up. The XML export special page (which interwiki.py uses) produces malformed XML, which leads to a SAX parse error. And my newly created sqldump.py is unusable for these wikis.
So I guess we should ask the MediaWiki developers to help us out. Maybe they can shut down the wiki for a while, then run over the 'old' database, replacing every invalid UTF-8 byte with a question mark.
Daniel
----------- End of forwarded Mail ---------------
It would be nice if someone could repair the databases on fr:, nds:, es:, and other affected Wikipedias. You should also consider implementing a filter that stops users from posting illegal characters. Mail me if you need additional information.
Daniel
Daniel Herding wrote:
> but the databases are already messed up. The XML export special page (which interwiki.py uses) produces malformed XML, which leads to a SAX parse error. And my newly created sqldump.py is unusable for these wikis.
> So I guess we should ask the MediaWiki developers to help us out. Maybe they can shut down the wiki for a while, then run over the 'old' database, replacing every invalid UTF-8 byte with a question mark.
Add a quick output filter to Special:Export; there's a particular character one's supposed to use for invalid chars (check the Unicode specs).
> It would be nice if someone could repair the databases on fr:, nds:, es:, and other affected Wikipedias. You should also consider implementing a filter that stops users from posting illegal characters. Mail me if you need additional information.
Patches are welcome, please send them to wikitech-l.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
> Add a quick output filter to Special:Export; there's a particular character one's supposed to use for invalid chars (check the Unicode specs).
Here's the one: U+FFFD REPLACEMENT CHARACTER, "used to replace an incoming character whose value is unknown or unrepresentable in Unicode".
In UTF-8 that should be "\xEF\xBF\xBD".
Note that LanguageUtf8.php contains a regexp for checking whether a string is valid UTF-8; you may find this useful.
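To make the idea concrete, here is a minimal sketch of such a filter in Python (only an illustration; the real filter would be PHP inside MediaWiki, and this uses Python's codec error handling instead of the LanguageUtf8.php regexp):

    def is_valid_utf8(raw):
        """True if the byte string decodes cleanly as UTF-8 (a rough
        stand-in for the LanguageUtf8.php regexp check)."""
        try:
            raw.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    def sanitize_utf8(raw):
        """Return well-formed UTF-8: every invalid sequence becomes
        U+FFFD, which is what the 'replace' error handler inserts."""
        return raw.decode("utf-8", "replace").encode("utf-8")

    # A stray Latin-1 byte (0xE9) becomes the replacement character:
    assert sanitize_utf8(b"r\xe9sum\xe9") == b"r\xef\xbf\xbdsum\xef\xbf\xbd"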
-- brion vibber (brion @ pobox.com)
On Monday, 19 July 2004 01:01, Brion Vibber wrote:
> Add a quick output filter to Special:Export; there's a particular character one's supposed to use for invalid chars (check the Unicode specs).
That would solve part of the problem (the minor part), but the database would still contain illegal characters, which means that the SQL dumps are broken as well. This is a pity because the Python Wikipedia bot can get information from these dumps (and thereby reduce server traffic), but not if a dump contains illegal characters. I could imagine that other people who want to reuse the dumps in other ways have similar problems.
So the real fix would be:
* Run through the 'old' database and replace every occurrence of an illegal character with the U+FFFD replacement character you mentioned.
* Implement a filter that performs these replacements when new data comes in (i.e. when a user saves a page), so that no new broken characters get in. Our bot is fixed now, but there might be users with outdated browsers etc. that send bad data (or even the MediaWiki software itself, as just happened on es:).
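As a rough illustration of what the one-off cleanup pass could look like, here is a Python sketch. The table and column names (old, old_id, old_text) are assumptions about the schema of that time, and a real maintenance script would read the table in batches rather than all at once:

    # Pseudocode-level sketch, not a tested maintenance script.
    import MySQLdb

    def fix_old_table(db):
        """Rewrite every stored revision whose text is not valid UTF-8,
        replacing invalid sequences with U+FFFD."""
        read = db.cursor()
        write = db.cursor()
        read.execute("SELECT old_id, old_text FROM old")
        for old_id, old_text in read.fetchall():
            fixed = old_text.decode("utf-8", "replace").encode("utf-8")
            if fixed != old_text:
                write.execute(
                    "UPDATE old SET old_text = %s WHERE old_id = %s",
                    (fixed, old_id))
        db.commit()

    # fix_old_table(MySQLdb.connect(db="frwiki", user="wiki", passwd="..."))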
A similar filter would have helped the ISO 8859-1 wikis to keep out illegal windows-1252 characters, but now it wouldn't be worth implementing because they will be converted to UTF-8 sooner or later anyway.
> Patches are welcome, please send them to wikitech-l.
I'm sorry, but I don't have any experience with MySQL and PHP.
Daniel