Hi,
This is a message I sent to the Python Wikipedia Bot mailing list today, which should be of concern for MediaWiki development as there might be other clients that send invalid UTF-8 (or even the MediaWiki software itself, as seen in [[es:Wikipedia:Registro_de_borrados]]).
---------- Forwarded Mail ----------
Subject: [pyWikipediaBot-users] Bot messed up databases on UTF-8 wikis
Date: Sunday, 18 July 2004 16:12
From: Daniel Herding DHerding@gmx.de
To: pywikipediabot-users@lists.sourceforge.net
Hi,
wikipedia.putpage() always sent its edit summary messages as Latin-1 or something, even if it was editing a UTF-8 wiki which expected UTF-8 summary messages. (This concerns all bot functions which can have non-ASCII characters in their summary messages.)
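To illustrate the mismatch, here is a minimal sketch of encoding a summary in the charset the target wiki expects. It is not the actual code from wikipedia.py; the function name encode_summary and the site_encoding parameter are made up for this example.

```python
def encode_summary(summary: str, site_encoding: str) -> bytes:
    """Encode an edit summary in the charset the target wiki expects.

    Hypothetical helper: neither the name nor the `site_encoding`
    parameter is taken from wikipedia.py.
    """
    # errors="replace" substitutes "?" for characters the charset
    # cannot represent, instead of raising UnicodeEncodeError.
    return summary.encode(site_encoding, errors="replace")


summary = "robot Ajoute: é"
as_latin1 = encode_summary(summary, "latin-1")  # "é" becomes the single byte 0xE9
as_utf8 = encode_summary(summary, "utf-8")      # "é" becomes the two bytes 0xC3 0xA9
```

The lone byte 0xE9 is not a valid UTF-8 sequence, so a Latin-1 summary sent to a UTF-8 wiki ends up as invalid data in the database.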
This, of course, troubled the SQL databases, and if you look at
http://fr.wikipedia.org/w/wiki.phtml?title=10_mars&action=history
with Mozilla, it will show flashy question marks instead of special characters. The same happened on nds:, where Andre ran the interwiki bot, and probably on many other Wikipedias. I just can't believe that nobody noticed this, and I'm quite angry that nobody reported this bug.
I fixed this bug yesterday, as you can see here:
http://fr.wikipedia.org/w/wiki.phtml?title=Utilisateur:Head&action=histo...
but the databases are already fucked up. The XML export special page (which is used by interwiki.py) now gives out XML that is not well-formed, which leads to a SAX parse error. And my newly created sqldump.py is unusable for these wikis.
So I guess we should ask the MediaWiki developers to help us out. Maybe they can shut down the wiki for a while, then run over the 'old' database, replacing every byte that is not valid UTF-8 with a question mark.
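The replacement step could look roughly like the sketch below. It only shows the byte-level cleanup for one field; how the rows are read from and written back to the database is left out, and the names question_mark_handler and sanitize_utf8 are invented for this example.

```python
import codecs


def question_mark_handler(error):
    """Decoding error handler: one "?" per undecodable byte."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    # error.start and error.end delimit the run of bytes that failed to decode.
    return "?" * (error.end - error.start), error.end


# Register the handler under a name chosen for this sketch.
codecs.register_error("question_mark", question_mark_handler)


def sanitize_utf8(raw: bytes) -> bytes:
    """Return `raw` with every invalid UTF-8 byte replaced by b"?"."""
    return raw.decode("utf-8", errors="question_mark").encode("utf-8")


cleaned = sanitize_utf8(b"caf\xc3\xa9 \xe9t\xe9")
```

Valid sequences such as the two-byte "é" pass through unchanged; only the stray Latin-1 bytes are replaced. The original characters cannot be recovered this way, but the data becomes parseable again.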
Daniel
----------- End of forwarded Mail ---------------
It would be nice if someone could repair the databases on fr:, nds:, es:, and other affected Wikipedias. You should also consider implementing a filter that stops users from posting illegal characters. Mail me if you need additional information.
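Such a filter boils down to a check like the following sketch. It is written in Python only to stay consistent with the bot code, not as a proposal for how MediaWiki should implement it, and the function name is_valid_utf8 is made up.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` decodes cleanly as UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# A submission would be rejected (or re-encoded) when the check fails.
accepted = is_valid_utf8("été".encode("utf-8"))
rejected = not is_valid_utf8("été".encode("latin-1"))
```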
Daniel