On Monday, 19 July 2004 01:01, Brion Vibber wrote:
Add a quick output filter to Special:Export; there's a particular character one's supposed to use for invalid chars (check the Unicode specs).
That would solve part of the problem (the minor part), but the database would still contain illegal characters, which means that the SQL dumps are broken as well. This is a pity because the Python Wikipedia bot can get information from these dumps (and thereby decrease server traffic), but not if the dump contains illegal characters. I imagine that other people who want to reuse the dumps in other ways have similar problems.
So the real fix would be:
* Run through the 'old' database and replace every occurrence of illegal characters with the U+FFFD replacement character you mentioned.
* Implement a filter that does these replacements when new data comes in (i.e. when a user saves a page), so that no new broken characters get in.
Our bot is fixed now, but there might be users with outdated browsers etc. that send bad data (or even the MediaWiki software itself, as just happened on es:).
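To make the idea concrete, here is a rough sketch of such a filter in Python (the language of the bot, not of MediaWiki itself). It assumes that "illegal" means two things: byte sequences that are not valid UTF-8, and characters outside the XML 1.0 "Char" production. The names `sanitize` and `_INVALID_XML` are made up for this illustration and are not part of any existing code.

```python
import re

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Assumption: anything outside the XML 1.0 "Char" production counts as illegal:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF are allowed.
_INVALID_XML = re.compile(
    "[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def sanitize(raw: bytes) -> str:
    """Return text in which every illegal byte or character is U+FFFD."""
    # Malformed UTF-8 byte sequences are turned into U+FFFD by the decoder.
    text = raw.decode("utf-8", errors="replace")
    # Validly encoded characters that XML 1.0 forbids (e.g. NUL) are replaced too.
    return _INVALID_XML.sub(REPLACEMENT, text)
```

The same function could serve both steps: applied once to every stored revision, and applied again to each incoming page text on save. An actual patch would of course have to do the equivalent in PHP.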
A similar filter would have helped the ISO 8859-1 wikis keep out illegal Windows-1252 characters, but it would not be worth implementing now because those wikis will be converted to UTF-8 sooner or later anyway.
Patches are welcome, please send them to wikitech-l.
I'm sorry, but I don't have any experience with MySQL and PHP.
Daniel