[Mediawiki-l] Tidy up invalid multibyte chars in xml-dumps

Rolf Lampa rolf.lampa at rilnet.com
Fri Jan 12 17:57:04 UTC 2007


Hi all,

I encountered numerous invalid multibyte characters in the svwiki 
(Swedish) xml-dump dated 2006-12-08. I found some 74 errors using 
Delphi's Utf8ToAnsi(S) conversion routine (Delphi returns '', that is, 
an empty string, on errors). I haven't confirmed it yet, but I suspect 
the malformed characters originate from interwiki links (a bad bot?).

Worse, the dump is not even accepted by Xml2Sql.exe, which is bad, to 
say the least.

Is there any cheap trick for doing a "bit- or byte-wise wash" of the xml 
dump files? Or is there a tool out there that can stream through huge 
files and tidy them up? (I can write my own "bitwise wash/mask" in 
Delphi if I only knew for sure which general bit pattern, if any, to 
apply to the file.)
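
One possible wash, sketched in Python rather than Delphi. Instead of 
masking bits with a fixed pattern, it runs the bytes through a strict 
incremental UTF-8 decoder and substitutes U+FFFD for anything 
malformed. The name wash_utf8 and the chunk size are illustrative 
assumptions, not part of any existing tool, and whether Xml2Sql accepts 
the washed output is untested.

```python
import codecs
import io


def wash_utf8(src, dst, chunk_size=1 << 20):
    """Stream bytes from src to dst, replacing malformed UTF-8 with U+FFFD.

    src and dst are binary file-like objects. Returns the number of U+FFFD
    characters written; this also counts any U+FFFD that was already
    present (and valid) in the input.
    """
    # The incremental decoder buffers a multibyte sequence that is split
    # across two chunks, so chunk boundaries do not cause false errors.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    replaced = 0
    while True:
        chunk = src.read(chunk_size)
        final = not chunk  # an empty read means end of input
        text = decoder.decode(chunk, final=final)
        replaced += text.count("\ufffd")
        dst.write(text.encode("utf-8"))
        if final:
            return replaced


if __name__ == "__main__":
    # Hypothetical sample: valid "smör" followed by a stray 0xFF byte.
    demo_in = io.BytesIO(b"sm\xc3\xb6r\xff")
    demo_out = io.BytesIO()
    count = wash_utf8(demo_in, demo_out)
    print(count, demo_out.getvalue())
```

For a real dump the two arguments would be files opened with 
open(path, "rb") and open(path, "wb"), so memory use stays at roughly 
one chunk regardless of file size. Replacing rather than dropping bad 
bytes keeps the damaged spots visible for later inspection.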

Hints about how to best fix/tidy up the xml-files, anyone?

// Rolf Lampa


