Hi all,
I encountered numerous invalid multibyte characters in the svwiki (Swedish) XML dump of 2006-12-08. Delphi's Utf8ToAnsi(S) conversion routine flagged some 74 errors (it returns '', an empty string, on error). I haven't confirmed it yet, but I suspect the malformed characters originate from interwiki links (a bad bot?).
However, the dumps are not even accepted by Xml2Sql.exe, which is bad, to say the least.
Is there any cheap trick for doing a bit- or byte-wise "wash" of the XML dump files? Or is there a tool out there that can stream through huge files and tidy them up? (I could write my own bitwise wash/mask in Delphi if I only knew for sure which general bit pattern, if any, to apply to the file.)
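To make the kind of wash I mean concrete: UTF-8 lead bytes look like 0xxxxxxx, 110xxxxx, 1110xxxx or 11110xxx, and every continuation byte looks like 10xxxxxx. Below is a minimal sketch in Python rather than Delphi, assuming the dump is supposed to be UTF-8 and that replacing each malformed sequence with U+FFFD is acceptable. The name wash_utf8 and the chunk size are made up for illustration, and it does not touch control characters that XML itself forbids.

```python
import codecs
import io


def wash_utf8(src, dst, chunk_size=1 << 20):
    """Copy the binary stream src to dst in chunks, replacing every
    malformed UTF-8 sequence with U+FFFD (the replacement character).

    Returns the number of U+FFFD characters written. Note that this
    also counts any U+FFFD already present in the input.
    """
    # The incremental decoder keeps an incomplete multibyte sequence
    # buffered between calls, so chunk boundaries cannot split a character.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    replaced = 0
    while True:
        chunk = src.read(chunk_size)
        final = not chunk  # an empty read means end of file
        text = decoder.decode(chunk, final)
        replaced += text.count("\ufffd")
        dst.write(text.encode("utf-8"))
        if final:
            break
    return replaced


if __name__ == "__main__":
    # Small in-memory demonstration: one stray 0xFF byte and one
    # truncated two-byte sequence at the very end.
    demo_in = io.BytesIO(b"abc\xff\xc3\xa5def\xc3")
    demo_out = io.BytesIO()
    print(wash_utf8(demo_in, demo_out, chunk_size=4))
```

For a real dump, the two streams would be files opened in binary mode ("rb" for input, "wb" for output), so memory use stays at roughly one chunk regardless of file size. Swapping errors="replace" for errors="ignore" would drop the bad bytes instead of marking them.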
Hints about how to best fix/tidy up the xml-files, anyone?
// Rolf Lampa