[Mediawiki-l] Tidy up invalid multibyte chars in xml-dumps
Rolf Lampa
rolf.lampa at rilnet.com
Fri Jan 12 17:57:04 UTC 2007
Hi all,
I encountered numerous invalid multibyte characters in the svwiki (Swedish)
xml-dump dated 2006-12-08. I found some 74 errors using Delphi's
Utf8ToAnsi(S) conversion routine (Delphi returns '', that is, an empty
string, on errors). I haven't confirmed it yet, but I suspect the
malformed characters originate from interwiki links (a bad bot?).
Worse, the dumps are not even accepted by Xml2Sql.exe, and that's bad
to say the least.
Is there any cheap trick for doing a "bit- or byte-wise wash" of the xml
dump files? Or is there any tool out there capable of streaming through
huge files and tidying them up? (I can write my own "bitwise wash/mask" in
Delphi if I just knew for sure what general bit pattern, if any, to
apply to the file.)
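
For reference, the general bit pattern of UTF-8 is: single bytes look like
0xxxxxxx, lead bytes of 2-, 3- and 4-byte sequences look like 110xxxxx,
1110xxxx and 11110xxx, and every continuation byte looks like 10xxxxxx. A
plain mask is not quite sufficient, though, because overlong encodings and
encoded surrogates match those patterns and are still invalid. Below is a
minimal sketch of such a streaming wash, written in Python rather than
Delphi and relying on Python's own UTF-8 decoder instead of a hand-written
mask. The function name wash_utf8 and the chunk size are assumptions made
for this illustration, and the sketch only repairs malformed UTF-8; it does
not check anything else about the XML.

```python
import codecs
import io


def wash_utf8(src, dst, chunk_size=1 << 20):
    """Copy the binary stream src to dst, replacing malformed UTF-8.

    Each malformed sequence is replaced by U+FFFD (the replacement
    character). Returns the number of U+FFFD characters written; note
    that this also counts any U+FFFD already present in valid input.
    wash_utf8 is a hypothetical helper name, not an existing tool.
    """
    # The incremental decoder keeps a multibyte sequence that is split
    # across two chunks in its buffer until the rest arrives.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    replaced = 0
    while True:
        chunk = src.read(chunk_size)
        final = not chunk  # an empty read means end of input
        text = decoder.decode(chunk, final)
        replaced += text.count("\ufffd")
        dst.write(text.encode("utf-8"))
        if final:
            break
    return replaced


if __name__ == "__main__":
    # Small in-memory demonstration; 0xFF can never occur in valid UTF-8.
    source = io.BytesIO(b"abc\xffdef")
    target = io.BytesIO()
    count = wash_utf8(source, target)
    print(count, target.getvalue())
```

For a real dump the two streams would be files opened in binary mode
("rb" and "wb"), so the whole file never has to fit in memory. Replacing
the bad bytes with U+FFFD instead of dropping them keeps the damaged spots
visible, which may help in tracking down where they came from.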
Hints about how to best fix/tidy up the xml-files, anyone?
// Rolf Lampa