[Mediawiki-l] Tidy up invalid multibyte chars in xml-dumps

Brion Vibber brion at pobox.com
Fri Jan 12 21:37:11 UTC 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Rolf Lampa wrote:
> Is there any cheap trick to make a "bit- or byte-wise wash" of the xml 
> dump-files? Or is there any tool out there capable of streaming through 
> huge files tidying them up? (I can make my own "bitwise wash/mask" with 
> Delphi-code if I just knew for sure what general bit-pattern (if any) to 
> apply on the file).

That kind of verification is already run on the dumps as they're produced,
so it shouldn't be possible to get a file containing invalid UTF-8 sequences.

Can you be specific about where in the file they occur?
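
If it helps to pin down the positions, a rough Python sketch along these
lines could stream through a large file and report the byte offsets where
UTF-8 decoding fails. This is only an illustration, not the check that the
dump process itself runs; the function name find_invalid_utf8 and the chunk
size are made up for the example.

```python
import io


def find_invalid_utf8(stream, chunk_size=1 << 20):
    """Return absolute byte offsets at which UTF-8 decoding fails.

    `stream` is any binary file-like object. The input is read in chunks,
    so the whole file never has to fit in memory.
    """
    offsets = []
    base = 0        # absolute offset of data[0] in the stream
    pending = b""   # bytes carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        data = pending + chunk
        pending = b""
        pos = 0
        while pos < len(data):
            try:
                # Slicing copies the tail; acceptable for a sketch.
                data[pos:].decode("utf-8")
                pos = len(data)
            except UnicodeDecodeError as err:
                start = pos + err.start
                end = pos + err.end
                if end == len(data) and not at_eof:
                    # The failure touches the end of the buffer, so it may
                    # just be a sequence split across chunks. Re-check it
                    # once more data has been read.
                    pending = data[start:]
                    break
                offsets.append(base + start)
                pos = end
        base += len(data) - len(pending)
        if at_eof:
            return offsets


# Small self-check: 0xFF is never valid, and the final 0xC3 is a lead byte
# with no continuation byte after it.
sample = io.BytesIO(b"abc\xffdef\xe2\x82\xacgh\xc3")
bad_offsets = find_invalid_utf8(sample, chunk_size=3)
print(bad_offsets)
```

On a real dump you would pass a file opened in binary mode ("rb") in place
of the BytesIO object. Note that this only checks UTF-8 well-formedness; it
does not check whether the decoded characters are allowed in XML.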

- -- brion vibber (brion @ pobox.com)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFp/+HwRnhpk1wk44RAlnZAKDVsWOnsbAVkbLhnLQlyUUuV6AXDQCgzGei
eUTQEgPoXttwbG1rKqUCnTI=
=pUfN
-----END PGP SIGNATURE-----


