[Mediawiki-l] Tidy up invalid multibyte chars in xml-dumps
Brion Vibber
brion at pobox.com
Fri Jan 12 21:37:11 UTC 2007
Rolf Lampa wrote:
> Is there any cheap trick to make a "bit- or byte-wise wash" of the xml
> dump-files? Or is there any tool out there capable of streaming through
> huge files tidying them up? (I can make my own "bitwise wash/mask" with
> Delphi-code if I just knew for sure what general bit-pattern (if any) to
> apply on the file).
The dumps already go through this kind of verification as they're produced,
so it shouldn't be possible to get a file with invalid UTF-8 sequences.
Can you be specific about where in the file they occur?
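
One way to answer the "where in the file" question is to stream through
the dump and print the byte offset of every sequence that fails to decode.
The following is only a minimal sketch, not the check that the dump process
itself runs; the function name find_invalid_utf8 and its parameters are made
up for illustration. It reads in chunks so that a huge file never has to fit
in memory, and it carries a partial multibyte sequence over to the next
chunk rather than reporting it as an error.

```python
import io


def find_invalid_utf8(stream, chunk_size=1 << 20, limit=20):
    """Return up to `limit` byte offsets where UTF-8 decoding fails.

    `stream` is any binary file-like object (hypothetical helper,
    not part of MediaWiki or the dump tooling).
    """
    offsets = []
    base = 0    # file offset of the first byte currently held in `buf`
    buf = b""
    while len(offsets) < limit:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buf += chunk
        pos = 0
        while pos < len(buf) and len(offsets) < limit:
            try:
                buf[pos:].decode("utf-8")
                pos = len(buf)          # everything left decoded cleanly
            except UnicodeDecodeError as err:
                bad = pos + err.start
                # A sequence cut off by the chunk boundary may still be
                # completed by the next chunk, so keep it for later.
                if (not at_eof and pos + err.end == len(buf)
                        and len(buf) - bad < 4):
                    pos = bad
                    break
                offsets.append(base + bad)
                pos += err.end          # skip the offending byte(s)
        base += pos
        buf = buf[pos:]
        if at_eof:
            break
    return offsets


if __name__ == "__main__":
    sample = io.BytesIO(b"abc\xffdef")
    print(find_invalid_utf8(sample))    # prints [3]
```

For a real dump the stream would be something like open(path, "rb"), or
bz2.open(path, "rb") for a compressed file, where path is whatever file is
being checked. The reported offsets refer to the uncompressed data. Note
that this only checks UTF-8 validity; a character can be valid UTF-8 and
still be disallowed by XML.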
-- brion vibber (brion @ pobox.com)