-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Daniel Kinzler wrote:
2) Don't use Xerces' UTF-8 decoder; use the JRE's built-in one. I have prepared
a patch for this (see [2]), but I have not tested it extensively (only so far as
to check that it doesn't blow up in my face). I haven't even verified that it
actually fixes the problem (it takes half a day to get that far in processing
the dump; that thing is huge, and a small test case would be great for this). It
would also be good to know how using the JRE's decoder impacts performance.
Because of these open questions, I haven't committed the patch yet. Please play
with it if you have a couple of minutes for this kind of thing.
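
For readers who want to see the general idea in isolation: the following is a minimal sketch, not the patch from [2]. It assumes a plain JAXP SAX parser and hands it a character stream that the JRE has already decoded, so the parser's own UTF-8 decoder is never involved. The class name JreUtf8Parse and the method parseText are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the name is an assumption, not part of mwdumper.
class JreUtf8Parse {

    /** Parses XML from a byte stream and returns all character data found. */
    static String parseText(InputStream in) throws Exception {
        // The JRE decodes bytes to chars here. Because the InputSource
        // carries a character stream, the parser does not decode anything.
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        InputSource source = new InputSource(reader);

        final StringBuilder text = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per text node, so append.
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<page>caf\u00e9</page>".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseText(new ByteArrayInputStream(xml)));
    }
}
```

One consequence of this design is that any encoding named in the XML declaration is ignored, since the bytes are always decoded as UTF-8 before the parser sees them.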
In a quick test, performance is a bit slower, but more importantly it
will fail if the input file isn't in UTF-8. That should always be the
case in files we generate, but there's no guarantee. :)
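
If it helps, here is a small sketch of how one might check up front whether a chunk of input really is UTF-8, using only the JRE. It is an illustration under assumptions, not what the patch does: the class StrictUtf8Check and the method isValidUtf8 are hypothetical names. Note that an InputStreamReader built from a charset silently substitutes a replacement character for malformed input, so a decoder configured with CodingErrorAction.REPORT is used here to make the failure explicit.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is an assumption for illustration.
class StrictUtf8Check {

    /** Returns true if the bytes decode cleanly as UTF-8, false otherwise. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

For a full dump this would need to run on a stream rather than a byte array, but the same decoder settings apply when the decoder is passed to an InputStreamReader.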
Robert says he hasn't had any problems using the patched Xerces for
search engine index builds, so he's going to go ahead and slip it into
mwdumper's libs. (We bundle a copy of Xerces because there were other
problems with the default XML libs on some older Java versions.)
Shouldn't be too hard to whip up a smaller test case file, though...
- -- brion vibber (brion @ wikimedia.org)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkf1WRIACgkQwRnhpk1wk446hwCg0KohAKGaJRdUr3mKzOkIZYft
yQYAnjZa7kzCcQNgzgxnMhjhWV9iu53+
=5FWl
-----END PGP SIGNATURE-----