Daniel Kinzler wrote:
- Don't use Xerces' UTF-8 decoder, use the JRE's built-in one. I have prepared
a patch for this (see [2]), but I have not tested it extensively (only far enough to see that it doesn't blow up in my face). I haven't even verified that it actually fixes the problem (it takes half a day to get that far in processing the dump - that thing is huge, a small test case would be great for this). It would also be good to know how using the JRE's decoder impacts performance. Because of these questions, I haven't committed the patch yet. Please play with it if you have a couple of minutes for this kind of thing.
In a quick test, performance is a bit slower, but more importantly it will fail if the input file isn't in UTF-8. That should always be the case in files we generate, but there's no guarantee. :)
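To illustrate the "fails if the input isn't UTF-8" point, here is a rough sketch of handing a SAX parser a Reader backed by the JRE's UTF-8 decoder. This is not the patch itself: the class and method names are made up for the example, and I'm assuming a strict decoder (CodingErrorAction.REPORT) so that bad bytes raise an error instead of being silently replaced. When the parser is given a character stream, the encoding named in the XML declaration is not used for decoding, so a Latin-1 file is rejected as soon as it contains a non-ASCII byte.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of mwdumper or Xerces.
public class JreUtf8Parse {

    /**
     * Returns true if the bytes parse as XML when decoded strictly as UTF-8
     * by the JRE, false on any decoding or parse failure.
     */
    static boolean parsesAsUtf8(byte[] xml) throws Exception {
        // Assumption: report malformed input rather than substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), decoder);
        try {
            // With a character stream, the parser does no byte decoding itself.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(reader), new DefaultHandler());
            return true;
        } catch (IOException | SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] utf8 = "<a>\u00e9</a>".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00e9</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("UTF-8 input parses:   " + parsesAsUtf8(utf8));
        System.out.println("Latin-1 input parses: " + parsesAsUtf8(latin1));
    }
}
```

Note that a Latin-1 file containing only ASCII characters would still go through, since those bytes are identical in both encodings.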
Robert says he hasn't had any problems using the patched Xerces for search engine index builds, so he's going to go ahead and slip it into mwdumper's libs. (We bundle a copy of Xerces because there were other problems with the default XML libs on some older Java versions.)
Shouldn't be too hard to whip up a smaller test case file, though...
-- brion vibber (brion @ wikimedia.org)