Hi all
When I tried to process the current enwiki dump (specifically, enwiki-20080312-pages-articles.xml.bz2) with mwdumper, it crashed with a UTF-8-related I/O error in Xerces. The problem occurs, with slightly different symptoms, in both Xerces 2.7.1 and 2.9.1, and is described in detail at [1]. Basically, Xerces' UTF-8 decoder is broken for the case where a surrogate pair is split across buffer reads. Robert Stojnic (aka rainman) reported the problem last year, but the attempted fix apparently only changed the way it is broken.
Anyway, the current dump isn't usable with mwdumper. This is not good. And it is likely to happen again.
I see two ways to solve it:
1) Ship a patched version of Xerces with mwdumper, with Robert's patch applied (see the bug report at [1]). But there seems to be some problem with that patch (discussed in the bug report), and relying on a patched version of a (supposedly) standard lib feels a bit dirty.
2) Don't use Xerces' UTF-8 decoder; use the JRE's built-in one instead. I have prepared a patch for this (see [2]), but I have not tested it extensively (only far enough to see that it doesn't blow up in my face). I haven't even verified that it actually fixes the problem: it takes half a day to get that far in processing the dump (that thing is huge), so a small test case would be great for this. It would also be good to know how using the JRE's decoder affects performance. Because of these open questions, I haven't committed the patch yet. Please play with it if you have a couple of minutes for this kind of thing.
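To illustrate the idea behind option 2, here is a minimal sketch; it is not the patch from [2]. It assumes the dump is parsed through the standard SAX API, and the class and method names (JreUtf8Parse, parseText, TextCollector) are made up for this example. Because the parser is handed a Reader instead of a byte stream, the JRE does the UTF-8 decoding and the parser's own byte decoder is not involved.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of mwdumper.
public class JreUtf8Parse {

    /** Collects all character data so the caller can inspect it. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses XML from a byte stream, letting the JRE decode the UTF-8. */
    static String parseText(InputStream in) throws Exception {
        // The JRE turns bytes into chars here; the parser only sees chars.
        // (StandardCharsets needs Java 7+; older JREs take the name "UTF-8".)
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        InputSource source = new InputSource(reader);
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // U+1D11E lies outside the BMP: four bytes in UTF-8 and a
        // surrogate pair in Java's UTF-16 strings.
        String xml = "<page>\uD834\uDD1E</page>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        String text = parseText(new ByteArrayInputStream(bytes));
        System.out.println(text.codePointAt(0) == 0x1D11E);
    }
}
```

Note that this tiny input does not reproduce the crash. A real test case would presumably have to pad the document so that such a character straddles the parser's read buffer boundary, and I don't know the buffer size offhand. Also, when the parser is given a Reader, any encoding declaration in the XML is ignored, which should be fine as long as the dumps are always UTF-8.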
Regards, Daniel
[1] Xerces bug report: https://issues.apache.org/jira/browse/XERCESJ-1257
[2] mwdumper patch: http://rafb.net/p/7c5bkg52.html