current enwiki dump not working with mwdumper - Wikitech-l

4 Apr 2008


Hi all,
When trying to process the current enwiki dump (specifically,
enwiki-20080312-pages-articles.xml.bz2) using mwdumper, it crashed on me with an
UTF-8-related I/O error in Xerces. The problem occurs (with slightly different
symptoms) with both Xerces 2.7.1 and 2.9.1; it is described in detail at [1].
Basically, Xerces' UTF-8 decoder is broken when a surrogate pair is split
across buffer reads. The problem was reported by Robert Stojnic aka rainman
last year, but apparently the attempt to fix it only changed the way it is
broken.
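To make the failure condition concrete, here is a minimal sketch of what a decoder has to get right. It does not reproduce the Xerces bug itself, and the class and method names are made up for illustration. It assumes the split falls inside the 4-byte UTF-8 sequence of a character outside the BMP (which Java represents as a surrogate pair), and it uses the JRE's java.nio.charset.CharsetDecoder, carrying the incomplete bytes over from one buffer read to the next.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class SplitSurrogateDemo {
    // Decodes UTF-8 bytes in chunks of at most chunkSize bytes, roughly the
    // way a parser sees them when it refills a fixed-size read buffer.
    public static String decodeInChunks(byte[] utf8, int chunkSize) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // Extra room for the bytes of an incomplete sequence carried over.
        ByteBuffer in = ByteBuffer.allocate(chunkSize + 4);
        // A UTF-8 byte sequence never decodes to more chars than it has bytes.
        CharBuffer out = CharBuffer.allocate(utf8.length + 2);
        int pos = 0;
        while (pos < utf8.length) {
            int n = Math.min(chunkSize, utf8.length - pos);
            in.put(utf8, pos, n);
            pos += n;
            in.flip();
            boolean endOfInput = (pos == utf8.length);
            CoderResult result = decoder.decode(in, out, endOfInput);
            if (result.isError()) {
                throw new IllegalStateException("decode failed: " + result);
            }
            // Keep any bytes of an incomplete sequence for the next round.
            in.compact();
        }
        CoderResult result = decoder.flush(out);
        if (result.isError()) {
            throw new IllegalStateException("flush failed: " + result);
        }
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D11E lies outside the BMP: 4 bytes in UTF-8, 2 chars in Java.
        String text = "a" + new String(Character.toChars(0x1D11E)) + "b";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        for (int chunkSize = 1; chunkSize <= utf8.length; chunkSize++) {
            boolean ok = decodeInChunks(utf8, chunkSize).equals(text);
            System.out.println("chunk size " + chunkSize + ": round trip ok = " + ok);
        }
    }
}
```

Chunk sizes of 1 to 4 all cut the 4-byte sequence somewhere in the middle, so each of them exercises the carry-over path.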
Anyway, the current dump isn't usable with mwdumper. This is not good. And it is
likely to happen again.
I see two ways to solve it:
1) Ship a patched version of Xerces with mwdumper, with Robert's patch applied
(see bug report at [1]). But there seems to be some problem with that patch
(discussed in the bug report), and relying on a patched version of a (supposedly)
standard lib feels a bit dirty.
2) Don't use Xerces' UTF-8 decoder; use the JRE's built-in one instead. I have
prepared a patch for this (see [2]), but I have not tested it extensively (only
so far as that it doesn't blow up in my face). I haven't even verified that it
actually fixes the problem (it takes half a day to get that far in processing
the dump - that thing is huge, a small test case would be great for this). It
would also be good to know how using the JRE's decoder impacts performance.
Because of these questions, I haven't committed the patch yet. Please play with
it if you have a couple of minutes for this kind of thing.
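For illustration, the general idea behind option 2 could look like the sketch below. This is not the patch at [2], and the class and helper names are hypothetical. It wraps the byte stream in an InputStreamReader so the JRE does the UTF-8 decoding, then hands the resulting Reader to the SAX parser via InputSource. It uses whatever parser the JDK's SAXParserFactory returns; whether a standalone Xerces build behaves the same way is an assumption that would need checking. The input is fed one byte at a time so that multi-byte sequences are always split across reads, which might also serve as a starting point for a small test case.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class JreDecoderDemo {
    // Parses XML from a raw byte stream and returns all character data.
    // The JRE decodes UTF-8; the parser only ever sees chars.
    public static String parseText(InputStream rawUtf8) {
        StringBuilder text = new StringBuilder();
        try {
            Reader reader = new InputStreamReader(rawUtf8, StandardCharsets.UTF_8);
            InputSource source = new InputSource(reader);
            SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    // Hypothetical test helper: a stream that hands out at most one byte per
    // bulk read, so every multi-byte sequence is split across reads.
    public static InputStream oneByteAtATime(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                return super.read(b, off, Math.min(len, 1));
            }
        };
    }

    public static void main(String[] args) {
        // U+1D11E lies outside the BMP and becomes a surrogate pair in Java.
        String clef = new String(Character.toChars(0x1D11E));
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><page>" + clef + "</page>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        String parsed = parseText(oneByteAtATime(bytes));
        System.out.println("round trip ok = " + parsed.equals(clef));
    }
}
```

One consequence of this design: when a parser is given a character stream, it ignores the encoding declaration in the XML prolog, so this approach only fits input that is known to be UTF-8.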
Regards,
Daniel
[1] Xerces bug report: https://issues.apache.org/jira/browse/XERCESJ-1257
[2] mwdumper patch: http://rafb.net/p/7c5bkg52.html