-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Rolf Lampa wrote:
Mwdumper (http://www.mediawiki.org/wiki/Mwdumper) crashes (around 35000 pages) when processing the en-WP dump as of 2007-05-27, with the following error:
[snippy]
35,000 pages (377.851/sec), 35,000 revs (377.851/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
[snippy]
Java HotSpot(TM) Client VM (build 1.5.0_12-b04, mixed mode, sharing)
OS: GNU/Linux Xubuntu 6.10
I find I get somewhat similar errors when trying to read from an incompletely-downloaded file:
38,000 pages (356.6/sec), 38,000 revs (356.6/sec)
39,000 pages (358.862/sec), 39,000 revs (358.862/sec)
Exception in thread "main" java.io.IOException: Invalid byte 1 of 1-byte UTF-8 sequence.
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)
Since mine happily reads past the point where you got your error, I'm guessing you've got an incomplete or corrupted download.
I'm not sure whether you're using the meta-current or the articles dump. The meta-current file is over 4 gigabytes, which I know can confuse some download tools (including some versions of wget), so you might want to make sure that you got the whole file intact.
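A quick first check, before anything slower, is to compare the size of the local file with the size shown in the download directory listing. The sketch below is only an illustration: the class name DumpSizeCheck, its methods, and the idea of passing the expected size by hand are assumptions, not part of mwdumper or wget. It also flags when the expected size is beyond what a 32-bit counter can hold, since that is the range where older tools tend to have trouble.

```java
import java.io.File;

// Hypothetical helper, not part of mwdumper: compares a local file's size
// with the size the download site reports for it.
class DumpSizeCheck {
    // Largest sizes representable in signed and unsigned 32-bit counters.
    static final long MAX_SIGNED_32 = 0x7FFFFFFFL;
    static final long MAX_UNSIGNED_32 = 0xFFFFFFFFL;

    /** True if a tool needs 64-bit file offsets to handle this size. */
    static boolean needs64BitOffsets(long size) {
        return size > MAX_SIGNED_32;
    }

    /** Describes how the local size relates to the expected size. */
    static String diagnose(long localSize, long expectedSize) {
        if (localSize == expectedSize) {
            return "complete";
        }
        if (localSize > expectedSize) {
            return "larger than expected";
        }
        if (expectedSize > MAX_UNSIGNED_32) {
            return "incomplete (expected size is over the 32-bit limit)";
        }
        return "incomplete";
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java DumpSizeCheck <file> <expected size in bytes>");
            System.exit(2);
        }
        long local = new File(args[0]).length();   // File.length() returns a long
        long expected = Long.parseLong(args[1]);
        System.out.println(diagnose(local, expected));
    }
}
```

A matching size does not prove the contents are intact, so this only rules out the obvious truncated-download case.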
Check the md5 checksums for that dump: http://download.wikimedia.org/enwiki/20070527/enwiki-20070527-md5sums.txt
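A minimal Java sketch of that checksum comparison, for when md5sum isn't at hand. It assumes the md5sums file uses the usual md5sum layout (hex digest, whitespace, file name) and it needs Java 8 or later, which is newer than the 1.5 runtime mentioned in this thread. The names DumpChecksum, md5Hex and expectedFor are made up for illustration and are not part of mwdumper.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of mwdumper.
class DumpChecksum {
    /** Streams the file through MD5 so a multi-gigabyte dump never sits in memory. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    /** Returns the digest listed for fileName in md5sum-style lines, or null if absent. */
    static String expectedFor(Iterable<String> lines, String fileName) {
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length != 2) {
                continue;
            }
            // md5sum marks binary mode with a leading '*' on the name.
            String name = parts[1].startsWith("*") ? parts[1].substring(1) : parts[1];
            if (name.equals(fileName)) {
                return parts[0].toLowerCase();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java DumpChecksum <dump file> <md5sums file>");
            System.exit(2);
        }
        Path dump = Paths.get(args[0]);
        String expected = expectedFor(Files.readAllLines(Paths.get(args[1])),
                                      dump.getFileName().toString());
        if (expected == null) {
            System.err.println("no entry for " + dump.getFileName());
            System.exit(2);
        }
        String actual = md5Hex(dump);
        System.out.println(actual.equals(expected) ? "OK" : "MISMATCH: got " + actual);
    }
}
```

Run it with the dump file and the downloaded md5sums file as the two arguments; a mismatch means the download should be fetched again.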
I tested on Ubuntu 7.04 (x86_64), with Sun Java 1.5:
$ /usr/lib/jvm/java-1.5.0-sun/bin/java -version
java version "1.5.0_11"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_11-b03, mixed mode)
and the mwdumper.jar copy from http://download.wikimedia.org/tools/mwdumper.jar
- -- brion vibber (brion @ wikimedia.org)