Hi all,
Problems with mwdumper
Mwdumper (http://www.mediawiki.org/wiki/Mwdumper) crashes (around 35000 pages) when processing the en-WP dump as of 2007-05-27, with the following error:
root@xubuntu-svn:/home/admin/Desktop# jdk1.5.0_12/bin/java -jar mwdumper.jar --format=sql:1.5 enwp-200707 > enwp-200707.sql
...
32,000 pages (373.893/sec), 32,000 revs (373.893/sec)
33,000 pages (373.206/sec), 33,000 revs (373.206/sec)
34,000 pages (377.979/sec), 34,000 revs (377.979/sec)
35,000 pages (377.851/sec), 35,000 revs (377.851/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.skipChar(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:176)
    at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
    at org.mediawiki.dumper.Dumper.main(Unknown Source)
root@xubuntu:/home/admin/Desktop#
More info about the environment:
Java version:
root@xubuntu:/home/admin/Desktop# sudo ./jdk1.5.0_12/bin/java -version
java version "1.5.0_12"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_12-b04)
Java HotSpot(TM) Client VM (build 1.5.0_12-b04, mixed mode, sharing)
OS: GNU/Linux Xubuntu 6.10
Kernel release: 2.6.17-10-generic
Kernel version: #2 SMP Fri Oct 13 18:45:35 UTC 2006
Any ideas anyone?
Regards,
// Rolf Lampa
Rolf Lampa wrote:
> Mwdumper (http://www.mediawiki.org/wiki/Mwdumper) crashes (around 35000 pages) when processing the en-WP dump as of 2007-05-27, with the following error:
[snippy]
> 35,000 pages (377.851/sec), 35,000 revs (377.851/sec)
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
>     at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
[snippy]
> Java HotSpot(TM) Client VM (build 1.5.0_12-b04, mixed mode, sharing)
> OS: GNU/Linux Xubuntu 6.10
I find I get somewhat similar errors when trying to read from an incompletely-downloaded file:
38,000 pages (356.6/sec), 38,000 revs (356.6/sec)
39,000 pages (358.862/sec), 39,000 revs (358.862/sec)
Exception in thread "main" java.io.IOException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
    at org.mediawiki.dumper.Dumper.main(Unknown Source)
Since mine happily reads past the point where you got your error, I'm guessing you've got an incomplete or corrupted download.
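If you want to see where a file goes bad, something like the following rough Python sketch could help. It is not part of mwdumper, the function name is made up for illustration, and it assumes the dump has already been decompressed to plain XML. It streams the file through a strict UTF-8 decoder and reports the byte offset of the first invalid sequence, which you can then compare against the file size to tell a cut-off download from damage in the middle.

```python
import codecs


def first_invalid_utf8_offset(path, chunk_size=1 << 20):
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper for illustration; assumes an uncompressed file.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    fed = 0  # number of bytes handed to the decoder so far
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            # Bytes of an unfinished multi-byte character that the decoder
            # is still holding from the previous chunk.
            pending = len(decoder.getstate()[0])
            try:
                # An empty chunk means end of file, so flush with final=True.
                decoder.decode(chunk, final=not chunk)
            except UnicodeDecodeError as err:
                # err.start is relative to (pending bytes + current chunk).
                return fed - pending + err.start
            if not chunk:
                return None
            fed += len(chunk)
```

An offset close to the end of the file would point at a truncated download; an offset somewhere in the middle would suggest corruption instead.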
I'm not sure whether you're using the meta-current or the articles dump. The meta-current file is over 4 gigabytes, which I know can confuse some download tools (including some versions of wget), so you might want to make sure that you got the whole file intact.
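A cheap first check is to compare the size on disk with the size the server advertises. The sketch below is only an illustration: the function name is hypothetical, you have to supply the expected size yourself (for example from the download page), and the modulo-2**32 case is my assumption about how a 32-bit size counter could wrap, not a description of what any particular wget version does.

```python
import os


def classify_size(actual_bytes, expected_bytes):
    """Compare a local file size with the expected size (both in bytes)."""
    if actual_bytes == expected_bytes:
        return "match"
    # Assumption: a tool with a 32-bit size counter might stop at the
    # expected size modulo 2**32 for files of 4 GiB or more.
    if expected_bytes >= 2**32 and actual_bytes == expected_bytes % 2**32:
        return "wrapped"
    if actual_bytes < expected_bytes:
        return "truncated"
    return "larger"


def classify_file(path, expected_bytes):
    """Same check, reading the actual size from disk."""
    return classify_size(os.path.getsize(path), expected_bytes)
```

A "match" here does not prove the contents are intact, so the checksum comparison is still worth doing.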
Check the md5 checksums for that dump: http://download.wikimedia.org/enwiki/20070527/enwiki-20070527-md5sums.txt
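If you don't have an md5sum tool handy, a minimal Python sketch of the same check follows. The function names are made up for illustration, and I'm assuming the checksum file uses the usual md5sum layout of one "<hex digest>  <filename>" pair per line. The file is hashed in chunks so a multi-gigabyte dump does not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5sums_text, filename):
    """Look up filename in md5sum-style text; return its digest or None.

    Assumes lines of the form '<hex digest>  <filename>'.
    """
    for line in md5sums_text.splitlines():
        parts = line.split(None, 1)
        # md5sum marks binary-mode entries with a leading '*'.
        if len(parts) == 2 and parts[1].strip().lstrip("*") == filename:
            return parts[0].lower()
    return None
```

If md5_of_file() for your download differs from the value expected_md5() finds for the same file name, the download is not intact and needs to be fetched again.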
I tested on Ubuntu 7.04 (x86_64), with Sun Java 1.5:
$ /usr/lib/jvm/java-1.5.0-sun/bin/java -version
java version "1.5.0_11"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_11-b03, mixed mode)
and the mwdumper.jar copy from http://download.wikimedia.org/tools/mwdumper.jar
-- brion vibber (brion @ wikimedia.org)
mediawiki-l@lists.wikimedia.org