Nazeer Hussain wrote:
Hi,
I am using mwdumper.jar to convert the dump into SQL with the following
command on Ubuntu 7.10 with Java 1.5.0_13:
nohup java -jar mwdumper.jar
--format=sql:1.5 enwiki-latest-pages-articles.xml.bz2
--filter=titlematch:[bB].* > b.sql 2>mwdumper.log2 &
and I am getting the following error:
4,727,000 pages (1,685.36/sec), 4,727,000 revs (1,685.36/sec)
4,728,000 pages (1,685.5/sec), 4,728,000 revs (1,685.5/sec)
4,729,000 pages (1,685.604/sec), 4,729,000 revs (1,685.604/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown
I believe this is the Xerces UTF-8 decoder bug we've recently been
poking at. There should be a workaround with a patched Xerces library in
the latest mwdumper version...
I've updated the snapshot JAR, so go ahead and redownload it from:
http://download.wikimedia.org/tools/mwdumper.jar
and see if that resolves it.
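
For anyone following along, the redownload-and-retry step might look roughly like the sketch below. It is only an illustration: it assumes wget and java are on the PATH and that the dump is in the current directory, and the run helper and DRY_RUN variable are made up here, not part of mwdumper. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch only: refresh mwdumper.jar and repeat the conversion from the original post.
# Assumes wget and java are on PATH and the dump is in the current directory.
JAR_URL="http://download.wikimedia.org/tools/mwdumper.jar"
DUMP="enwiki-latest-pages-articles.xml.bz2"

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Keep the old JAR so the two builds can be compared if the error persists.
if [ -f mwdumper.jar ]; then
    run mv mwdumper.jar mwdumper.jar.old
fi

# Fetch the updated snapshot over the existing file name.
run wget -O mwdumper.jar "$JAR_URL"

# Same options, in the same order, as the original command. The filter is
# quoted so the shell cannot glob-expand it. Redirecting the SQL output and
# the log is left to the caller.
run java -jar mwdumper.jar --format=sql:1.5 "$DUMP" "--filter=titlematch:[bB].*"
```

Saved under a name of your choosing (refresh.sh here is just a placeholder), it could be started in the background as before with: DRY_RUN=0 nohup sh refresh.sh > b.sql 2> mwdumper.log2 &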
-- brion vibber (brion @ wikimedia.org)