Nazeer Hussain wrote:
Hi,
I am using mwdumper.jar to convert the dump into SQL, using the following command on Ubuntu 7.10 with Java 1.5.0_13:
nohup java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml.bz2 --filter=titlematch:[bB].* > b.sql 2>mwdumper.log2 &
and I am getting the following error:
4,727,000 pages (1,685.36/sec), 4,727,000 revs (1,685.36/sec)
4,728,000 pages (1,685.5/sec), 4,728,000 revs (1,685.5/sec)
4,729,000 pages (1,685.604/sec), 4,729,000 revs (1,685.604/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown
I believe this is the Xerces UTF-8 decoder bug we've been poking at recently. There should be a workaround with a patched Xerces library in the latest mwdumper version...
I've updated the snapshot JAR, so go ahead and redownload it from: http://download.wikimedia.org/tools/mwdumper.jar
and see if that resolves it.
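Something along these lines should do it. This is only a rough sketch: the function names has_xerces_crash and refetch_and_rerun are made up for illustration, it assumes wget is installed, and it reuses the file names from the command above. The filter argument is quoted here so the shell cannot glob-expand [bB].* before mwdumper sees it; otherwise the command is unchanged.

```shell
#!/bin/sh

# Hypothetical helper: succeeds if the given mwdumper log contains the
# ArrayIndexOutOfBoundsException reported above.
has_xerces_crash() {
    grep -q 'ArrayIndexOutOfBoundsException' "$1"
}

# Hypothetical helper: re-download the snapshot JAR, then rerun the same
# conversion in the background, writing to the same output and log files.
refetch_and_rerun() {
    wget -O mwdumper.jar http://download.wikimedia.org/tools/mwdumper.jar || return 1
    nohup java -jar mwdumper.jar --format=sql:1.5 \
        enwiki-latest-pages-articles.xml.bz2 \
        '--filter=titlematch:[bB].*' > b.sql 2> mwdumper.log2 &
}
```

Call refetch_and_rerun, and once the java process has exited, has_xerces_crash mwdumper.log2 tells you whether the same exception showed up again.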
-- brion vibber (brion @ wikimedia.org)