Hi,
I am using mwdumper.jar to convert the dump into SQL with the following command, on Ubuntu 7.10 with Java 1.5.0_13:
nohup java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml.bz2 --filter=titlematch:[bB].* > b.sql 2>mwdumper.log2 &
and I am getting the following error:
4,727,000 pages (1,685.36/sec), 4,727,000 revs (1,685.36/sec)
4,728,000 pages (1,685.5/sec), 4,728,000 revs (1,685.5/sec)
4,729,000 pages (1,685.604/sec), 4,729,000 revs (1,685.604/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:176)
    at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
    at org.mediawiki.dumper.Dumper.main(Unknown Source)
Any idea, anyone? What's going on?
I have checked this thread, but it was of no use: https://lists.wikimedia.org/mailman/htdig/mediawiki-l/2007-July/021537.html
The md5sum of the downloaded dump file is correct. Can someone please help me out with this? Has anyone been able to import the latest dump (20080312) successfully?
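A matching md5sum only shows that the download is intact, not that the decompressed XML is valid UTF-8. As a rough cross-check that does not go through Xerces, here is a minimal sketch that streams the .bz2 file through Python's strict UTF-8 decoder. The function name find_utf8_error and the chunk size are made up for illustration, and it assumes a Python 3 interpreter is available. A clean result would suggest the exception comes from the parser rather than from the data.

```python
import bz2
import codecs


def find_utf8_error(path, chunk_size=1 << 20):
    """Stream-decompress a .bz2 file and check that it is valid UTF-8.

    Returns None if the whole stream decodes cleanly; otherwise returns the
    offset (in decompressed bytes) of the start of the chunk in which strict
    decoding failed. The offset is chunk-granular, not exact.
    """
    # The incremental decoder carries multi-byte sequences that are split
    # across chunk boundaries, so chunking does not cause false errors.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    offset = 0
    with bz2.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            try:
                decoder.decode(chunk)
            except UnicodeDecodeError:
                return offset
            offset += len(chunk)
    try:
        # Flush the decoder: fails if the stream ends mid-character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return offset
    return None
```

Calling find_utf8_error("enwiki-latest-pages-articles.xml.bz2") on the full dump will take a while, since the whole archive has to be decompressed once.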
Thanks, Nazeer
Nazeer Hussain wrote:
Hi,
I am using mwdumper.jar to convert the dump into SQL with the following command, on Ubuntu 7.10 with Java 1.5.0_13:
nohup java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml.bz2 --filter=titlematch:[bB].* > b.sql 2>mwdumper.log2 &
and I am getting the following error:
4,727,000 pages (1,685.36/sec), 4,727,000 revs (1,685.36/sec)
4,728,000 pages (1,685.5/sec), 4,728,000 revs (1,685.5/sec)
4,729,000 pages (1,685.604/sec), 4,729,000 revs (1,685.604/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown
I believe this is the Xerces UTF-8 decoder bug we've recently been poking at. There should be a workaround with a patched Xerces library in the latest mwdumper version...
I've updated the snapshot JAR, so go ahead and redownload it from: http://download.wikimedia.org/tools/mwdumper.jar
and see if that resolves it.
-- brion vibber (brion @ wikimedia.org)
mediawiki-l@lists.wikimedia.org