Hi,
I have been importing the English Wikipedia XML dumps every few months (the last time was in June). I used xml2sql and it always worked for me. Now I have attempted the import on the latest dump, enwiki-20090920-pages-articles.xml, and on the earlier dump enwiki-20090810-pages-articles.xml; both give this error:
$ xml2sql enwiki-20090920-pages-articles.xml
unexpected element <redirect>
xml2sql: parsing aborted at line 33 pos 16.
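One possible workaround for the xml2sql error would be to filter the new element out of the dump before parsing. The following is only a minimal sketch: it assumes that the element appears as an empty tag on a line of its own (for example `<redirect />`), which has not been checked against the whole dump, and the names `strip_redirect_lines` and `filter_stream` are made up for illustration.

```python
import re
import sys

# Assumption: the element xml2sql rejects is an empty tag that sits alone
# on its line, e.g. b"    <redirect />\n". Lines where the tag is mixed
# with other content are deliberately left untouched.
REDIRECT_LINE = re.compile(rb"^\s*<redirect\s*/>\s*$")


def strip_redirect_lines(lines):
    """Yield every line except those holding only an empty <redirect /> tag."""
    for line in lines:
        if not REDIRECT_LINE.match(line):
            yield line


def filter_stream(src, dst):
    """Copy a binary stream line by line, dropping the <redirect /> lines.

    Working on bytes avoids any dependence on the locale's text encoding.
    """
    for line in strip_redirect_lines(src):
        dst.write(line)


# Hypothetical usage as a pipe filter in front of xml2sql:
#   filter_stream(sys.stdin.buffer, sys.stdout.buffer)
```

Because it reads one line at a time, the filter should not need to hold the dump in memory. Whether xml2sql then accepts the rest of the file is a separate question.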
So then I tried mwdumper, and after 1.4 million pages it craps out:
……
1,423,000 pages (957.283/sec), 1,423,000 revs (957.283/sec)
1,424,000 pages (957.465/sec), 1,424,000 revs (957.465/sec)
Exception in thread "main" java.lang.IllegalArgumentException: Invalid contributor
        at org.mediawiki.importer.XmlDumpReader.closeContributor(Unknown Source)
        at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
        at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)
I also tried importDump.php (MediaWiki 1.14.0) and I get errors of this kind:
…
Warning: xml_parse(): Unable to call handler in_() in /var/www/includes/Import.php on line 437
Warning: xml_parse(): Unable to call handler in_() in /var/www/includes/Import.php on line 437
Warning: xml_parse(): Unable to call handler out_() in /var/www/includes/Import.php on line 437
….
(Sorry, I don't know where this error starts. It processes a few thousand pages before failing, by which point I am sick of looking at it.)
Does anyone know whether the format of the XML files has changed? I could swear that as of June, or maybe May, I had xml2sql working. I know that I might need to upgrade MediaWiki to 1.15, but importDump.php usually does not work for the English Wikipedia anyway.
I would be grateful for any ideas. Thanks guys, O. O.
P.S. http://download.wikimedia.org/tools/ does not have the source of MWDumper. I thought this was open source?