Hi,
I have been importing the English Wikipedia XML dumps every few
months (the last time was in June). I have always used xml2sql for this,
and it always worked for me. Now I have attempted the import on the latest
dump, enwiki-20090920-pages-articles.xml, and on the earlier dump,
enwiki-20090810-pages-articles.xml. Both fail with the same error:
$ xml2sql enwiki-20090920-pages-articles.xml
unexpected element <redirect>
xml2sql: parsing aborted at line 33 pos 16.
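One way to check whether the dump now contains an element xml2sql does not know about is to list the element names that appear in the first few pages. The following is only a rough sketch, not part of xml2sql: the helper name `element_names` and the `max_pages` limit are made up for illustration, and it assumes the dump has already been decompressed to plain XML.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def element_names(source, max_pages=5):
    """Return the sorted element names seen up to the end of page max_pages.

    source is a filename or a binary file object (hypothetical helper).
    """
    names = set()
    pages = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = local_name(elem.tag)
        if event == "start":
            names.add(name)
        elif name == "page":
            pages += 1
            elem.clear()  # drop the finished page so memory use stays small
            if pages >= max_pages:
                break
    return sorted(names)


if __name__ == "__main__":
    import sys

    # Usage: python list_elements.py enwiki-20090920-pages-articles.xml
    print(element_names(sys.argv[1]))
```

If `redirect` shows up in the output, the dump does contain the element that xml2sql rejects.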
So then I tried mwdumper, and after about 1.4 million pages it craps out:
……
1,423,000 pages (957.283/sec), 1,423,000 revs (957.283/sec)
1,424,000 pages (957.465/sec), 1,424,000 revs (957.465/sec)
Exception in thread "main" java.lang.IllegalArgumentException: Invalid contributor
    at org.mediawiki.importer.XmlDumpReader.closeContributor(Unknown Source)
    at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
    at org.mediawiki.dumper.Dumper.main(Unknown Source)
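The trace goes through emptyElement before reaching closeContributor, which suggests (but does not prove) that the parser hit a self-closing contributor element with no user name or IP inside it. A rough sketch for locating such elements in the dump follows. The helper name `empty_contributors` and the `limit` parameter are made up for illustration, and the child element names `username` and `ip` are an assumption about the dump format.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def empty_contributors(source, limit=10):
    """Return up to `limit` (page title, attributes) pairs for contributor
    elements that have neither a username nor an ip child.

    source is a filename or a binary file object (hypothetical helper).
    """
    hits = []
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text  # remember which page we are in
        elif name == "contributor":
            children = {local_name(child.tag) for child in elem}
            if not children & {"username", "ip"}:  # assumed child names
                hits.append((title, dict(elem.attrib)))
                if len(hits) >= limit:
                    break
        elif name == "page":
            elem.clear()  # drop the finished page's contents
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_contributors.py enwiki-20090920-pages-articles.xml
    for page_title, attrs in empty_contributors(sys.argv[1]):
        print(page_title, attrs)
```

On the full dump this will take a while, since it has to read past the first 1.4 million pages before reaching the spot where mwdumper stopped.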
I also tried importDump.php (MediaWiki 1.14.0), and I get errors like these:
…
Warning: xml_parse(): Unable to call handler in_() in /var/www/includes/Import.php on line 437
Warning: xml_parse(): Unable to call handler in_() in /var/www/includes/Import.php on line 437
Warning: xml_parse(): Unable to call handler out_() in /var/www/includes/Import.php on line 437
….
(Sorry, I don't know where these errors start; it processes a few
thousand pages before failing, and by then I get sick of watching it.)
Has the format of the XML files changed? I could swear that as of June,
or maybe May, I had xml2sql working. I know that I might need to upgrade
MediaWiki to 1.15; however, importDump.php usually does not work for the
English Wikipedia anyway.
I would be grateful for any ideas.
Thanks guys,
O. O.
P.S.
http://download.wikimedia.org/tools/ does not have the source of
MWDumper. I thought this was open source?