Hi,
I have been importing the English Wikipedia XML dumps every few months (the last time was in June). I have always used xml2sql, and it always worked for me. This time I tried the latest dump, enwiki-20090920-pages-articles.xml, and also the earlier enwiki-20090810-pages-articles.xml; both fail with this error:
$ xml2sql enwiki-20090920-pages-articles.xml
unexpected element <redirect>
xml2sql: parsing aborted at line 33 pos 16.
So then I try mwdumper and after 1.4 M Pages, it craps out:
……
1,423,000 pages (957.283/sec), 1,423,000 revs (957.283/sec)
1,424,000 pages (957.465/sec), 1,424,000 revs (957.465/sec)
Exception in thread "main" java.lang.IllegalArgumentException: Invalid contributor
    at org.mediawiki.importer.XmlDumpReader.closeContributor(Unknown Source)
    at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
    at org.mediawiki.dumper.Dumper.main(Unknown Source)
I tried the importDump.php and I get errors of the kind (MediaWiki 1.14.0)
…
Warning: xml_parse(): Unable to call handler in_() in /var/www/includes/Import.php on line 437
Warning: xml_parse(): Unable to call handler in_() in /var/www/includes/Import.php on line 437
Warning: xml_parse(): Unable to call handler out_() in /var/www/includes/Import.php on line 437
….
(Sorry I don’t know where this error starts, but it processes a few thousand pages, up till I get sick of looking at it before failing.)
Has the format of the XML files changed? I could swear that as of June, or maybe May, I had xml2sql working. I know that I might need to upgrade MediaWiki to 1.15; however, importDump.php usually does not work for the English Wikipedia anyway.
I would be grateful for any ideas. Thanks guys, O. O.
P.S. http://download.wikimedia.org/tools/ does not have the source of MWDumper. I thought this was open source?
O. O. writes:
(Sorry I don’t know where this error starts, but it processes a few thousand pages, up till I get sick of looking at it before failing.)
Has the format of the XML files changed? I could swear that as of June, or maybe May, I had xml2sql working. I know that I might need to upgrade MediaWiki to 1.15; however, importDump.php usually does not work for the English Wikipedia anyway.
I would be grateful for any ideas. Thanks guys, O. O.
Seems it fails on the new <redirect> tag.
P.S. http://download.wikimedia.org/tools/ does not have the source of MWDumper. I thought this was open source?
MWDumper source is available at http://svn.wikimedia.org/viewvc/mediawiki/trunk/mwdumper/
It should be noted in the README.
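The diagnosis above (a reader that aborts on an element it does not know) can be illustrated with a small sketch. This is not how xml2sql or MWDumper are implemented; the names `read_pages` and `KNOWN_PAGE_CHILDREN` are hypothetical, and the sample input omits the XML namespace that real dumps declare. It only shows the difference between a strict reader and one that tolerates an unknown child of <page>.

```python
import xml.etree.ElementTree as ET

# Assumed subset of the children a pre-<redirect> reader would accept.
KNOWN_PAGE_CHILDREN = {"title", "id", "revision"}


def read_pages(stream, strict):
    """Yield (title, is_redirect) for each <page> in a binary XML stream.

    With strict=True an unknown child of <page> raises ValueError,
    mimicking a reader built against the older schema.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "page":
            continue
        for child in elem:
            if strict and child.tag not in KNOWN_PAGE_CHILDREN:
                raise ValueError(f"unexpected element <{child.tag}>")
        yield elem.findtext("title"), elem.find("redirect") is not None
        elem.clear()  # free the finished page so memory stays bounded
```

A tolerant reader simply records the presence of the tag and moves on, which is roughly what an updated importer would need to do.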
Platonides wrote:
Seems it fails on the new <redirect> tag.
P.S. http://download.wikimedia.org/tools/ does not have the source of MWDumper. I thought this was open source?
MWDumper source is available at http://svn.wikimedia.org/viewvc/mediawiki/trunk/mwdumper/
It should be noted in the README.
Thanks Platonides. With the new <redirect> tag, is there any way to import the new XML files?
Could I simply strip the <redirect /> tags out of the file to get MWDumper to work? Or, if I upgrade to MediaWiki 1.16, would import.php work without any problems?
(Thanks for the pointer to the source of MW Dumper. The Source is not mentioned in the Readme. However, I found it too complicated - or not well documented for me at this point.)
Thanks again, O.O.
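A minimal sketch of the stripping idea asked about above, assuming the tag appears only as an empty, self-closing element (<redirect /> or <redirect/>). The function name `strip_redirect_tags` is hypothetical, and nothing in this thread confirms that the stripped file then imports cleanly with MWDumper or xml2sql; the MWDumper failure quoted earlier reports "Invalid contributor", which may be a separate problem.

```python
import re

# Matches an empty, self-closing redirect element such as <redirect /> or <redirect/>.
REDIRECT_RE = re.compile(r"<redirect\s*/>")


def strip_redirect_tags(lines):
    """Yield the input lines with empty <redirect /> elements removed.

    A line that contained nothing but the tag is dropped entirely;
    every other line is passed through (minus the tag, if present).
    """
    for line in lines:
        if "<redirect" not in line:
            yield line  # fast path: the vast majority of lines
            continue
        cleaned = REDIRECT_RE.sub("", line)
        if cleaned.strip() or not REDIRECT_RE.search(line):
            yield cleaned
        # otherwise the line held only the tag, so it is skipped
```

Because it works line by line, it can be fed an open file (for example one opened with encoding="utf-8") and written out with writelines(), so the multi-gigabyte dump never has to fit in memory.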
O. O. wrote:
Thanks Platonides. With the new <redirect> tag, is there any way to import the new XML files?
Could I simply strip the <redirect /> tags out of the file to get MWDumper to work? Or, if I upgrade to MediaWiki 1.16, would import.php work without any problems?
If it's failing due to an old xsd then ..
The updated xsd and copy of Import.php just got checked into our repositories so you can either pull this
http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=54472
and increase the version number ala
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Export.php?r...
Or you can wait till the next tagged release which will likely include this.
(Thanks for the pointer to the source of MW Dumper. The Source is not mentioned in the Readme. However, I found it too complicated - or not well documented for me at this point.)
I'll have a peek at this and see if it can be improved.
--tomasz
Tomasz Finc wrote:
If it's failing due to an old xsd then ..
The updated xsd and copy of Import.php just got checked into our repositories so you can either pull this
http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=54472
and increase the version number ala
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Export.php?r...
Or you can wait till the next tagged release which will likely include this.
Thanks Tomasz. I don’t mind waiting for your next release if it is going to be in the next month or so.
(Thanks for the pointer to the source of MW Dumper. The Source is not mentioned in the Readme. However, I found it too complicated - or not well documented for me at this point.)
I'll have a peek at this and see if it can be improved.
I hope someone can update MWDumper to the new XSD; it would help a lot with importing Wikipedia dumps, because importDump.php is not practical.
Hi!
I have got the same "<redirect>" problem while importing the dump of Russian Wiktionary. :(
Best regards, Andrew Krizhanovsky.
On Fri, Oct 9, 2009 at 3:46 AM, O. O. olson_ot@yahoo.com wrote:
I hope someone can update MWDumper to the new XSD; it would help a lot with importing Wikipedia dumps, because importDump.php is not practical.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Andrew Krizhanovsky wrote:
Hi!
I have got the same "<redirect>" problem while importing the dump of Russian Wiktionary. :(
Best regards, Andrew Krizhanovsky.
So Andrew, do you import using importDump.php, MWDumper or xml2sql? I am curious to know what others are using for their imports. (This is for my personal knowledge.)
From grepping through the English Wikipedia dump, it seems that the “<redirect />” tags are mostly empty. I hope someone can fix this soon.
Thanks to you guys, O. O.
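The grep check described above could be made a little more precise with a sketch like the following, which counts how many <redirect> tags are empty and how many carry anything else. The function name `count_redirect_tags` is hypothetical, and the non-empty form used in the usage example (a tag with a title attribute) is an assumption for illustration, not something reported in this thread.

```python
import re
from collections import Counter

# Any opening or self-closing redirect tag, with or without attributes.
ANY_REDIRECT_RE = re.compile(r"<redirect\b[^>]*>")
# Only the empty, self-closing form: <redirect /> or <redirect/>.
EMPTY_REDIRECT_RE = re.compile(r"<redirect\s*/>")


def count_redirect_tags(lines):
    """Count redirect tags in an iterable of lines, split into 'empty' and 'other'."""
    counts = Counter(empty=0, other=0)
    for line in lines:
        for match in ANY_REDIRECT_RE.finditer(line):
            kind = "empty" if EMPTY_REDIRECT_RE.fullmatch(match.group(0)) else "other"
            counts[kind] += 1
    return counts
```

For example, `count_redirect_tags(open(path, encoding="utf-8"))` streams the dump once; if 'other' stays at zero, every tag is the empty form and a plain strip of the tag loses no information.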
I have used xml2sql, mwdumper, import.php and the Python script to import. The two fastest are xml2sql and the Python script (xray). The best results are from importDump.php; mwdumper is slow, but it gives good results.
I have not done any import with the new <redirect> tag.
bilal
Hi!
I have tried xml2sql and importDump.php. The same error.
Best regards, Andrew.