Hello,
I have used Special:Export at en.wikipedia to export "Diabetes_mellitus" and ticked the box "include templates" (I'm only really after the templates).
The resulting XML file is 40.1 MB, so I decided to go with mwdumper.jar rather than Special:Import.
I'm working on a fresh build of MediaWiki on my local system. When running the command:
java -jar mwdumper.jar --format=sql:1.5 Wikipedia-20090113203939.xml | mysql -u root -p wiki
It is returning the following error:
1 pages (0.102/sec), 1,000 revs (102.062/sec)
ERROR 1062 (23000) at line 99: Duplicate entry '45970' for key 1
Exception in thread "main" java.io.IOException: XML document structures must start and end within the same entity.
    at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
    at org.mediawiki.dumper.Dumper.main(Unknown Source)
Caused by: org.xml.sax.SAXParseException: XML document structures must start and end within the same entity.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.endEntity(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentScannerImpl.endEntity(Unknown Source)
    at org.apache.xerces.impl.XMLEntityManager.endEntity(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:176)
    ... 2 more
Can anyone please advise? After some googling the only advice I managed to find was:
"Before you start, try clearing the tables that mwdumper works in:
DELETE FROM page; DELETE FROM revision; DELETE FROM text; "
I have done this and tried again, but the same error continues.
Many thanks, Dawson
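The SAXParseException in the trace above ("XML document structures must start and end within the same entity") is the parser's way of saying the file ended before every element was closed. One way to check that independently of mwdumper is sketched below in Python. The function name `first_parse_error` and the in-memory sample documents are made up for illustration; for a real check you would pass the path of the exported file instead.

```python
import io
import xml.etree.ElementTree as ET


def first_parse_error(source):
    """Return None if `source` parses to the end, else a short error message.

    `source` may be a file path or a binary file-like object.
    """
    try:
        # iterparse streams the document, so a large export is never
        # held in memory all at once.
        for _event, elem in ET.iterparse(source, events=("end",)):
            elem.clear()  # drop the content of each element once it is closed
    except ET.ParseError as err:
        line, column = err.position
        return f"not well-formed near line {line}, column {column}: {err}"
    return None


if __name__ == "__main__":
    # Hypothetical stand-ins for a complete and a cut-off export.
    complete = b"<mediawiki><page><title>A</title></page></mediawiki>"
    truncated = b"<mediawiki><page><title>A</title>"
    print(first_parse_error(io.BytesIO(complete)))
    print(first_parse_error(io.BytesIO(truncated)))
```

If this reports an error near the very end of the file, the export itself is probably incomplete, and no importer will read it cleanly.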
Dawson wrote:
1 pages (0.102/sec), 1,000 revs (102.062/sec) ERROR 1062 (23000) at line 99: Duplicate entry '45970' for key 1
This happens when the XML dump contains the same page twice (or was it the same revision, even?). Which shouldn't happen. And if it happens, mwdumper shouldn't crash and burn.
I don't know a good way around this, really, sorry. The question is: *why* does the dump include the same page twice? Is that legal in terms of the dump format? If yes, why can't mwdumper cope with it?
-- daniel
I figured I would go into the XML and manually remove the offending duplicate page/revision, but couldn't find it.
I have gone from top to bottom of the XML file and find no template information, even though "include templates" was ticked.
I know it's a lot to ask, but could you take a quick look, Daniel? http://dawson.md/Wikipedia-20090113203939.xml.zip (XML, 1.9 MB)
Basically, I'm working on a wiki project that stores information about diseases and I just want to use Wikipedia's Template:Infobox_Disease. I tried to download it manually, along with all the associated and transcluded templates, but this was just too complicated and would have taken forever. Someone on the list suggested I use Special:Export and tick the "include templates" box. This is where I'm now up to.
All suggestions/help welcomed.
Thank you, Dawson
On 15 Jan 2009, at 12:22, Daniel Kinzler wrote:
This happens when the XML dump contains the same page twice (or was it the same revision, even?). Which shouldn't happen. And if it happens, mwdumper shouldn't crash and burn.
I don't know a good way around this, really, sorry. The question is: *why* does the dump include the same page twice? Is that legal in terms of the dump format? If yes, why can't mwdumper cope with it?
-- daniel
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Dawson wrote:
I figured I would go into the XML and manually remove the offending duplicate page/revision, but couldn't find it.
I have gone from top to bottom of the XML file and find no template information, even though "include templates" was ticked.
How about searching for "45970", which is the duplicate ID mwdumper complained about?
Roan Kattouw (Catrope)
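A plain text search for "45970" also matches user ids and article text, so it can help to count ids by element instead. Below is a minimal Python sketch of that idea. It assumes the export nests `<id>` directly under `<page>` and `<revision>` elements; the helper names and the sample document are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Reduce a namespaced tag such as '{http://...}page' to 'page'."""
    return tag.rsplit("}", 1)[-1]


def duplicate_ids(source):
    """Return (duplicate page ids, duplicate revision ids) as sorted lists."""
    page_ids, rev_ids = Counter(), Counter()
    path = []  # local names of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(local_name(elem.tag))
            continue
        # On "end" the element is complete, so its text is available.
        name = path.pop()
        parent = path[-1] if path else None
        if name == "id":
            value = (elem.text or "").strip()
            if parent == "page":
                page_ids[value] += 1
            elif parent == "revision":
                rev_ids[value] += 1
            # <id> under <contributor> is a user id and is ignored.
        elif name in ("revision", "page"):
            elem.clear()  # free memory for finished revisions and pages
    dup_pages = sorted(i for i, n in page_ids.items() if n > 1)
    dup_revs = sorted(i for i, n in rev_ids.items() if n > 1)
    return dup_pages, dup_revs


# Hypothetical sample: two pages that both carry revision id 45970.
SAMPLE = b"""<mediawiki>
  <page><title>A</title><id>1</id>
    <revision><id>45970</id>
      <contributor><username>U</username><id>307</id></contributor>
    </revision>
  </page>
  <page><title>B</title><id>2</id>
    <revision><id>45970</id></revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    print(duplicate_ids(io.BytesIO(SAMPLE)))
```

If both lists come back empty for the real export, the clashing id is not duplicated inside the file itself.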
Hello Roan,
I did try this but it only occurs once:
<revision>
  <id>45970</id>
  <timestamp>2002-03-17T04:46:17Z</timestamp>
  <contributor>
    <username>Redmist</username>
    <id>307</id>
  </contributor>
  <minor/>
  <comment>*</comment>
  <text xml:space="preserve">See [[Diabetes]].</text>
</revision>
Feel free to check out http://dawson.md/Wikipedia-20090113203939.xml.zip (XML, 1.9 MB) and see my last reply.
Thanks, Dawson
On 15 Jan 2009, at 12:49, Roan Kattouw wrote:
How about searching for "45970", which is the duplicate ID mwdumper complained about?
Roan Kattouw (Catrope)
Thinking that perhaps it's the revisions causing the problem, I have returned to Special:Export for "Diabetes_mellitus" and this time ticked:
 * Include only the current revision, not the full history
 * Include templates
 * Save as file
The output file is dramatically smaller, 280 KB (because the revision history is no longer included); however, I'm still getting a similar error:
"3 pages (127.413/sec), 33 revs (127.413/sec)
ERROR 1062 (23000) at line 31: Duplicate entry '264148315' for key 1"
Dawson
Solution:
With the file now being only 280 KB, I can use Special:Import instead of mwdumper.jar, which works as expected:
" * All revisions were previously imported.
Import finished! "
So this was a problem with mwdumper *shrug*, oh well.
Thanks for all your help, Dawson