Hi All,
I have been trying to load one of the latest versions of the XML dumps, pages-articles.xml.bz2 from http://download.wikimedia.org/enwiki/20090604/. I don't want the front end and the other things that come with a MediaWiki installation, so I thought I would just create the database and import the dump.
I tried using mwdumper, but it breaks with an error. After searching a bit, I found a related bug filed on that issue: https://bugzilla.wikimedia.org/show_bug.cgi?id=18328. I made the changes suggested in that thread, but I couldn't build the source, as I couldn't get all the dependent libraries working on my machine.
I also tried using mwimport, and that failed due to the same problem.
Does anyone have suggestions for importing the XML dump successfully into a MySQL database?
Thanks, Srini
srini@ISchool.Berkeley.EDU wrote:
Hi All,
I have been trying to load one of the latest versions of the XML dumps, pages-articles.xml.bz2 from http://download.wikimedia.org/enwiki/20090604/. I don't want the front end and the other things that come with a MediaWiki installation, so I thought I would just create the database and import the dump.
What exactly is it that you don't want? I don't see what the unneeded bloat of a MediaWiki install would be. The created main page? The user account?
Hi,
Thanks for responding. Let me try to be a bit clearer.
I am primarily interested in extracting which image is linked from the infobox of an article (if the article page has an infobox). Initially I thought of parsing the XML for this information, but then after looking around a bit, I felt it might be easier and faster to get the Wikipedia data loaded into a database, so that I can play around with the data a lot more.
I am working on my lab machine, where some web applications are already running. Since the MediaWiki installation instructions mention that I may need to change some PHP settings, I was a little wary of that. Also, I don't have root access to the lab machines, but I can ask my lab admin to do things for me when I need something.
My understanding is that I would have to import the data even if I installed MediaWiki, and that the front end is primarily for those who want to view the data in wiki form, so I decided to go with just the database. I didn't use importDump.php, as http://meta.wikimedia.org/wiki/Data_dumps says it is very slow and not advisable for large dumps. I wouldn't mind installing MediaWiki if that would help me import the data more easily.
I created the database using the database layout in http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sq...
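Roughly, that step was just creating an empty database and sourcing that schema file into it, something along these lines (database name "wikipedia" and the root login as in the mwdumper command below):

$ mysql -u root -e "CREATE DATABASE wikipedia"
$ mysql -u root wikipedia < tables.sql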
This time I downloaded a different version of the pages-articles.xml.bz2 dump from http://download.wikimedia.org/enwiki/20090618/ and tried importing using mwdumper.jar.
$ java -jar ../../lib/mwdumper.jar --format=sql:1.5 enwiki-20090618-pages-articles.xml | mysql -f -u root --default-character-set=utf-8 wikipedia
When I issued the above command, the import process crashed after a while with the following error message:
1,427,000 pages (705.771/sec), 1,427,000 revs (705.771/sec)
1,428,000 pages (705.879/sec), 1,428,000 revs (705.879/sec)
Exception in thread "main" java.lang.IllegalArgumentException: Invalid contributor
        at org.mediawiki.importer.XmlDumpReader.closeContributor(Unknown Source)
        at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
        at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)
ERROR 1064 (42000) at line 16355: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '''''[[Rutherfordium]]''' ('''Rf''') has no stable isotopes. A standa' at line 1
I also tried the same with mwimport.pl; it crashed with a similar "invalid contributor" error.
Any help or suggestions for getting a successful import would be much appreciated!
Sorry for the long mail ...
Thanks, Srini
srini@ISchool.Berkeley.EDU wrote:
Hi,
Thanks for responding. Let me try to be a bit clearer.
I am primarily interested in extracting which image is linked from the infobox of an article (if the article page has an infobox). Initially I thought of parsing the XML for this information, but then after looking around a bit, I felt it might be easier and faster to get the Wikipedia data loaded into a database, so that I can play around with the data a lot more.
I am working on my lab machine, where some web applications are already running. Since the MediaWiki installation instructions mention that I may need to change some PHP settings, I was a little wary of that. Also, I don't have root access to the lab machines, but I can ask my lab admin to do things for me when I need something.
You don't need to change PHP settings. Unless you have a really esoteric PHP configuration, MediaWiki will work fine.
My understanding is that I would have to import the data even if I installed MediaWiki, and that the front end is primarily for those who want to view the data in wiki form, so I decided to go with just the database. I didn't use importDump.php, as http://meta.wikimedia.org/wiki/Data_dumps says it is very slow and not advisable for large dumps. I wouldn't mind installing MediaWiki if that would help me import the data more easily.
If you just want to manually parse the wikitext of the articles, don't import into a database; feed your program directly from the XML. It will be way faster. On the other hand, if you want MediaWiki to do something with the data, you'll need a MediaWiki install.
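Something along these lines is usually enough (a rough Python sketch; the namespace stripping follows the export format, and the '{{Infobox' check and the "image =" regex are only guesses at what you need, since infobox parameter names vary a lot):

import re
import xml.etree.ElementTree as etree

# Very naive: grab the first "| image = ..." parameter in the page text.
# Real infoboxes use several different parameter names, so adjust as needed.
INFOBOX_IMAGE = re.compile(r'\|\s*image\s*=\s*(.+)')

def local_name(elem):
    # Strip the export namespace, e.g. "{http://www.mediawiki.org/xml/export-0.3/}title"
    return elem.tag.rsplit('}', 1)[-1]

def pages(path):
    # Stream (title, wikitext) pairs without building the whole tree in memory.
    context = etree.iterparse(path, events=('start', 'end'))
    _, root = next(context)
    title = None
    for event, elem in context:
        if event != 'end':
            continue
        name = local_name(elem)
        if name == 'title':
            title = elem.text
        elif name == 'text':
            yield title, elem.text or ''
        elif name == 'page':
            root.clear()  # drop finished pages so memory stays bounded

for title, text in pages('enwiki-20090618-pages-articles.xml'):
    if '{{Infobox' in text:
        m = INFOBOX_IMAGE.search(text)
        if m:
            print('%s -> %s' % (title, m.group(1).strip()))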
I created the database using the database layout in http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sq...
This time I downloaded a different version of the pages-articles.xml.bz2 dump from http://download.wikimedia.org/enwiki/20090618/ and tried importing using mwdumper.jar.
$ java -jar ../../lib/mwdumper.jar --format=sql:1.5 enwiki-20090618-pages-articles.xml | mysql -f -u root --default-character-set=utf-8 wikipedia
When I issued the above command, the import process crashed after a while with the following error message:
1,427,000 pages (705.771/sec), 1,427,000 revs (705.771/sec)
1,428,000 pages (705.879/sec), 1,428,000 revs (705.879/sec)
Exception in thread "main" java.lang.IllegalArgumentException: Invalid contributor
I also tried the same with mwimport.pl; it crashed with a similar "invalid contributor" error.
You're right, it's bug 18328: they don't support rev_deleted.
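In those dumps, a revision whose contributor has been hidden comes through with an attribute-only contributor element, roughly like this, and that empty element (no username, no id) is what makes closeContributor throw:

<revision>
  <id>...</id>
  <timestamp>...</timestamp>
  <contributor deleted="deleted" />
  ...
</revision>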
srini@ISchool.Berkeley.EDU wrote:
I am primarily interested in extracting which image is linked from the infobox of an article (if the article page has an infobox). Initially I thought of parsing the XML for this
http://meta.wikimedia.org/wiki/User:LA2/Extraktor
but then after looking around a bit, I felt it might be easier and faster to get the Wikipedia data loaded into a database.
Probably not.