srini(a)ISchool.Berkeley.EDU wrote:
Hi,
Thanks for responding. Let me try to be a little clearer.
I am primarily interested in extracting which image is linked from the
infobox of an article (if the article page has an infobox).
Initially I thought of parsing the XML for this info, but after
looking around a bit, I felt it might be easier and faster to load the
Wikipedia data into a database, so that I can play around with the
data a lot more.
I am working on my lab machine, where some web applications are already
running. Since the MediaWiki installation instructions mentioned that I
need to change some PHP settings, I was a little wary about it. Also, I
don't have root access to the lab machines, but I can ask my lab admin to
do things for me when I need something.
You don't need to change PHP settings. Unless you have a really esoteric
PHP config, MediaWiki will work fine.
My understanding is that I would have to import the data even if I
installed MediaWiki, and that MediaWiki is primarily for those who want to
view the data in wiki format. So I decided to go with only the database. I
didn't use importDump.php, as
http://meta.wikimedia.org/wiki/Data_dumps describes it as very slow and not
advisable for large dumps. I wouldn't mind installing MediaWiki if that
would help me import the data easily.
If you just want to manually parse the wikitext of the articles, don't
import into a DB. Feed your program directly from the XML. It will be
way faster.
On the other hand, if you want MediaWiki to do something with it, you'll
need a MediaWiki install.
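
Feeding a program directly from the XML could look roughly like the sketch
below. It is only an illustration under some assumptions: that the infobox
starts with "{{Infobox", that its parameters sit one per line, and that the
picture is given as "image = ...". Real infoboxes use other parameter names
too, so the pattern would need extending. The function names are made up
for this example.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Assumption: the picture is given as "| image = ..." on its own line.
# Other parameter names (image_name, img, ...) are not handled here.
IMAGE_PARAM = re.compile(
    r"^[ \t]*\|[ \t]*image[ \t]*=[ \t]*(\S[^\n]*?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def local_name(tag):
    """Drop the '{namespace}' prefix that the export schema puts on tags."""
    return tag.rsplit("}", 1)[-1]


def infobox_image(wikitext):
    """Return the raw value of the first image parameter found after the
    start of an infobox, or None. It does not check where the infobox ends."""
    start = wikitext.lower().find("{{infobox")
    if start == -1:
        return None
    match = IMAGE_PARAM.search(wikitext, start)
    return match.group(1) if match else None


def iter_infobox_images(stream):
    """Yield (title, image) pairs from a MediaWiki XML export stream."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            image = infobox_image(elem.text or "")
            if image:
                yield title, image
        elif name == "page":
            # Drop the finished page's contents to keep memory use low.
            elem.clear()


def scan_dump(path):
    """Read a .xml.bz2 dump without unpacking it and print tab-separated
    title/image pairs."""
    with bz2.open(path, "rb") as dump:
        for title, image in iter_infobox_images(dump):
            print(title, image, sep="\t")
```

Calling scan_dump("enwiki-20090618-pages-articles.xml.bz2") would then
stream through the compressed dump page by page, with no database involved.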
I created the database using the database layout in
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.s…
This time I downloaded a different version of the pages-articles.xml.bz2
dump from http://download.wikimedia.org/enwiki/20090618/ and tried
importing it using mwdumper.jar.
$ java -jar ../../lib/mwdumper.jar --format=sql:1.5 \
    enwiki-20090618-pages-articles.xml | mysql -f -u root \
    --default-character-set=utf-8 wikipedia
When I issued the above command, the import crashed after a while with
the following error message:
1,427,000 pages (705.771/sec), 1,427,000 revs (705.771/sec)
1,428,000 pages (705.879/sec), 1,428,000 revs (705.879/sec)
Exception in thread "main" java.lang.IllegalArgumentException: Invalid contributor
I also tried the same with mwimport.pl; it crashed with a similar error
saying "invalid contributor".
You're right. It's bug 18328. They don't support rev_deleted.