Hi All,
Are other people having grief importing the new XML format database-dumps?
Today I tried three different methods of importing the EN 20051009_pages_articles.xml.bz2 dump, and not one of them worked properly.
Incidentally, I have verified that the md5sum of the dump is correct, so as to eliminate downloading problems:

ludo:/home/nickj/wikipedia# md5sum 20051009_pages_articles.xml.bz2
4d18ffa1550196f3a6a0abc9ebbd7d06  20051009_pages_articles.xml.bz2
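In case it's useful to anyone scripting the same check, here's a minimal Python sketch of chunked md5 verification, so the multi-gigabyte dump never has to fit in memory (the path and expected hash in the usage comment are just the ones from my run above):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MB chunks
    rather than loading the whole (possibly huge) file at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (path assumed to be local):
# assert md5_of_file("20051009_pages_articles.xml.bz2") == \
#     "4d18ffa1550196f3a6a0abc9ebbd7d06"
```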
------------------------------------------------------------------------------------
Method 1: Importing using ImportDump from MediaWiki 1.5.0 running on PHP 4.1.2
I knew this one might have problems, due to the age of that version of PHP.
However, this one got the furthest of all the methods. It ran for 6 hours and 24 minutes, and imported around 60 percent of the articles.
Something (probably PHP) has a memory leak, though: eventually Linux 2.6.8's Out-of-Memory killer kicked in and killed processes until it got to the script in question. The machine has 448 MB of RAM, so it took a while for the leak to consume all of it.
Command line was: bzip2 -dc /home/nickj/wikipedia/20051009_pages_articles.xml.bz2 | php maintenance/importDump.php
But from the overnight system log we have:

Oct 21 03:05:01 ludo kernel: Out of Memory: Killed process 816 (apache).
Oct 21 03:13:04 ludo kernel: Out of Memory: Killed process 817 (apache).
Oct 21 03:20:41 ludo kernel: Out of Memory: Killed process 7677 (apache).
Oct 21 03:23:30 ludo kernel: Out of Memory: Killed process 946 (apache).
Oct 21 03:26:57 ludo kernel: Out of Memory: Killed process 7696 (apache).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 573 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 575 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 576 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 577 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 3111 (mysqld).
Oct 21 06:29:24 ludo kernel: Out of Memory: Killed process 7697 (apache).
Oct 21 06:29:25 ludo kernel: Out of Memory: Killed process 7699 (apache).
Oct 21 06:29:25 ludo kernel: Out of Memory: Killed process 3110 (php).
At that point importing stopped.
------------------------------------------------------------------------------------
Method 2: Importing using ImportDump from MediaWiki 1.5.0 using a fresh PHP 4.4 STABLE CVS snapshot build (From really old, to really new).
I thought this one would work, but it didn't:

ludo:/var/www/hosts/local-wikipedia/wiki# bzip2 -dc /home/nickj/wikipedia/20051009_pages_articles.xml.bz2 | ~root/tmp/php-5.1-dev/php4-STABLE-200510201252/sapi/cli/php maintenance/importDump.php
100 (22.802267296596 pages/sec 22.802267296596 revs/sec)
200 (20.961060430845 pages/sec 20.961060430845 revs/sec)
300 (20.006219254115 pages/sec 20.006219254115 revs/sec)
[...snip lots of progress lines...]
64000 (41.86646431353 pages/sec 41.86646431353 revs/sec)
64100 (41.87977053847 pages/sec 41.87977053847 revs/sec)
64200 (41.891992792767 pages/sec 41.891992792767 revs/sec)
64300 (41.902506473828 pages/sec 41.902506473828 revs/sec)
64400 (41.920741784615 pages/sec 41.920741784615 revs/sec)
64500 (41.937710744276 pages/sec 41.937710744276 revs/sec)
64600 (41.945053966443 pages/sec 41.945053966443 revs/sec)
64700 (41.95428629711 pages/sec 41.95428629711 revs/sec)
PHP Fatal error: Call to a member function on a non-object in /var/www/hosts/local-wikipedia/wiki/includes/Article.php on line 934
ludo:/var/www/hosts/local-wikipedia/wiki#
I.e. it dies after 13 minutes, at around 4% of the articles.
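A possible workaround I've been considering (just a sketch of mine, not a MediaWiki feature) would be to filter the stream so that the pages already imported are skipped before resuming, rather than restarting from scratch after a crash. This assumes, as the dumps seem to, that the <page> and </page> tags each sit on a line of their own:

```python
def skip_pages(lines, n_skip):
    """Yield an XML pages stream with the first n_skip <page> elements
    removed. Assumes <page> and </page> appear on lines of their own;
    the header, footer, and remaining pages pass through unchanged."""
    skipped = 0
    in_skipped_page = False
    for line in lines:
        stripped = line.strip()
        if stripped == "<page>" and skipped < n_skip:
            in_skipped_page = True
        if not in_skipped_page:
            yield line
        if stripped == "</page>" and in_skipped_page:
            in_skipped_page = False
            skipped += 1
```

One could then pipe e.g. `bzip2 -dc dump.xml.bz2 | python skip_pages.py 64700 | php maintenance/importDump.php`, where skip_pages.py is a hypothetical wrapper that reads stdin, calls the function above, and writes stdout.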
------------------------------------------------------------------------------------
Method 3: Using the latest mwdumper (from http://download.wikimedia.org/tools/ ), plus the latest and greatest stable JRE (1.5.0_05), and converting into 1.4 format, then importing that into MySQL:
/usr/java/jre1.5.0_05/bin/java -server -jar mwdumper.jar --format=sql:1.4 20051009_pages_articles.xml.bz2 | mysql enwiki
This ran without any errors, and looked really promising.
However, before this there were some 1.5 million articles in the database (from a June SQL dump, which was the last Wikipedia dump I was able to import properly):
mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
|  1535910 |
+----------+
1 row in set (0.00 sec)
# Then I cleared the table:
mysql> delete from cur;
Query OK, 0 rows affected (4.11 sec)
# Then the above mwdumper command ran for 53 minutes before finishing, which seemed way too quick. Checking how many articles had been imported showed there was something wrong:
mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
|    29166 |
+----------+
1 row in set (0.00 sec)
I.e. less than 2% of the articles got imported.
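To tell whether mwdumper is silently dropping pages, it would help to know how many <page> elements the dump actually contains, independently of the database. Here's a streaming count sketched in Python (the dump path in the usage comment is just an assumption about where you keep the file); it never builds the whole tree in memory:

```python
import bz2
import xml.etree.ElementTree as ET

def count_pages(fileobj):
    """Stream-parse an XML dump and count <page> elements,
    ignoring any XML namespace prefix on the tag names."""
    count = 0
    for _, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] == "page":
            count += 1
        elem.clear()  # discard element contents as we go to bound memory
    return count

# Usage against the real dump (path assumed):
# with bz2.open("20051009_pages_articles.xml.bz2", "rb") as f:
#     print(count_pages(f))
```

Comparing that number against `select count(*) from cur` would show exactly how much mwdumper dropped.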
------------------------------------------------------------------------------------
So, my question to the list is this:
What methods have you tried for importing the XML dumps? In particular, what have you tried that actually _worked_? (And by "working", I mean it runs without a memory leak, doesn't die with an error message, and imports all of the articles into the database.)
All the best, Nick.