Hi All,
Are other people having grief importing the new XML format database-dumps?
Today, I've just tried 3 different methods of importing the EN 20051009_pages_articles.xml.bz2 dump, and not one of them seems to work properly.
Incidentally, I have verified that the md5sum of the dump is correct, so as to eliminate downloading problems:

ludo:/home/nickj/wikipedia# md5sum 20051009_pages_articles.xml.bz2
4d18ffa1550196f3a6a0abc9ebbd7d06  20051009_pages_articles.xml.bz2
------------------------------------------------------------------------------------
Method 1: Importing using ImportDump from MediaWiki 1.5.0 running on PHP 4.1.2
I knew this one might have problems, due to the age of the PHP version.
However, this one got the furthest of all the methods. It ran for 6 hours and 24 minutes, and imported around 60 percent of the articles.
Something (probably PHP) has a memory leak however, as it resulted in Linux 2.6.8's Out-of-Memory killer kicking in and eventually killing the script in question. The machine has 448 MB of RAM, so it took a while for the leak to consume all the memory.
Command line was:

bzip2 -dc /home/nickj/wikipedia/20051009_pages_articles.xml.bz2 | php maintenance/importDump.php
But from the overnight system log we have:

Oct 21 03:05:01 ludo kernel: Out of Memory: Killed process 816 (apache).
Oct 21 03:13:04 ludo kernel: Out of Memory: Killed process 817 (apache).
Oct 21 03:20:41 ludo kernel: Out of Memory: Killed process 7677 (apache).
Oct 21 03:23:30 ludo kernel: Out of Memory: Killed process 946 (apache).
Oct 21 03:26:57 ludo kernel: Out of Memory: Killed process 7696 (apache).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 573 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 575 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 576 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 577 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 3111 (mysqld).
Oct 21 06:29:24 ludo kernel: Out of Memory: Killed process 7697 (apache).
Oct 21 06:29:25 ludo kernel: Out of Memory: Killed process 7699 (apache).
Oct 21 06:29:25 ludo kernel: Out of Memory: Killed process 3110 (php).
At that point importing stopped.
------------------------------------------------------------------------------------
Method 2: Importing using ImportDump from MediaWiki 1.5.0, with a fresh PHP 4.4 STABLE CVS snapshot build (from really old to really new).
This one I thought would work, but it didn't:
ludo:/var/www/hosts/local-wikipedia/wiki# bzip2 -dc /home/nickj/wikipedia/20051009_pages_articles.xml.bz2 | ~root/tmp/php-5.1-dev/php4-STABLE-200510201252/sapi/cli/php maintenance/importDump.php
100 (22.802267296596 pages/sec 22.802267296596 revs/sec)
200 (20.961060430845 pages/sec 20.961060430845 revs/sec)
300 (20.006219254115 pages/sec 20.006219254115 revs/sec)
[...snip lots of progress lines...]
64000 (41.86646431353 pages/sec 41.86646431353 revs/sec)
64100 (41.87977053847 pages/sec 41.87977053847 revs/sec)
64200 (41.891992792767 pages/sec 41.891992792767 revs/sec)
64300 (41.902506473828 pages/sec 41.902506473828 revs/sec)
64400 (41.920741784615 pages/sec 41.920741784615 revs/sec)
64500 (41.937710744276 pages/sec 41.937710744276 revs/sec)
64600 (41.945053966443 pages/sec 41.945053966443 revs/sec)
64700 (41.95428629711 pages/sec 41.95428629711 revs/sec)
PHP Fatal error:  Call to a member function on a non-object in /var/www/hosts/local-wikipedia/wiki/includes/Article.php on line 934
ludo:/var/www/hosts/local-wikipedia/wiki#
I.e. it dies after 13 minutes, at around 4% of the articles.
------------------------------------------------------------------------------------
Method 3: Using the latest mwdumper (from http://download.wikimedia.org/tools/ ), plus the latest and greatest stable JRE (1.5.0_05), and converting into 1.4 format, then importing that into MySQL:
/usr/java/jre1.5.0_05/bin/java -server -jar mwdumper.jar --format=sql:1.4 20051009_pages_articles.xml.bz2 | mysql enwiki
This ran without any errors, and looked really promising.
However, before this there were some 1.5 million articles (from a June SQL dump, which was the last Wikipedia dump I'd been able to import properly):
mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
|  1535910 |
+----------+
1 row in set (0.00 sec)
# Then I cleared the table:
mysql> delete from cur;
Query OK, 0 rows affected (4.11 sec)
# Then the above mwdumper command ran for 53 minutes before finishing, which seemed way too quick. Checking how many articles had been imported showed there was something wrong:
mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
|    29166 |
+----------+
1 row in set (0.00 sec)
I.e. less than 2% of the articles got imported.
------------------------------------------------------------------------------------
So, my question to the list is this:
What methods have you tried for importing the XML dumps? In particular, what have you tried that actually _worked_? (And by "working", I mean it runs without a memory leak, doesn't die with an error message, and imports all of the articles into the database.)
All the best, Nick.
Nick Jenkins wrote:
Hi All,
Are other people having grief importing the new XML format database-dumps?
Today, I've just tried 3 different methods of importing the EN 20051009_pages_articles.xml.bz2 dump, and not one of them seems to work properly.
Please check with the 20051020 dump and mwdumper. If it stops, please check what the error was (eg, key conflicts) and report that.
-- brion vibber (brion @ pobox.com)
| -----Original Message-----
| From: ... Nick Jenkins
| Sent: Friday, October 21, 2005 8:37 AM
|
| Are other people having grief importing the new XML format database-dumps?
Well, I'm doing a little testing :-) with the Polish wiki dumps on Windows XP Pro. Things go much better now with MW 1.5.0. With importDump.php I loaded 228,000 pages in 19 hours! Of course that is horribly slow (in the past I loaded a cur table .sql dump in 20 minutes), but the importDump script now does all the work without interruptions. Then I tried mwdumper with JRE 1.5.0_5 on the meta dump, and although this program is still under construction I had no problems. The conversion to 1.5 SQL finished in about 1 minute! It's very, very promising, really. That's all for now, as I don't have time to continue the tests at the moment. But my feelings are much better than they were in September :-)
All the best, Janusz 'Ency' Dorozynski
Please check with the 20051020 dump and mwdumper.
Problem reproduced on 20051020_pages_articles.xml.bz2
Command line was: /usr/java/jre1.5.0_05/bin/java -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2 | mysql enwiki
Nature of the problem: At first the number of articles goes up from 0, but then it stalls at 29155. Any articles beyond 29155 do not seem to get imported:
======================================================
mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
|    29155 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
|    29155 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
|    29155 |
+----------+
1 row in set (0.00 sec)
======================================================
If it stops, please check what the error was (eg, key conflicts) and report that.
Console output is as follows:
======================================================
19,000 pages (110.119/sec), 19,000 revs (110.119/sec)
20,000 pages (110.596/sec), 20,000 revs (110.596/sec)
21,000 pages (110.747/sec), 21,000 revs (110.747/sec)
22,000 pages (110.813/sec), 22,000 revs (110.813/sec)
23,000 pages (110.643/sec), 23,000 revs (110.643/sec)
24,000 pages (111.562/sec), 24,000 revs (111.562/sec)
25,000 pages (111.3/sec), 25,000 revs (111.3/sec)
26,000 pages (111.093/sec), 26,000 revs (111.093/sec)
27,000 pages (111.733/sec), 27,000 revs (111.733/sec)
28,000 pages (111.115/sec), 28,000 revs (111.115/sec)
29,000 pages (113.361/sec), 29,000 revs (113.361/sec)
ERROR 1062 at line 459: Duplicate entry '0-1_E0_m?' for key 1
30,000 pages (115.039/sec), 30,000 revs (115.039/sec)
31,000 pages (116.308/sec), 31,000 revs (116.308/sec)
32,000 pages (117.91/sec), 32,000 revs (117.91/sec)
33,000 pages (119.448/sec), 33,000 revs (119.448/sec)
34,000 pages (121.732/sec), 34,000 revs (121.732/sec)
35,000 pages (122.996/sec), 35,000 revs (122.996/sec)
36,000 pages (124.101/sec), 36,000 revs (124.101/sec)
37,000 pages (125.216/sec), 37,000 revs (125.216/sec)
38,000 pages (126.43/sec), 38,000 revs (126.43/sec)
39,000 pages (127.583/sec), 39,000 revs (127.583/sec)
Mon Oct 24 19:41:15 EST 2005
ludo:/home/nickj/wikipedia# screendump 1 > screen1
======================================================
I did not spot any other errors, and there does not seem to be a log file or equivalent.
All the best, Nick.
Console output is as follows:
[...same console output as quoted above...]
I did not spot any other errors, and there does not seem to be a log file or equivalent.
Just to clarify - I pressed CTRL-C at around 39000 articles as there did not seem to be any point continuing (i.e. it didn't die or anything, rather I stopped it at that point because the count of articles had stopped increasing).
All the best, Nick.
Nick Jenkins wrote:
29,000 pages (113.361/sec), 29,000 revs (113.361/sec)
ERROR 1062 at line 459: Duplicate entry '0-1_E0_m?' for key 1
Looking in the database now, I see three pages with similar titles in that range:
id     ns  title
35982  0   1_E0_m
36017  0   1_E0_m²
36019  0   1_E0_m³
None of them should conflict, being quite distinct, which makes me suspect garbled input or output, or a garbled index configuration on MySQL.
I'm downloading the dump to test with locally now; could you provide some details of:
* MySQL version and configuration (charset, etc)
* system locale settings
* anything else interesting
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Nick Jenkins wrote:
29,000 pages (113.361/sec), 29,000 revs (113.361/sec)
ERROR 1062 at line 459: Duplicate entry '0-1_E0_m?' for key 1
Looking in the database now, I see three pages with similar titles in that range:
id     ns  title
35982  0   1_E0_m
36017  0   1_E0_m²
36019  0   1_E0_m³
None of them should conflict, being quite distinct, which makes me suspect garbled input or output, or a garbled index configuration on MySQL.
I can confirm that I can import the first 50k pages or so of this dump without the reported problem occurring. I'll run the rest when it's done downloading.
* Ubuntu Linux (Breezy Badger, x86)
* en_US.UTF-8 locale
* MySQL 4.0.24
* table definitions from MediaWiki 1.4.11
* mwdumper current CVS (shouldn't be any different in this regard from the last uploaded snapshot)
* Sun J2SE 1.5.0_05-b05
On some quick testing it looks like there are some encoding problems if UTF-8 isn't the locale charset; I'll try and get those worked out.
In the meantime, try setting LANG=en_US.UTF-8 and rerunning it.
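For example, something like this should do it (a sketch only, assuming an en_US.UTF-8 locale is installed on the system; the command itself is the same one as before):

# Run mwdumper under a UTF-8 locale so the output encoding isn't mangled:
LANG=en_US.UTF-8 /usr/java/jre1.5.0_05/bin/java -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2 | mysql enwiki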
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
On some quick testing it looks like there are some encoding problems if UTF-8 isn't the locale charset; I'll try and get those worked out.
In the meantime, try setting LANG=en_US.UTF-8 and rerunning it.
Fixed version of mwdumper available: http://download.wikimedia.org/tools/
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
On some quick testing it looks like there are some encoding problems if UTF-8 isn't the locale charset; I'll try and get those worked out.
In the meantime, try setting LANG=en_US.UTF-8 and rerunning it.
Fixed version of mwdumper available: http://download.wikimedia.org/tools/
Thank you! The new version definitely makes a big difference, as it gets past 29,000 articles without any errors.
However, it then died after 40 minutes with this error message:
=============================================================
637,000 pages (272.057/sec), 637,000 revs (272.057/sec)
638,000 pages (272.21/sec), 638,000 revs (272.21/sec)
639,000 pages (272.254/sec), 639,000 revs (272.254/sec)
640,000 pages (272.402/sec), 640,000 revs (272.402/sec)
641,000 pages (272.203/sec), 641,000 revs (272.203/sec)
642,000 pages (272.332/sec), 642,000 revs (272.332/sec)
643,000 pages (272.476/sec), 643,000 revs (272.476/sec)
644,000 pages (272.514/sec), 644,000 revs (272.514/sec)
645,000 pages (272.676/sec), 645,000 revs (272.676/sec)
646,000 pages (272.746/sec), 646,000 revs (272.746/sec)
647,000 pages (272.891/sec), 647,000 revs (272.891/sec)
648,000 pages (272.927/sec), 648,000 revs (272.927/sec)
649,000 pages (273.067/sec), 649,000 revs (273.067/sec)
650,000 pages (273.11/sec), 650,000 revs (273.11/sec)
651,000 pages (273.274/sec), 651,000 revs (273.274/sec)
652,000 pages (273.416/sec), 652,000 revs (273.416/sec)
653,000 pages (273.401/sec), 653,000 revs (273.401/sec)
654,000 pages (273.614/sec), 654,000 revs (273.614/sec)
655,000 pages (273.716/sec), 655,000 revs (273.716/sec)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
ERROR 1064 at line 4426: You have an error in your SQL syntax near ''<ul><li>15:38, 20 Sep 2004 [[User:Docu|Docu]] deleted "Category:Liberal partie' at line 1
Tue Oct 25 10:44:45 EST 2005
ludo:/home/nickj/wikipedia# screendump 1 > screen1
=============================================================
(Note machine has 452324k of RAM, and 787144k of swap, and wasn't doing anything else at the time).
MySQL article count at this time was:
=============================================================
mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
|   655000 |
+----------+
1 row in set (0.00 sec)
=============================================================
As a workaround, I then tried changing the command line from:

/usr/java/jre1.5.0_05/bin/java -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2 | mysql enwiki

to:

/usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2 | mysql enwiki
(i.e. increased max allowed memory use to 200 Mb), then did a "delete from cur;", and then reran mwdumper.
With this, it went much further (to around 1933000 articles).
In case it helps with mwdumper, memory use during import (with the -Xmx200M arg) looks like this:
===============================================================================
ludo:/home/nickj/wikipedia# top -n1
top - 12:45:30 up 2:53, 3 users, load average: 4.48, 4.48, 4.19
Tasks:  62 total,   2 running,  60 sleeping,   0 stopped,   0 zombie
Cpu(s):  9.3% us,  3.0% sy,  0.0% ni,  0.0% id, 86.7% wa,  1.0% hi,  0.0% si
Mem:    452324k total,   449468k used,     2856k free,      476k buffers
Swap:   787144k total,       76k used,   787068k free,   270148k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1694 root      24   0  384m 142m  51m S  0.0 32.4  24:53.10 java
 1697 root      16   0  384m 142m  51m S  0.0 32.4   0:00.00 java
 1698 root      16   0  384m 142m  51m S  0.0 32.4   2:11.39 java
 1699 root      16   0  384m 142m  51m S  0.0 32.4   0:00.00 java
 1700 root      15   0  384m 142m  51m S  0.0 32.4   0:00.00 java
 1701 root      16   0  384m 142m  51m S  0.0 32.4   0:00.00 java
 1702 root      16   0  384m 142m  51m S  0.0 32.4   0:00.04 java
 1703 root      16   0  384m 142m  51m S  0.0 32.4   0:05.24 java
 1704 root      16   0  384m 142m  51m S  0.0 32.4   0:05.91 java
 1705 root      16   0  384m 142m  51m S  0.0 32.4   0:00.00 java
 1706 root      15   0  384m 142m  51m S  0.0 32.4   0:00.16 java
  573 mysql     16   0 27232  11m 5380 S  0.0  2.6   0:00.05 mysqld
  575 mysql     16   0 27232  11m 5380 S  0.0  2.6   0:00.00 mysqld
  576 mysql     16   0 27232  11m 5380 S  0.0  2.6   0:00.00 mysqld
[...snip irrelevant processes...]
===============================================================================
and:
===============================================================================
ludo:/home/nickj/wikipedia# ps auxwf
USER       PID %CPU %MEM    VSZ    RSS TTY  STAT START  TIME COMMAND
[...snip irrelevant processes...]
root       823  0.0  0.2   2240   1280 tty1 Ss   09:52  0:00 -bash
root      1692  0.0  0.2   2240   1280 tty1 S+   11:28  0:00  _ -bash
root      1694 31.6 33.1 393228 149888 tty1 S+   11:28 25:08      _ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1697  0.0 33.1 393228 149888 tty1 S+   11:28  0:00      |   _ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1698  2.7 33.1 393228 149888 tty1 S+   11:28  2:11      |   _ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1699  0.0 33.1 393228 149888 tty1 S+   11:28  0:00      |   _ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1700  0.0 33.1 393228 149888 tty1 S+   11:28  0:00      |   _ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1701  0.0 33.1 393228 149888 tty1 S+   11:28  0:00      |   _ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1702  0.0 33.1 393228 149888 tty1 S+   11:28  0:00      |   _ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1703  0.1 33.1 393228 149888 tty1 S+   11:28  0:05      |   _ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1704  0.1 33.1 393228 149888 tty1 S+   11:28  0:05      |   _ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1705  0.0 33.1 393228 149888 tty1 S+   11:28  0:00      |   _ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1706  0.0 33.1 393228 149888 tty1 S+   11:28  0:00      |   _ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1695  2.5  1.9  11184   8696 tty1 S+   11:28  2:02      _ mysql enwiki
===============================================================================
At around 1933000 articles it seemed to get stuck. I left it overnight (no change), then rebooted (for good measure), and then MySQL gave strange errors for cur (e.g. "ERROR 1016: Can't open file: 'cur.MYD'. (errno: 145)"), and refused to do anything with this table. Further investigation showed that the disk partition that MySQL was using was 100% full (Doh! My bad). I'm fairly confident that if there had been sufficient disk space, the mwdumper import would have succeeded.
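For next time, a quick free-space check on the partition holding MySQL's data directory before (and during) the import would catch this early. For example (the path here is only illustrative; it depends on where your MySQL datadir actually lives):

# Check free space on the partition holding MySQL's data files:
df -h /var/lib/mysql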
By the way, I noticed that in the TODO list in the README.txt, it has:
- Include table initialization in SQL output
This is a very good idea - i.e. for 1.4 output, emit a "CREATE TABLE IF NOT EXISTS cur (...);" before the insert statements. I'd also suggest a table cleanout option, which does "DELETE FROM cur;" for 1.4 (this would be placed right after the table creation in the output, if the option is invoked). The equivalents for 1.5 are, I guess, probably CREATE TABLE IF NOT EXISTS for both 'page' and 'text', plus "DELETE FROM text; DELETE FROM page;". A "--table-cleanout" or "--delete-current" or "--from-scratch" option would be very handy to automate this as part of the dump import process.
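For reference, the manual equivalent of that cleanout step today (which the proposed option would simply fold into the generated SQL, right after the table creation; the option names above are just suggestions, not existing mwdumper flags) is just something like:

# 1.4 schema:
mysql enwiki -e "DELETE FROM cur;"
# 1.5 schema:
mysql enwiki -e "DELETE FROM text; DELETE FROM page;"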
All the best, Nick.
Nick Jenkins wrote:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Hrmmm, memory usage should remain quite small throughout. It's possible there are bugs in the recent buffering patches; I'll check.
-- brion vibber (brion @ pobox.com)