Brion Vibber wrote:
On some quick testing it looks like there are some encoding problems if UTF-8 isn't the locale charset; I'll try and get those worked out.
In the meantime, try setting LANG=en_US.UTF-8 and rerunning it.
Fixed version of mwdumper available: http://download.wikimedia.org/tools/
Thank you! The new version definitely makes a big difference, as it gets past 29,000 articles without any errors.
However, it then died after 40 minutes with this error message:
=============================================================
637,000 pages (272.057/sec), 637,000 revs (272.057/sec)
638,000 pages (272.21/sec), 638,000 revs (272.21/sec)
639,000 pages (272.254/sec), 639,000 revs (272.254/sec)
640,000 pages (272.402/sec), 640,000 revs (272.402/sec)
641,000 pages (272.203/sec), 641,000 revs (272.203/sec)
642,000 pages (272.332/sec), 642,000 revs (272.332/sec)
643,000 pages (272.476/sec), 643,000 revs (272.476/sec)
644,000 pages (272.514/sec), 644,000 revs (272.514/sec)
645,000 pages (272.676/sec), 645,000 revs (272.676/sec)
646,000 pages (272.746/sec), 646,000 revs (272.746/sec)
647,000 pages (272.891/sec), 647,000 revs (272.891/sec)
648,000 pages (272.927/sec), 648,000 revs (272.927/sec)
649,000 pages (273.067/sec), 649,000 revs (273.067/sec)
650,000 pages (273.11/sec), 650,000 revs (273.11/sec)
651,000 pages (273.274/sec), 651,000 revs (273.274/sec)
652,000 pages (273.416/sec), 652,000 revs (273.416/sec)
653,000 pages (273.401/sec), 653,000 revs (273.401/sec)
654,000 pages (273.614/sec), 654,000 revs (273.614/sec)
655,000 pages (273.716/sec), 655,000 revs (273.716/sec)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
ERROR 1064 at line 4426: You have an error in your SQL syntax near ''<ul><li>15: 38, 20 Sep 2004 [[User:Docu|Docu]] deleted "Category:Liberal partie' at line 1
Tue Oct 25 10:44:45 EST 2005
ludo:/home/nickj/wikipedia# screendump 1 > screen1
=============================================================
(Note: the machine has 452324k of RAM and 787144k of swap, and wasn't doing anything else at the time.)
MySQL article count at this time was:
=============================================================
mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
|   655000 |
+----------+
1 row in set (0.00 sec)
=============================================================
As a workaround, I then tried changing the command line from:

/usr/java/jre1.5.0_05/bin/java -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2 | mysql enwiki

To:

/usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2 | mysql enwiki

(i.e. increased the maximum Java heap size to 200 MB), then did a "delete from cur;", and then reran mwdumper.
With this, it went much further (to around 1933000 articles).
In case it helps with mwdumper, memory use during import (with the -Xmx200M arg) looks like this:
===============================================================================
ludo:/home/nickj/wikipedia# top -n1
top - 12:45:30 up  2:53,  3 users,  load average: 4.48, 4.48, 4.19
Tasks:  62 total,   2 running,  60 sleeping,   0 stopped,   0 zombie
Cpu(s):  9.3% us,  3.0% sy,  0.0% ni,  0.0% id, 86.7% wa,  1.0% hi,  0.0% si
Mem:    452324k total,   449468k used,     2856k free,      476k buffers
Swap:   787144k total,       76k used,   787068k free,   270148k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1694 root      24   0  384m 142m  51m S  0.0 32.4  24:53.10 java
 1697 root      16   0  384m 142m  51m S  0.0 32.4   0:00.00 java
 1698 root      16   0  384m 142m  51m S  0.0 32.4   2:11.39 java
 1699 root      16   0  384m 142m  51m S  0.0 32.4   0:00.00 java
 1700 root      15   0  384m 142m  51m S  0.0 32.4   0:00.00 java
 1701 root      16   0  384m 142m  51m S  0.0 32.4   0:00.00 java
 1702 root      16   0  384m 142m  51m S  0.0 32.4   0:00.04 java
 1703 root      16   0  384m 142m  51m S  0.0 32.4   0:05.24 java
 1704 root      16   0  384m 142m  51m S  0.0 32.4   0:05.91 java
 1705 root      16   0  384m 142m  51m S  0.0 32.4   0:00.00 java
 1706 root      15   0  384m 142m  51m S  0.0 32.4   0:00.16 java
  573 mysql     16   0 27232  11m 5380 S  0.0  2.6   0:00.05 mysqld
  575 mysql     16   0 27232  11m 5380 S  0.0  2.6   0:00.00 mysqld
  576 mysql     16   0 27232  11m 5380 S  0.0  2.6   0:00.00 mysqld
[...snip irrelevant processes...]
===============================================================================
and:
===============================================================================
ludo:/home/nickj/wikipedia# ps auxwf
USER       PID %CPU %MEM    VSZ    RSS TTY      STAT START   TIME COMMAND
[...snip irrelevant processes...]
root       823  0.0  0.2   2240   1280 tty1     Ss   09:52   0:00 -bash
root      1692  0.0  0.2   2240   1280 tty1     S+   11:28   0:00  \_ -bash
root      1694 31.6 33.1 393228 149888 tty1     S+   11:28  25:08      \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1697  0.0 33.1 393228 149888 tty1     S+   11:28   0:00      |   \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1698  2.7 33.1 393228 149888 tty1     S+   11:28   2:11      |   \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1699  0.0 33.1 393228 149888 tty1     S+   11:28   0:00      |   \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1700  0.0 33.1 393228 149888 tty1     S+   11:28   0:00      |   \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1701  0.0 33.1 393228 149888 tty1     S+   11:28   0:00      |   \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1702  0.0 33.1 393228 149888 tty1     S+   11:28   0:00      |   \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1703  0.1 33.1 393228 149888 tty1     S+   11:28   0:05      |   \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1704  0.1 33.1 393228 149888 tty1     S+   11:28   0:05      |   \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1705  0.0 33.1 393228 149888 tty1     S+   11:28   0:00      |   \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1706  0.0 33.1 393228 149888 tty1     S+   11:28   0:00      |   \_ /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2
root      1695  2.5  1.9  11184   8696 tty1     S+   11:28   2:02      \_ mysql enwiki
===============================================================================
At around 1933000 articles it seemed to get stuck. I left it overnight (no change), then rebooted (for good measure), and then MySQL gave strange errors for cur (e.g. "ERROR 1016: Can't open file: 'cur.MYD'. (errno: 145)"), and refused to do anything with this table. Further investigation showed that the disk partition that MySQL was using was 100% full (Doh! My bad). I'm fairly confident that if there had been sufficient disk space, the mwdumper import would have succeeded.
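
For anyone else attempting the import, a free-space check before starting would have caught this. A minimal sketch, assuming a POSIX shell and df; the function names, the /var/lib/mysql path and the threshold in the usage comment are all hypothetical and need adjusting for the actual setup:

```shell
#!/bin/sh
# free_kb DIR: print the free space, in KB, on the partition holding DIR.
# Uses the POSIX output format of df, where column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# enough_space DIR MIN_KB: succeed only if at least MIN_KB are free under DIR.
enough_space() {
    avail=$(free_kb "$1") || return 1
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# Usage sketch (hypothetical data directory and threshold, left commented out):
# enough_space /var/lib/mysql 10000000 || { echo "not enough disk space" >&2; exit 1; }
```

How much space the import really needs is not something I measured, so the threshold is a guess to be replaced with a real figure.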
By the way, I noticed that in the TODO list in the README.txt, it has:
- Include table initialization in SQL output
This is a very good idea - i.e. for 1.4, output a "CREATE TABLE IF NOT EXISTS cur (...);" before the insert statements. I'd also suggest a table cleanout option, which does "DELETE FROM cur;" for 1.4 (it would be placed right after the table creation in the output, if this option is invoked). The equivalents for 1.5 are, I guess, probably CREATE TABLE IF NOT EXISTS for both 'page' and 'text', and "DELETE FROM text; DELETE FROM page;". A "--table-cleanout" or "--delete-current" or "--from-scratch" option would be very handy to automate this as part of the dump import process.
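
Until something like that exists, the cleanout can be approximated outside mwdumper by emitting the DELETE statements ahead of its output in the same pipe. A rough sketch only: cleanout_sql is a made-up helper name, not an mwdumper option, and the 1.5 branch simply follows the guess above (the 1.5 schema also has a 'revision' table, which would presumably need clearing too):

```shell
#!/bin/sh
# cleanout_sql VERSION: print the table cleanout statements for a schema version.
# Hypothetical helper for illustration; the table lists mirror the suggestion above.
cleanout_sql() {
    case "$1" in
        1.4) printf '%s\n' 'DELETE FROM cur;' ;;
        1.5) printf '%s\n' 'DELETE FROM text;' 'DELETE FROM page;' ;;
        *)   echo "unknown schema version: $1" >&2
             return 1 ;;
    esac
}

# Usage sketch (commented out so nothing is deleted by accident):
# { cleanout_sql 1.4
#   /usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar \
#       --format=sql:1.4 20051020_pages_articles.xml.bz2
# } | mysql enwiki
```

Because the statements go down the same pipe as the inserts, the table is emptied by the same mysql session immediately before the import starts.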
All the best, Nick.