Hi Rob,
I completed running mwdumper with the following command last night. It took several hours to complete. I aborted the previous importDump.php in order to run this test for you. The wiki they were run against was en.wikigadugi.org. The database name is endb.
[root@gadugi archive]# java -jar mwdumper.jar --format=sql:1.5 /wikidump/dump/enwiki-GFDL-20070206-pages-articles.xml | mysql -u root -p endb
Enter password:
[root@gadugi archive]#
The mwdumper run went through almost 4 million articles and then exited. I then applied the command you had specified. Same result: mwdumper does not work, as was previously reported on other blogs. I am running Fedora Core 5 on wikigadugi. Configuration has already been provided in previous posts. Here is the output from applying the UPDATE command via mysql.
[root@gadugi /]# mysql -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 30734 to server version: 5.0.18

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> use endb
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> UPDATE page SET page_touched = 20070226080700
    -> ;
Query OK, 61932 rows affected (2.72 sec)
Rows matched: 61932 Changed: 61932 Warnings: 0

mysql>
The server is at http://en.wikigadugi.org and, as you can see if you visit the site, mwdumper fails to add any of the articles to the database (other than filling the MySQL InnoDB file with a lot of wasted space).
:-)
Jeff
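One way to check whether an import actually landed is to count rows in the core tables directly, rather than judging by what the site displays. The following is a minimal sketch, not something taken from the thread: the helper name count_queries is hypothetical, and it assumes the MediaWiki 1.5 table names page, revision and text with no table prefix.

```shell
# Print one COUNT(*) query per table name given as an argument.
# count_queries is a hypothetical helper; the table names passed to it
# are assumptions about the schema (unprefixed MediaWiki 1.5 tables).
count_queries() {
  for table in "$@"; do
    printf "SELECT '%s' AS tbl, COUNT(*) AS n FROM %s;\n" "$table" "$table"
  done
}

# Hypothetical usage against the database named above:
#   count_queries page revision text | mysql -u root -p endb
```

If the counts are far below the number of pages mwdumper reported, the rows never reached the tables, whatever the UPDATE touched.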
On Tue, 2007-02-27 at 10:22 -0700, Jeff V. Merkey wrote:
I completed running mwdumper with the following command last night. It took several hours to complete. I aborted the previous importDump.php in order to run this test for you. The wiki they were run against was en.wikigadugi.org. The database name is endb.
As brion, robchurch, Simetrical, Duesentrieb, domas and others on irc can attest... mwdumper works GREAT here for me.
I've made some small changes to mwdumper's code to squeeze some minor speed improvements out of it, but it happily imports {en,de,fr,es,da}wik{iquote,ipedia,tionary} without any problems at all so far. Once I'm satisfied with the speed, I'll add the remaining wiki* pages and projects.
I must have imported enwiki 3-dozen times over the last two weeks on two physically separate machines. I've been thoroughly testing raw import speed with mwdumper/MySQL under BSD 6.2 with/without DMA enabled, IDE drives, SATA drives, and a myriad of other combinations.
Here's what I'm doing with it:
Click "Research Projects" there and see what it looks like. Each sub-section links to a full project page with more screenshots.
It works great. You're doing something wrong if it isn't working.
David A. Desrosiers wrote:
I must have imported enwiki 3-dozen times over the last two weeks on two physically separate machines. I've been thoroughly testing raw import speed with mwdumper/MySQL under BSD 6.2 with/without DMA enabled, IDE drives, SATA drives, and a myriad of other combinations.
Different OS. I am running Linux with Fedora Core 5. I think it's related to Java incompatibility and/or MySQL issues with Fedora Core 5. The php-mysql and a lot of other subsystems were changed in the Linux distros due to licensing issues with the MySQL code moving forward.
It works great. You're doing something wrong if it isn't working.
That's a stretch to say, since we are dealing with different OS's. I have purged the endb database completely and I am re-running the entire mwdumper test with Rob's UPDATE suggestions again. The previous database had data left over from an importDump.php run. After the test completes, I will post the results. I am fairly certain mwdumper has some incompatibility problems with the stock Linux distributions, as it has been reported in a lot of places.
Since more people run Linux than BSD at present, it bears looking into and getting fixed and/or understood. I am happy to devote time to getting to the bottom of it for Rob and Brion.
Jeff
Different OS. I am running Linux with Fedora Core 5. I think it's related to Java incompatibility and/or MySQL issues with Fedora Core 5. The php-mysql and a lot of other subsystems were changed in the Linux distros due to licensing issues with the MySQL code moving forward.
I've been doing this on Debian since at least 2004, and recently switched to BSD because I needed some features of the OS that Linux doesn't have yet (no OS wars from the peanut gallery please, I've been running Linux for > 12 years here ;)
My development machine for the Plucker Wikipedia work is my Ubuntu Feisty laptop (a Thinkpad T42p with as much disk and RAM as it will hold, if it matters).
My BSD machine is an AMD64/4600+/4G/SATA, and it works nicely there as well (500k pages/sec., according to mwdumper).
It works great. You're doing something wrong if it isn't working.
That's a stretch to say, since we are dealing with different OS's. I have purged out the endb database completely and I am re-running the entire mwdumper test with Rob's UPDATE suggestions again. The previous database had data leftover from an importDump.php run. After the test completes, I will post the results. I am fairly certain mwdumper has some incompatibility problems with the stock Linux
^^^^^^^^^^^
Distributions as it has been reported in a lot of places.
^^^^^^^^^^^^^^^ i.e. NOT Fedora Core Xx or Red Hat... It works on Debian Unstable and Ubuntu Feisty (previously Edgy Eft, also worked there).
Please cite the "lots of places" where you found reports of mwdumper failing on stock distributions.
I'd check that your distribution of choice doesn't have some bugs filed against its MySQL packages, or perhaps your Java options aren't ideal for your environment.
Incidentally, you can just redirect mwdumper's output to a .sql file and use mysql directly to import that if you wish. Both work perfectly here, albeit without any sort of feedback from the redirect method.
mwdumper [options] > big_ass_file.sql
mysql mediawiki < big_ass_file.sql
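Since the redirect method gives no feedback, the intermediate file can at least be inspected before it is fed to mysql. A rough sketch follows; the helper name count_inserts is hypothetical, and it assumes each statement in the file starts at the beginning of a line with "INSERT INTO <table> ", which should be confirmed against the actual file.

```shell
# Count the INSERT statements for one table in a SQL dump file.
# Assumption: every statement begins a line with "INSERT INTO <table> ".
# A statement may carry many rows, so this counts statements, not rows.
count_inserts() {
  sql_file=$1
  table=$2
  grep -c "^INSERT INTO $table " "$sql_file"
}

# Hypothetical usage on the file produced above:
#   count_inserts big_ass_file.sql page
```

A count of zero for the page table would point at the dump step rather than at MySQL.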
Since I don't know the configuration you're using for MySQL, I can't say whether or not you can optimize it. The defaults are most assuredly *NOT* optimal for a Wikipedia import, period. You need to make quite a few changes to optimize it for bulk inserts à la mwdumper.
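The message does not say which settings were changed. As one illustration only, the MySQL manual's advice for bulk loading InnoDB tables includes turning off autocommit, unique checks and foreign key checks for the loading session. The sketch below wraps a SQL stream in those session settings; the helper name wrap_bulk_sql is hypothetical, and whether this helps with a given dump (or interacts with any transaction statements already in the file) is an assumption to verify.

```shell
# Wrap a SQL stream read from stdin in session settings commonly
# suggested for InnoDB bulk loads, then restore the checks afterwards.
# wrap_bulk_sql is a hypothetical helper, not part of mwdumper or mysql.
wrap_bulk_sql() {
  printf '%s\n' 'SET autocommit=0;' 'SET unique_checks=0;' 'SET foreign_key_checks=0;'
  cat
  printf '%s\n' 'COMMIT;' 'SET unique_checks=1;' 'SET foreign_key_checks=1;'
}

# Hypothetical usage with the file and database names used in this thread:
#   wrap_bulk_sql < big_ass_file.sql | mysql -u root -p endb
```

Server-side settings such as the InnoDB buffer pool and log file sizes are a separate matter and belong in the MySQL configuration file rather than in the SQL stream.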
Since more people run Linux than BSD at present, it bears looking into and getting fixed and/or understood. I am happy to devote time to getting to the bottom of it for Rob and Brion.
That's an opinion, not fact. BSD runs on a lot more machines than you realize, and not just desktops. Ahem, I digress.
It works here on Linux and BSD, and has since at least 2004 when I started this. I have irc logs going back that far that can validate this over and over if you wish.
If you want, I can install FC5 here in VMware and give it a try, but I suspect as long as those packages are current, it will probably work.