Since Domas keeps complaining about the database load from the dumps (and then killing the dump processes), I've made some changes which should reduce the load involved.
Dumps are now being generated on a two-pass system. The first pass reads through the page and revision tables quickly and makes a stub dump, with rev_text_id references in place of the full page text.
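Illustratively (the exact attribute names may differ from what the scripts actually emit), a revision in the stub dump carries just a reference:

  <revision>
    <id>1234567</id>
    ...
    <text id="7654321" />    <!-- rev_text_id; no page text -->
  </revision>

whereas the full dump inlines the wikitext inside the <text> element.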
The second pass reads this stub dump, and the previous full dump of the same database. Existing revision text can be copied directly from the previous dump (page contents on a given revision ID are immutable). New revisions not in the old dump are read individually out of the database, using the rev_text_id to avoid having to hit the page or revision tables.
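For the curious, here's a rough sketch of what the second-pass merge amounts to. It assumes both streams list revisions in ascending revision-ID order and that stub <text> elements carry the rev_text_id in an id attribute; fetchTextFromDb() and emitRevision() are hypothetical stand-ins, not the actual maintenance-script functions:

<?php
// Rough sketch only (not the real maintenance script): lock-step merge
// of a stub dump with the previous full dump.

function nextRevision(XMLReader $r) {
    // Skip ahead to the next <revision>, then collect its first <id>
    // (the revision ID) and its <text> node.
    while ($r->read()) {
        if ($r->nodeType != XMLReader::ELEMENT || $r->name != 'revision') {
            continue;
        }
        $id = null; $textId = null; $text = null;
        while ($r->read()) {
            if ($r->nodeType == XMLReader::ELEMENT) {
                if ($r->name == 'id' && $id === null) {
                    $r->read();                       // step into the text node
                    $id = (int) $r->value;
                } elseif ($r->name == 'text') {
                    $textId = $r->getAttribute('id'); // stub reference
                    if (!$r->isEmptyElement) {
                        $r->read();                   // full dump: inline text
                        $text = $r->value;
                    }
                }
            } elseif ($r->nodeType == XMLReader::END_ELEMENT
                    && $r->name == 'revision') {
                return array($id, $textId, $text);
            }
        }
    }
    return null;   // end of stream
}

function fetchTextFromDb($textId) {
    // Hypothetical stand-in: the real code reads just the one text-table
    // row for this rev_text_id, never touching page or revision.
    return '';
}

function emitRevision($revId, $text) {
    // Hypothetical stand-in for writing the merged output dump.
    printf("revision %d: %d bytes\n", $revId, strlen($text));
}

$stub = new XMLReader(); $stub->open('stub-dump.xml');
$prev = new XMLReader(); $prev->open('previous-full-dump.xml');

$old = nextRevision($prev);
while (($new = nextRevision($stub)) !== null) {
    list($revId, $textId) = $new;
    while ($old !== null && $old[0] < $revId) {
        $old = nextRevision($prev);           // catch the old stream up
    }
    if ($old !== null && $old[0] == $revId) {
        $text = $old[2];                      // text is immutable: reuse it
    } else {
        $text = fetchTextFromDb($textId);     // new since the last dump
    }
    emitRevision($revId, $text);
}

Since both streams are ordered, the old dump only ever needs to be read forward, so memory use stays flat no matter how big the history is.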
At the moment I'm doing the full/current/articles split on the first pass, and the bzip2 and 7zip compression on the second pass, once the final data is in place.
Hopefully this will go a little more smoothly.
Also, last week the mwdumper dump import tool got a number of optimizations:
* Inserts are batched more efficiently for bulk insert.
* Folke Behrens sent a patch to rearrange and properly buffer things, which significantly speeds up the XML input and SQL generation.
* You can have it connect directly to the MySQL server if you have the MySQL Connector/J driver in the classpath (sample invocation below).
* There are some hints in the README on server configuration tweaks for faster import.
A precompiled .jar of the current code is available at: http://download.wikipedia.org/tools/
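As a usage sketch, from memory (double-check the README for the exact option spellings), piping the generated SQL into mysql looks like:

  java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u wikiuser -p wikidb

and a direct connection with Connector/J in the classpath looks something like:

  java -cp mwdumper.jar:mysql-connector.jar org.mediawiki.dumper.Dumper \
    '--output=mysql://127.0.0.1/wikidb?user=wikiuser&password=...' \
    --format=sql:1.5 pages_full.xml.bz2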
Source is in CVS, module mwdumper.
It's known to work with Sun's 1.5 JDK and GNU GCJ 4.0.1. Sun Java 1.4 may have problems with some dumps (known to fail on the last Japanese Wikipedia dump).
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
> The second pass reads this stub dump, and the previous full dump of the
> same database. Existing revision text can be copied directly from the
> previous dump (page contents on a given revision ID are immutable).
The thing I completely forgot to mention is that I'm using the new XMLReader extension in PHP 5.1 for the second pass; srv35 and srv36 have experimental PHP 5.1.0RC1 installations in /usr/local/php5 that get used for this step.
XMLReader has a 'pull' interface, so you can read off the XML stream at your own pace. Quite handy when you're already trapped in one SAX event loop reading the first stream. :)
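In miniature (file names and element matching here are made up for illustration; this isn't the production code), the combination looks like this: expat's SAX loop drives the stub stream, and the handler pulls on the XMLReader for the old dump only when it wants the next <text> node:

<?php
// Sketch: SAX event loop on one stream, XMLReader pulling on another.

$prev = new XMLReader();
$prev->open('previous-full-dump.xml');

function startElement($parser, $name, $attrs) {
    global $prev;
    // expat folds element names to upper case by default.
    if ($name == 'TEXT') {
        // We're inside a SAX callback for the stub stream, but we can
        // still step the second stream forward at our own pace.
        while ($prev->read()) {
            if ($prev->nodeType == XMLReader::ELEMENT
                    && $prev->name == 'text') {
                break;   // positioned on the next old <text> node
            }
        }
    }
}

function endElement($parser, $name) {
}

$sax = xml_parser_create();
xml_set_element_handler($sax, 'startElement', 'endElement');

$fp = fopen('stub-dump.xml', 'r');
while (!feof($fp)) {
    $chunk = fread($fp, 8192);
    xml_parse($sax, $chunk, feof($fp));
}
fclose($fp);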
-- brion vibber (brion @ pobox.com)