Since Domas keeps complaining about the database load from the dumps (and then killing the dump processes), I've made some changes which should reduce the load involved.
Dumps are now being generated on a two-pass system. The first pass reads through the page and revision tables quickly and makes a stub dump, with rev_text_id references in place of the full page text.
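Illustratively (the exact attribute names may differ from what the scripts actually emit), a revision in the stub dump carries just a reference:

  <revision>
    <id>1234567</id>
    ...
    <text id="7654321" />    <!-- rev_text_id; no page text -->
  </revision>

whereas the full dump inlines the wikitext inside the <text> element.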
The second pass reads this stub dump, and the previous full dump of the same database. Existing revision text can be copied directly from the previous dump (page contents on a given revision ID are immutable). New revisions not in the old dump are read individually out of the database, using the rev_text_id to avoid having to hit the page or revision tables.
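For the curious, here's a rough sketch of what the second-pass merge amounts to. It assumes both streams list revisions in ascending revision-ID order and that stub <text> elements carry the rev_text_id in an id attribute; fetchTextFromDb() and emitRevision() are hypothetical stand-ins, not the actual maintenance-script functions:

<?php
// Rough sketch only (not the real maintenance script): lock-step merge
// of a stub dump with the previous full dump.

function nextRevision(XMLReader $r) {
    // Skip ahead to the next <revision>, then collect its first <id>
    // (the revision ID) and its <text> node.
    while ($r->read()) {
        if ($r->nodeType != XMLReader::ELEMENT || $r->name != 'revision') {
            continue;
        }
        $id = null; $textId = null; $text = null;
        while ($r->read()) {
            if ($r->nodeType == XMLReader::ELEMENT) {
                if ($r->name == 'id' && $id === null) {
                    $r->read();                       // step into the text node
                    $id = (int) $r->value;
                } elseif ($r->name == 'text') {
                    $textId = $r->getAttribute('id'); // stub reference
                    if (!$r->isEmptyElement) {
                        $r->read();                   // full dump: inline text
                        $text = $r->value;
                    }
                }
            } elseif ($r->nodeType == XMLReader::END_ELEMENT
                    && $r->name == 'revision') {
                return array($id, $textId, $text);
            }
        }
    }
    return null;   // end of stream
}

function fetchTextFromDb($textId) {
    // Hypothetical stand-in: the real code reads just the one text-table
    // row for this rev_text_id, never touching page or revision.
    return '';
}

function emitRevision($revId, $text) {
    // Hypothetical stand-in for writing the merged output dump.
    printf("revision %d: %d bytes\n", $revId, strlen($text));
}

$stub = new XMLReader(); $stub->open('stub-dump.xml');
$prev = new XMLReader(); $prev->open('previous-full-dump.xml');

$old = nextRevision($prev);
while (($new = nextRevision($stub)) !== null) {
    list($revId, $textId) = $new;
    while ($old !== null && $old[0] < $revId) {
        $old = nextRevision($prev);           // catch the old stream up
    }
    if ($old !== null && $old[0] == $revId) {
        $text = $old[2];                      // text is immutable: reuse it
    } else {
        $text = fetchTextFromDb($textId);     // new since the last dump
    }
    emitRevision($revId, $text);
}

Since both streams are ordered, the old dump only ever needs to be read forward, so memory use stays flat no matter how big the history is.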
At the moment I'm doing the full/current/articles split on the first pass, and the bzip2 and 7zip compression on the second pass, once the final data is in place.
Hopefully this will go a little more smoothly.
Also, last week the mwdumper dump import tool got a number of optimizations:
* Inserts are batched more efficiently for bulk insert.
* Folke Behrens sent a patch to rearrange and properly buffer things, which significantly speeds up the XML input and SQL generation.
* You can have it connect directly to the MySQL server if you have the MySQL Connector/J driver in the classpath (sample invocation below).
* There are some hints in the README on server configuration tweaks for faster import.
A precompiled .jar of the current code is available at: http://download.wikipedia.org/tools/
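As a usage sketch, from memory (double-check the README for the exact option spellings), piping the generated SQL into mysql looks like:

  java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u wikiuser -p wikidb

and a direct connection with Connector/J in the classpath looks something like:

  java -cp mwdumper.jar:mysql-connector.jar org.mediawiki.dumper.Dumper \
    '--output=mysql://127.0.0.1/wikidb?user=wikiuser&password=...' \
    --format=sql:1.5 pages_full.xml.bz2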
Source is in CVS, module mwdumper.
It's known to work with Sun's 1.5 JDK and GNU GCJ 4.0.1. Sun Java 1.4 may have problems with some dumps (known to fail on the last Japanese Wikipedia dump).
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
> The second pass reads this stub dump, and the previous full dump of the
> same database. Existing revision text can be copied directly from the
> previous dump (page contents on a given revision ID are immutable).
The thing I completely forgot to mention is that I'm using the new XMLReader extension in PHP 5.1 for the second pass; srv35 and srv36 have experimental PHP 5.1.0RC1 installations in /usr/local/php5 that get used for this step.
XMLReader has a 'pull' interface, so you can read off the XML stream at your own pace. Quite handy when you're already trapped in one SAX event loop reading the first stream. :)
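In miniature (file names and element matching here are made up for illustration; this isn't the production code), the combination looks like this: expat's SAX loop drives the stub stream, and the handler pulls on the XMLReader for the old dump only when it wants the next <text> node:

<?php
// Sketch: SAX event loop on one stream, XMLReader pulling on another.

$prev = new XMLReader();
$prev->open('previous-full-dump.xml');

function startElement($parser, $name, $attrs) {
    global $prev;
    // expat folds element names to upper case by default.
    if ($name == 'TEXT') {
        // We're inside a SAX callback for the stub stream, but we can
        // still step the second stream forward at our own pace.
        while ($prev->read()) {
            if ($prev->nodeType == XMLReader::ELEMENT
                    && $prev->name == 'text') {
                break;   // positioned on the next old <text> node
            }
        }
    }
}

function endElement($parser, $name) {
}

$sax = xml_parser_create();
xml_set_element_handler($sax, 'startElement', 'endElement');

$fp = fopen('stub-dump.xml', 'r');
while (!feof($fp)) {
    $chunk = fread($fp, 8192);
    xml_parse($sax, $chunk, feof($fp));
}
fclose($fp);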
-- brion vibber (brion @ pobox.com)