Brion Vibber wrote 2012-11-21 23:20:
While generating a full dump, we're holding the database connection open... for a long, long time. Hours, days, or weeks in the case of English Wikipedia.
There are two issues with this:
- the DB server needs to maintain a consistent snapshot of the data since we started the connection, so it's doing extra work to keep old data around
- the DB connection needs to actually remain open; if the DB goes down or the dump process crashes, whoops! You just lost all your work.
So, grabbing just the page and revision metadata lets us generate a file with a consistent snapshot as quickly as possible. We get to let the databases go, and the second pass can die and restart as many times as it needs to while fetching the actual text, which is immutable (so there are no consistency worries in the second pass).
We definitely use this system for Wikimedia's data dumps!
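Roughly, the two-pass flow described above looks like the sketch below. This is Python with a made-up schema, file names, and helpers for illustration only, not the actual dumpBackup.php / backupTextPass.inc code: pass one holds the database snapshot just long enough to write out page/revision metadata, and pass two fetches the immutable text and can be killed and restarted freely.

```python
import sqlite3  # stand-in database; the table and column names below are illustrative

STUB_FILE = "stubs.tsv"   # pass-one output: page/revision metadata only
TEXT_FILE = "dump.tsv"    # pass-two output: metadata plus revision text


def pass_one_write_stubs(db_path):
    """Pass 1: dump page/revision metadata under a single consistent snapshot.
    The connection (and the snapshot the server has to maintain) lives only
    for the duration of this relatively quick pass."""
    conn = sqlite3.connect(db_path)
    try:
        with conn, open(STUB_FILE, "w") as out:
            for page_id, rev_id in conn.execute(
                "SELECT rev_page, rev_id FROM revision ORDER BY rev_page, rev_id"
            ):
                out.write(f"{page_id}\t{rev_id}\n")
    finally:
        conn.close()  # the long-lived snapshot is released here


def pass_two_fetch_text(db_path, resume_after=None):
    """Pass 2: fetch the (immutable) text for every stubbed revision.
    If this process dies, rerun it with resume_after set to the last
    revision ID already written and it picks up where it left off."""
    conn = sqlite3.connect(db_path)
    skipping = resume_after is not None
    with open(STUB_FILE) as stubs, open(TEXT_FILE, "a") as out:
        for line in stubs:
            page_id, rev_id = map(int, line.split("\t"))
            if skipping:
                if rev_id == resume_after:
                    skipping = False
                continue
            row = conn.execute(
                "SELECT old_text FROM text WHERE old_id = ?", (rev_id,)
            ).fetchone()
            out.write(f"{page_id}\t{rev_id}\t{row[0]}\n")
    conn.close()
```

The key point is that only pass one needs the database to hold anything consistent; pass two only ever reads rows that cannot change.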
Oh, thanks, now I understand! But the revisions are also immutable - isn't it simpler to just select the maximum revision ID at the beginning of the dump and discard newer page and image revisions during dump generation?
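I mean something like this (a rough Python sketch using the same illustrative schema as above, just to show the idea):

```python
import sqlite3  # stand-in database; illustrative schema, not real MediaWiki code


def single_pass_dump_with_cutoff(db_path):
    """The alternative being asked about: record the newest revision ID once at
    the start, then dump everything in a single pass, simply ignoring any
    revision created after that cutoff."""
    conn = sqlite3.connect(db_path)
    (cutoff,) = conn.execute("SELECT MAX(rev_id) FROM revision").fetchone()
    with open("dump.tsv", "w") as out:
        for page_id, rev_id, text in conn.execute(
            "SELECT rev_page, rev_id, old_text "
            "FROM revision JOIN text ON old_id = rev_id "
            "WHERE rev_id <= ? ORDER BY rev_page, rev_id",
            (cutoff,),
        ):
            out.write(f"{page_id}\t{rev_id}\t{text}\n")
    conn.close()
```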
Also, I have the same question about the 'spawn' feature of backupTextPass.inc :) What is it intended for? :)