On Wed, Nov 21, 2012 at 4:54 AM, vitalif@yourcmc.ru wrote:
Hello!
While working on my improvements to MediaWiki Import&Export, I've discovered a feature that is totally new to me: the 2-phase backup dump. That is, the first-pass dumper creates an XML file without page texts, and the second-pass dumper adds the page texts.
I have several questions about it: what is it intended for? Is it a sort of optimisation for large databases, and why was this method of optimisation chosen?
While generating a full dump, we're holding the database connection open... for a long, long time. Hours, days, or weeks in the case of English Wikipedia.
There are two issues with this:
* the DB server needs to maintain a consistent snapshot of the data since we started the connection, so it's doing extra work to keep old data around
* the DB connection needs to actually remain open; if the DB goes down or the dump process crashes, whoops! You just lost all your work.
So, grabbing just the page and revision metadata lets us generate a file with a consistent snapshot as quickly as possible. We get to let the databases go, and the second pass can die and restart as many times as it needs while fetching actual text, which is immutable (thus no worries about consistency in the second pass).
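In practice the two passes are run with the standard maintenance scripts, roughly like the sketch below (file names here are just placeholders, and exact option names may vary between MediaWiki versions):

    # Pass 1: dump page/revision metadata only (a "stub" dump). The DB
    # connection is held just long enough to get a consistent snapshot.
    php maintenance/dumpBackup.php --full --stub --output=gzip:stub.xml.gz

    # Pass 2: fill in revision text for each stub entry. This step can be
    # killed and restarted as often as needed, since revision text is immutable.
    php maintenance/dumpTextPass.php --stub=gzip:stub.xml.gz --output=bzip2:pages-full.xml.bz2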
We definitely use this system for Wikimedia's data dumps!
-- brion