On Wed, Nov 21, 2012 at 4:54 AM, <vitalif@yourcmc.ru> wrote:
Hello!
While working on my improvements to MediaWiki import/export, I've
discovered a feature that is totally new to me: the 2-phase backup dump.
That is, the first-pass dumper creates an XML file without page texts, and
the second-pass dumper adds the page texts.
I have several questions about it: what is it intended for? Is it a sort
of optimisation for large databases, and why was this method of
optimisation chosen?
While generating a full dump, we're holding the database connection
open... for a long, long time. Hours, days, or weeks in the case of
English Wikipedia.
There are two issues with this:
* the DB server needs to maintain a consistent snapshot of the data as of
when we started the connection, so it's doing extra work to keep old data around
* the DB connection needs to actually remain open; if the DB goes down or
the dump process crashes, whoops! you just lost all your work.
So, grabbing just the page and revision metadata lets us generate a file
with a consistent snapshot as quickly as possible. We then get to let the
databases go, and the second pass can die and restart as many times as it
needs while fetching the actual text, which is immutable (so there are no
worries about consistency in the second pass).
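In practice the two passes are driven by the maintenance scripts
dumpBackup.php and dumpTextPass.php. A rough sketch of the invocation
(exact flag names can differ between MediaWiki versions, and the file
names here are just placeholders):

  # First pass: page and revision metadata only ("stubs"), no page text.
  php maintenance/dumpBackup.php --full --stub \
      --output=gzip:stub-history.xml.gz

  # Second pass: fill in the text for each revision listed in the stub file.
  # --prefetch reuses text from a previous dump instead of re-reading the DB;
  # this pass can be killed and restarted without losing consistency.
  php maintenance/dumpTextPass.php --stub=gzip:stub-history.xml.gz \
      --prefetch=gzip:previous-full.xml.gz \
      --output=gzip:full-history.xml.gz

The stub file is small and written quickly, so only the cheap first pass
needs the long-lived consistent DB snapshot.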
We definitely use this system for Wikimedia's data dumps!
-- brion