On Wed, Nov 21, 2012 at 4:54 AM, vitalif@yourcmc.ru wrote:
Hello!
While working on my improvements to MediaWiki Import&Export, I've discovered a feature that is totally new to me: the 2-phase backup dump. That is, the first-pass dumper creates an XML file without page texts, and the second-pass dumper adds the page texts.
I have several questions about it: what is it intended for? Is it a sort of optimisation for large databases, and why was this method of optimisation chosen?
While generating a full dump, we're holding the database connection open... for a long, long time. Hours, days, or weeks in the case of English Wikipedia.
There are two issues with this:
* the DB server needs to maintain a consistent snapshot of the data since we started the connection, so it's doing extra work to keep old data around
* the DB connection needs to actually remain open; if the DB goes down or the dump process crashes, whoops! You just lost all your work.
So, grabbing just the page and revision metadata lets us generate a file with a consistent snapshot as quickly as possible. We get to let the databases go, and the second pass can die and restart as many times as it needs while fetching actual text, which is immutable (thus no worries about consistency in the second pass).
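In practice the two passes are run with the standard maintenance scripts, roughly like the sketch below (file names here are just placeholders, and exact option names may vary between MediaWiki versions):

    # Pass 1: dump page/revision metadata only (a "stub" dump). The DB
    # connection is held just long enough to get a consistent snapshot.
    php maintenance/dumpBackup.php --full --stub --output=gzip:stub.xml.gz

    # Pass 2: fill in revision text for each stub entry. This step can be
    # killed and restarted as often as needed, since revision text is immutable.
    php maintenance/dumpTextPass.php --stub=gzip:stub.xml.gz --output=bzip2:pages-full.xml.bz2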
We definitely use this system for Wikimedia's data dumps!
-- brion