Brion Vibber wrote 2012-11-21 23:20:
While generating a full dump, we're holding the database connection open... for a long, long time. Hours, days, or weeks in the case of English Wikipedia.
There are two issues with this:
- the DB server needs to maintain a consistent snapshot of the data since we started the connection, so it's doing extra work to keep old data around
- the DB connection needs to actually remain open; if the DB goes down or the dump process crashes, whoops! You just lost all your work.
So, grabbing just the page and revision metadata lets us generate a file with a consistent snapshot as quickly as possible. We get to let the databases go, and the second pass can die and restart as many times as it needs to while fetching the actual text, which is immutable (so there are no consistency worries in the second pass).
We definitely use this system for Wikimedia's data dumps!
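Roughly, the two-pass flow described above looks like the sketch below. This is Python with a made-up schema, file names, and helpers for illustration only, not the actual dumpBackup.php / backupTextPass.inc code: pass one holds the database snapshot just long enough to write out page/revision metadata, and pass two fetches the immutable text and can be killed and restarted freely.

```python
import sqlite3  # stand-in database; the table and column names below are illustrative

STUB_FILE = "stubs.tsv"   # pass-one output: page/revision metadata only
TEXT_FILE = "dump.tsv"    # pass-two output: metadata plus revision text


def pass_one_write_stubs(db_path):
    """Pass 1: dump page/revision metadata under a single consistent snapshot.
    The connection (and the snapshot the server has to maintain) lives only
    for the duration of this relatively quick pass."""
    conn = sqlite3.connect(db_path)
    try:
        with conn, open(STUB_FILE, "w") as out:
            for page_id, rev_id in conn.execute(
                "SELECT rev_page, rev_id FROM revision ORDER BY rev_page, rev_id"
            ):
                out.write(f"{page_id}\t{rev_id}\n")
    finally:
        conn.close()  # the long-lived snapshot is released here


def pass_two_fetch_text(db_path, resume_after=None):
    """Pass 2: fetch the (immutable) text for every stubbed revision.
    If this process dies, rerun it with resume_after set to the last
    revision ID already written and it picks up where it left off."""
    conn = sqlite3.connect(db_path)
    skipping = resume_after is not None
    with open(STUB_FILE) as stubs, open(TEXT_FILE, "a") as out:
        for line in stubs:
            page_id, rev_id = map(int, line.split("\t"))
            if skipping:
                if rev_id == resume_after:
                    skipping = False
                continue
            row = conn.execute(
                "SELECT old_text FROM text WHERE old_id = ?", (rev_id,)
            ).fetchone()
            out.write(f"{page_id}\t{rev_id}\t{row[0]}\n")
    conn.close()
```

The key point is that only pass one needs the database to hold anything consistent; pass two only ever reads rows that cannot change.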
Oh, thanks, now I understand! But the revisions are also immutable - isn't it simpler to just select the maximum revision ID at the beginning of the dump and discard newer page and image revisions during dump generation?
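I mean something like this (a rough Python sketch using the same illustrative schema as above, just to show the idea):

```python
import sqlite3  # stand-in database; illustrative schema, not real MediaWiki code


def single_pass_dump_with_cutoff(db_path):
    """The alternative being asked about: record the newest revision ID once at
    the start, then dump everything in a single pass, simply ignoring any
    revision created after that cutoff."""
    conn = sqlite3.connect(db_path)
    (cutoff,) = conn.execute("SELECT MAX(rev_id) FROM revision").fetchone()
    with open("dump.tsv", "w") as out:
        for page_id, rev_id, text in conn.execute(
            "SELECT rev_page, rev_id, old_text "
            "FROM revision JOIN text ON old_id = rev_id "
            "WHERE rev_id <= ? ORDER BY rev_page, rev_id",
            (cutoff,),
        ):
            out.write(f"{page_id}\t{rev_id}\t{text}\n")
    conn.close()
```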
Also, I have the same question about the 'spawn' feature of backupTextPass.inc :) What is it intended for? :)