Re: [Wikitech-l] Database dumps fail in the wrong way

30 May 2008

Lars Aronsson wrote:
[snip]
...
 What I want is the dump script to be rewritten so it prioritizes
 those databases (websites) that haven't been successfully dumped
 in a long time.  It seems unfair that the French should have one
 every fortnight when us Swedes are waiting almost two months.
Currently the system sorts by order of last dump *attempt*, as I recall.
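
To make the difference concrete, here is a minimal sketch in Python. It is not taken from the dump scripts: the DumpStatus record, the wiki names, and the dates are all hypothetical, and it only assumes that the time of the last attempt and of the last successful dump are both recorded per wiki.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional


class DumpStatus(NamedTuple):
    # Hypothetical per-wiki record; field names are assumptions.
    wiki: str
    last_attempt: Optional[datetime]
    last_success: Optional[datetime]


def by_last_attempt(statuses: List[DumpStatus]) -> List[DumpStatus]:
    # Ordering as described above: least recently *attempted* first.
    return sorted(statuses, key=lambda s: s.last_attempt or datetime.min)


def by_last_success(statuses: List[DumpStatus]) -> List[DumpStatus]:
    # Requested ordering: least recently *successful* first.
    # Wikis that have never succeeded sort to the very front.
    return sorted(statuses, key=lambda s: s.last_success or datetime.min)


# Made-up data: wiki_b was attempted recently, but the attempt failed.
statuses = [
    DumpStatus("wiki_a", datetime(2008, 5, 20), datetime(2008, 5, 20)),
    DumpStatus("wiki_b", datetime(2008, 5, 25), datetime(2008, 4, 1)),
]

print([s.wiki for s in by_last_attempt(statuses)])
print([s.wiki for s in by_last_success(statuses)])
```

With attempt-based ordering the wiki whose last run failed waits behind the one that succeeded; with success-based ordering it goes first.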

...
 Other than this, I want the dumps to fail less often.  Why do they
 fail?  Has this been investigated?  What can be done to help this?
There are several common reasons, which get addressed as they get 
investigated...

* Loss of database connection during run

This was the traditional problem.

Due to the length of runs for big wikis, it became relatively common for
the biggest wikis to fail to finish because some DB server broke or
maintenance had to be done before the run completed. When the server
went down, the process would just die.

The first level of this was worked around last year by improving the 
reconnection behavior when individual connections would go down.

A second level of failures was then discovered, and was worked around a 
couple months ago by breaking the text fetching for the slowest parts of 
the dump into a subprocess which can be restarted, connecting to another 
server. This allows the system to recover even if the set of available 
DB servers has changed, since it is able to reload its configuration.

A hanging issue with that code, where the recovery system would get 
confused and go into a loop instead of bailing out gracefully, was 
discovered recently and fixed.
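
The general pattern (a restartable subprocess, configuration re-read on each restart, and a hard limit so it bails out instead of looping) can be sketched in Python roughly as follows. This is an illustration under assumptions, not the actual dump code: run_with_restarts and make_command are hypothetical names, and make_command stands in for whatever reloads the configuration and picks a DB server.

```python
import subprocess
import sys
import time
from typing import Callable, List


def run_with_restarts(make_command: Callable[[], List[str]],
                      max_attempts: int = 5,
                      delay: float = 0.0) -> int:
    """Run a subprocess, restarting it on failure.

    make_command is called before every attempt, so it can re-read
    configuration and choose a different server each time.
    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        command = make_command()
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            time.sleep(delay)
    # Bounded retries: give up cleanly rather than looping forever.
    raise RuntimeError(f"giving up after {max_attempts} failed attempts")


if __name__ == "__main__":
    # Stand-in child process that always succeeds.
    ok = [sys.executable, "-c", "pass"]
    print(run_with_restarts(lambda: ok))
```

The explicit attempt limit is the part that guards against the kind of recovery loop described above.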

* Transitory hardware errors

For a while we had several breakages due to benet, the server the dumps
ran on, encountering disk errors which hung the system. The machine was
replaced some weeks ago.

* Transitory configuration errors

Bad copy of code, broken PHP install, change in MySQL privileges, etc. 
These will cause a rash of scary-looking "failure!"s in a row, but are 
easily fixed case-by-case, and the next runs continue just fine.

* Full disk

This still happens occasionally, breaking dumps until space is freed up; 
the dump system doesn't have a very good disk-space-management scheme. 
It can optionally delete old dumps after a few runs, but this is 
currently disabled as it doesn't distinguish between good and bad dumps. :)
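
One possible cleanup policy that does distinguish good from bad dumps is sketched below in Python. It is only an illustration: the Dump record, its ok flag, and the keep_good parameter are hypothetical and do not describe how the dump system stores its state. The policy assumed here is "always keep the newest few successful dumps; everything else is eligible for deletion".

```python
from typing import List, NamedTuple


class Dump(NamedTuple):
    # Hypothetical record of one dump run.
    date: str   # "YYYYMMDD", so string order is chronological order
    ok: bool    # True if the run completed successfully


def dumps_to_delete(dumps: List[Dump], keep_good: int = 2) -> List[Dump]:
    """Return the dumps that may be removed, newest first.

    The newest keep_good successful dumps are always kept, so a
    string of failed runs can never push out the last good dump.
    """
    ordered = sorted(dumps, key=lambda d: d.date, reverse=True)
    kept_good = 0
    delete = []
    for d in ordered:
        if d.ok and kept_good < keep_good:
            kept_good += 1      # one of the protected good dumps
        else:
            delete.append(d)    # failed run, or an older good dump
    return delete


# Made-up example history.
history = [
    Dump("20080401", True),
    Dump("20080415", False),
    Dump("20080501", True),
    Dump("20080515", False),
    Dump("20080520", True),
]
print([d.date for d in dumps_to_delete(history)])
```

A real implementation would also need to skip a run that is still in progress, which this sketch does not model.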

Dumps are currently sharing space with upload backups; we're waiting on 
delivery of new fileservers with more space available.

Note that the dump monitor script is available in our SVN, and patches 
are welcome:

http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/

-- brion vibber (brion @ wikimedia.org)
