Lars Aronsson wrote: [snip]
What I want is the dump script to be rewritten so it prioritizes those databases (websites) that haven't been successfully dumped in a long time. It seems unfair that the French should have one every fortnight while we Swedes are waiting almost two months.
Currently the system sorts by order of last dump *attempt*, as I recall.
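To make the distinction concrete, here is a minimal sketch of the scheduling change being asked for. The field names (`last_attempt`, `last_success`) are illustrative, not the actual dump scheduler's:

```python
def next_wiki_to_dump(wikis):
    """wikis: list of dicts with 'name', 'last_attempt', 'last_success'
    (unix timestamps; 0 means never)."""
    # Sorting on last *success* means a wiki whose runs keep failing
    # rises to the front of the queue, instead of sinking behind wikis
    # that merely *attempted* a dump more recently.
    return min(wikis, key=lambda w: w['last_success'])

# frwiki dumped successfully at t=100; svwiki's last attempt (t=150)
# failed, its last good dump was t=10 -- so svwiki goes first.
wikis = [
    {'name': 'frwiki', 'last_attempt': 100, 'last_success': 100},
    {'name': 'svwiki', 'last_attempt': 150, 'last_success': 10},
]
print(next_wiki_to_dump(wikis)['name'])  # svwiki
```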
Other than this, I want the dumps to fail less often. Why do they fail? Has this been investigated? What can be done to help this?
There are several common reasons, which get addressed as they get investigated...
* Loss of database connection during run
This was the traditional problem.
Because dump runs for the big wikis take so long, it became relatively common for the biggest wikis to fail to finish: some DB server would break, or would have to be taken down for maintenance, before the dump completed. When the server went down, the process would simply die.
The first level of this was worked around last year by improving the reconnection behavior when individual connections would go down.
A second level of failures was then discovered, and was worked around a couple of months ago by breaking the text fetching for the slowest parts of the dump into a subprocess which can be restarted against another server. This lets the system recover even if the set of available DB servers has changed, since the subprocess reloads its configuration on restart.
One remaining issue with that code, where the recovery system could get confused and loop forever instead of bailing out gracefully, was discovered and fixed recently.
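The restart-and-bail-out behavior described above can be sketched roughly like this. This is not the actual dump code; `fetch` and `reload_servers` are hypothetical stand-ins for the text-fetch subprocess and the configuration reload:

```python
MAX_RESTARTS = 5

def run_with_restarts(fetch, reload_servers, max_restarts=MAX_RESTARTS):
    """fetch(server) -> True on success, False if the run died.
    reload_servers() -> a server to try (config may have changed)."""
    for attempt in range(max_restarts):
        # Reload configuration each time, so we can recover even if
        # the set of available DB servers has changed since we started.
        server = reload_servers()
        if fetch(server):
            return True
    # Bounded retries: give up gracefully rather than spin forever.
    return False

# Usage: a fetcher that fails twice (simulating dropped DB connections),
# then succeeds once the reloaded config points at a healthy server.
attempts = []
def fetch(server):
    attempts.append(server)
    return len(attempts) >= 3

print(run_with_restarts(fetch, lambda: 'db%d' % len(attempts)))  # True
```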
* Transitory hardware errors
For a while we had several breakages due to benet, the server dumps ran on, encountering disk errors which hung the system. The machine was replaced some weeks ago.
* Transitory configuration errors
Bad copy of code, broken PHP install, change in MySQL privileges, etc. These cause a rash of scary-looking "failure!"s in a row, but are easily fixed case by case, and subsequent runs continue just fine.
* Full disk
This still happens occasionally, breaking dumps until space is freed up; the dump system doesn't have a very good disk-space-management scheme. It can optionally delete old dumps after a few runs, but this is currently disabled as it doesn't distinguish between good and bad dumps. :)
Dumps are currently sharing space with upload backups; we're waiting on delivery of new fileservers with more space available.
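For what it's worth, the missing distinction is simple to express. A hypothetical cleanup policy (not the current code) that only counts *successful* runs against the keep limit, so a string of failures can never wipe out the last good dump:

```python
def runs_to_delete(runs, keep_good=2):
    """runs: list of (timestamp, ok) tuples, newest first.
    Returns the timestamps of runs that are safe to delete."""
    deletable, good_kept = [], 0
    for ts, ok in runs:
        if ok and good_kept < keep_good:
            good_kept += 1        # protect the newest good dumps
        else:
            deletable.append(ts)  # failed runs and older good ones can go
    return deletable

# Runs 5 and 3 failed; 4 and 2 are the two newest good dumps, so only
# the failures and the oldest good run (1) are deleted.
runs = [(5, False), (4, True), (3, False), (2, True), (1, True)]
print(runs_to_delete(runs))  # [5, 3, 1]
```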
Note that the dump monitor script is available in our SVN, and patches are welcome:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/
-- brion vibber (brion @ wikimedia.org)