Lars Aronsson wrote:
> What I want is the dump script to be rewritten so it prioritizes
> those databases (websites) that haven't been successfully dumped
> in a long time. It seems unfair that the French should have one
> every fortnight when we Swedes are waiting almost two months.
Currently the system sorts by order of last dump *attempt*, as I recall.
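For what it's worth, the change being asked for amounts to roughly this
(a quick Python sketch, not the real scheduler code; the attribute names
are invented for the example):

    from collections import namedtuple

    Wiki = namedtuple('Wiki', ['name', 'last_attempt', 'last_success'])

    def pick_next_wiki(wikis):
        """Pick the wiki whose last *successful* dump is oldest; wikis
        that have never completed a dump (last_success is None) sort
        first."""
        return min(wikis, key=lambda w: w.last_success or 0)

    # What happens today, roughly: sorting by last *attempt* instead, so
    # a wiki whose dumps keep failing still drops to the back of the queue.
    wikis = [Wiki('frwiki', last_attempt=1141000000, last_success=1141000000),
             Wiki('svwiki', last_attempt=1140900000, last_success=1136000000)]
    print(pick_next_wiki(wikis).name)   # -> svwiki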
> Other than this, I want the dumps to fail less often. Why do they
> fail? Has this been investigated? What can be done to help this?
There are several common reasons, which get addressed as they get
identified:
* Loss of database connection during run
This was the traditional problem.
Because dump runs for the big wikis take so long, it became relatively
common for the biggest wikis to fail to finish: some DB server would
break, or have to be taken down for maintenance, before the dump was
done, and when the server went down the process would just die.
The first level of this was worked around last year by improving the
reconnection behavior when individual connections would go down.
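The idea is essentially the following (an illustrative Python sketch only,
not the actual MediaWiki code, which is PHP): instead of dying on a lost
connection, catch the error, reopen the connection, and retry.

    import time
    import MySQLdb

    def query_with_reconnect(connect, sql, attempts=3):
        """Run a query, reopening the DB connection if the server goes away."""
        conn = connect()
        for attempt in range(attempts):
            try:
                cur = conn.cursor()
                cur.execute(sql)
                return cur.fetchall()
            except MySQLdb.OperationalError:
                # Lost connection / "server has gone away": back off,
                # reconnect (possibly to a restarted server), and try again.
                if attempt == attempts - 1:
                    raise
                time.sleep(5)
                conn = connect()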
A second level of failures was then discovered, and was worked around a
couple months ago by breaking the text fetching for the slowest parts of
the dump into a subprocess which can be restarted, connecting to another
server. This allows the system to recover even if the set of available
DB servers has changed, since it is able to reload its configuration.
A hanging issue with that code, where the recovery system would get
confused and go into a loop instead of bailing out gracefully, was
discovered recently and fixed.
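In outline, the behavior is now something like this (a much-simplified
Python sketch; the command and retry limit are invented, and the details
of the actual text fetch are left out):

    import subprocess

    MAX_RETRIES = 5   # bail out instead of looping forever

    def run_text_fetch(command):
        """Run the text-fetch child, restarting it a limited number of times."""
        for attempt in range(MAX_RETRIES):
            child = subprocess.Popen(command)
            if child.wait() == 0:
                return   # finished cleanly
            # A freshly started child re-reads the DB configuration, so it
            # can connect to whichever servers are available *now*.
        raise RuntimeError("text fetch failed %d times; giving up" % MAX_RETRIES)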
* Transitory hardware errors
For a while we had several breakages due to benet, the server the dumps
ran on, encountering disk errors which hung the system. The machine was
replaced some weeks ago.
* Transitory configuration errors
Bad copy of code, broken PHP install, change in MySQL privileges, etc.
These will cause a rash of scary-looking "failure!"s in a row, but are
easily fixed case-by-case, and the next runs continue just fine.
* Full disk
This still happens occasionally, breaking dumps until space is freed up;
the dump system doesn't have a very good disk-space-management scheme.
It can optionally delete old dumps after a few runs, but this is
currently disabled as it doesn't distinguish between good and bad dumps. :)
Dumps are currently sharing space with upload backups; we're waiting on
delivery of new fileservers with more space available.
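For the record, a less naive version of that cleanup option might look
something along these lines (a hypothetical sketch; the per-run directory
layout and the 'status-done' marker file are assumptions made for the
example):

    import os
    import shutil

    KEEP = 2   # number of *good* dumps to retain per wiki

    def prune_old_dumps(wiki_dir):
        """Delete a dump run only once KEEP newer successful runs exist."""
        runs = sorted(os.listdir(wiki_dir))   # dated run directories
        good = [r for r in runs
                if os.path.exists(os.path.join(wiki_dir, r, 'status-done'))]
        for run in runs:
            if sum(1 for g in good if g > run) >= KEEP:
                shutil.rmtree(os.path.join(wiki_dir, run))

That way a string of failed runs never causes the last known-good dump to
be thrown away.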
Note that the dump monitor script is available in our SVN, and patches
are welcome.
-- brion vibber (brion @ wikimedia.org)