Is a hardware fix also necessary? That is, was the problem a function of size as well as cleanup, or just the cleanup?
On 1/19/07, Brion Vibber brion@pobox.com wrote:
== Summary ==
A full disk on the database master for the non-en.wikipedia.org wikis made most of our wikis uneditable for about 2.5 hours, during European midday / US morning.
Immediate problem is repaired; some minor further cleanup needed; procedural changes recommended.
== Disruption and data loss ==
The last edits to make it to the slave servers were at: 2007-01-19 11:50:29 UTC
A few more made it through on samuel before it stopped accepting data, the last at: 2007-01-19 11:51:03 (23 broken edits on de.wikipedia).
After that point the database didn't accept more writes, leaving a read-only state which didn't allow any further consistency problems to develop.
There _may_ be some minor problems related to caching of revision data where ID numbers overlap from the old server, but this is unclear.
== Inspection and repair ==
I was woken up around 13:50 to take a look, informed that samuel (non-enwiki master) was out of disk space and wikis were read-only.
After a few minutes to check that the slaves were consistent and that there wasn't _too_ bad a lag between them and the master, I decided to go ahead with a master switch to adler, leaving samuel out of service until it gets re-cloned.
By 14:26 the master switch was done, and read-write service restored.
== Further work: immediate ==
If really desired, we may be able to clone the small number of 'lost' edits from samuel.
Once we no longer need samuel's data, it should have its database re-cloned from one of the slaves consistent with the new state, and it can be restored to slave service.
== Further work: long-term ==
Our procedure for monitoring disk space and cleaning up binlogs is terrible.
Low-disk warnings from Nagios are routinely ignored, in part because the thresholds seem much too high.
Binlog cleanup appears to be entirely manual and ad-hoc; there is no set schedule or assignment to do this.
The good news is this task is easy to automate.
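As a rough illustration of what automating this might look like, here is a minimal Python sketch that picks a safe argument for MySQL's PURGE BINARY LOGS TO statement. It assumes the list of binlogs on the master and the binlog file each slave is currently reading have already been collected (for example from SHOW BINARY LOGS and the Master_Log_File field of SHOW SLAVE STATUS). The function names, the file names, and the keep_extra safety margin are hypothetical and not part of any existing script.

```python
from typing import List, Optional


def binlog_index(name: str) -> int:
    """Return the numeric suffix of a binlog name such as 'samuel-bin.000123'."""
    return int(name.rsplit(".", 1)[1])


def purge_target(master_logs: List[str],
                 slave_reading: List[str],
                 keep_extra: int = 2) -> Optional[str]:
    """Pick the log name to pass to PURGE BINARY LOGS TO, or None.

    PURGE BINARY LOGS TO 'x' removes every log listed before 'x' and keeps
    'x' itself, so the target must not be newer than the oldest log any
    slave is still reading. keep_extra (an assumed safety margin) retains
    that many additional older files.
    """
    if not master_logs or not slave_reading:
        # Without slave positions we cannot prove anything is safe to remove.
        return None
    logs = sorted(master_logs, key=binlog_index)
    oldest_needed = min(binlog_index(n) for n in slave_reading)
    cutoff = oldest_needed - keep_extra
    candidates = [n for n in logs if binlog_index(n) <= cutoff]
    if not candidates or candidates[-1] == logs[0]:
        # Either nothing is old enough, or nothing precedes the target.
        return None
    return candidates[-1]


def purge_statement(target: str) -> str:
    """Build the SQL statement; actually running it is left to the caller."""
    return f"PURGE BINARY LOGS TO '{target}'"


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    logs = [f"samuel-bin.{i:06d}" for i in range(101, 111)]
    target = purge_target(logs, ["samuel-bin.000108", "samuel-bin.000109"])
    if target is not None:
        print(purge_statement(target))
```

Run from cron on each master, something along these lines would only ever remove logs that every slave has already moved past, and would do nothing if a slave's position is unknown.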
Recommendation:
- automate cleanup of binlogs on the db masters.
- make low-disk warnings more reasonable and visible for the masters specifically (where it really, really matters)
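
For the second recommendation, a master-specific disk check could be as small as the Python sketch below. It follows the usual Nagios plugin convention of exit codes 0 (OK), 1 (WARNING) and 2 (CRITICAL). The threshold percentages and the data directory path are placeholders chosen for illustration, not values taken from our configuration.

```python
import shutil
import sys
from typing import Tuple

# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(free_bytes: int, total_bytes: int,
             warn_free_pct: float, crit_free_pct: float) -> Tuple[int, str]:
    """Map free disk space to a Nagios status code and a one-line message."""
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct <= crit_free_pct:
        return CRITICAL, f"DISK CRITICAL - {free_pct:.1f}% free"
    if free_pct <= warn_free_pct:
        return WARNING, f"DISK WARNING - {free_pct:.1f}% free"
    return OK, f"DISK OK - {free_pct:.1f}% free"


def check_path(path: str, warn_free_pct: float, crit_free_pct: float) -> Tuple[int, str]:
    """Check the filesystem that holds the given path."""
    usage = shutil.disk_usage(path)
    return classify(usage.free, usage.total, warn_free_pct, crit_free_pct)


if __name__ == "__main__":
    # Placeholder path and thresholds; a real check would take these
    # from its arguments and use separate values for db masters.
    code, message = check_path("/", warn_free_pct=20.0, crit_free_pct=10.0)
    print(message)
    sys.exit(code)
```

Keeping the thresholds as parameters lets the masters get their own values, so that a warning on a master means something and is not lost among alerts from less critical hosts.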
-- brion vibber (brion @ pobox.com)
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l