== Summary ==
A full disk on the database master for the non-en.wikipedia.org wikis made most of our wikis uneditable for about 2.5 hours, during Europe midday / US morning.
Immediate problem is repaired; some minor further cleanup needed; procedural changes recommended.
== Disruption and data loss ==
The last edits to make it to the slave servers were at: 2007-01-19 11:50:29 UTC
A few more made it through on samuel before it stopped accepting data, the last at: 2007-01-19 11:51:03 (23 broken edits on de.wikipedia).
After that point the database accepted no more writes, leaving the wikis in a read-only state that kept any further consistency problems from developing.
There _may_ be some minor problems related to caching of revision data where ID numbers overlap from the old server, but this is unclear.
== Inspection and repair ==
I was woken up around 13:50 to take a look, informed that samuel (non-enwiki master) was out of disk space and wikis were read-only.
After taking a few minutes to check that the slaves were consistent and that the lag between them and the master wasn't _too_ bad, I decided to go ahead with a master switch to adler, leaving samuel out of service until it gets re-cloned.
By 14:26 the master switch was done, and read-write service restored.
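The pre-switch check described above (are the slaves consistent, and which one is furthest along?) can be sketched roughly as follows. This is only an illustration, not the procedure actually used: the `SlaveStatus` container, the helper names, and the host names in the usage note are hypothetical. It assumes each slave's replication position has already been read from the `Relay_Master_Log_File` and `Exec_Master_Log_Pos` columns of `SHOW SLAVE STATUS`, and that binlog file names end in a numeric suffix.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SlaveStatus:
    """Replication position of one slave (hypothetical container)."""
    host: str
    master_log_file: str      # Relay_Master_Log_File from SHOW SLAVE STATUS
    exec_master_log_pos: int  # Exec_Master_Log_Pos from SHOW SLAVE STATUS


def binlog_index(name: str) -> int:
    """Numeric suffix of a binlog file name, e.g. 'bin.000042' -> 42."""
    return int(name.rsplit(".", 1)[1])


def position(slave: SlaveStatus) -> Tuple[int, int]:
    """Comparable (file index, byte offset) position in the master's binlog."""
    return (binlog_index(slave.master_log_file), slave.exec_master_log_pos)


def all_consistent(slaves: List[SlaveStatus]) -> bool:
    """True if every slave has executed up to the same master position."""
    return len({position(s) for s in slaves}) <= 1


def most_advanced(slaves: List[SlaveStatus]) -> SlaveStatus:
    """The slave that has executed furthest into the master's binlog."""
    return max(slaves, key=position)
```

For example, with `SlaveStatus("db1", "bin.000042", 1000)` and `SlaveStatus("db2", "bin.000042", 1500)`, `all_consistent` returns `False` and `most_advanced` returns the `db2` entry, which would be the safer candidate for promotion.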
== Further work: immediate ==
If really desired, we may be able to clone the small number of 'lost' edits from samuel.
Once we no longer need samuel's data, it should have its database re-cloned from one of the slaves consistent with the new state, and it can be restored to slave service.
== Further work: long-term ==
Our procedure for monitoring disk space and cleaning up binlogs is terrible.
Low-disk warnings from Nagios are routinely ignored, in part because the thresholds seem much too high.
Binlog cleanup appears to be entirely manual and ad-hoc; there is no set schedule or assignment to do this.
The good news is this task is easy to automate.
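A minimal sketch of what such automation could look like, under stated assumptions: the function names and the `keep` safety margin are hypothetical, binlog names are assumed to end in a numeric suffix, and the caller is assumed to have already collected the master's log list (`SHOW BINARY LOGS`) and the file each slave is currently reading (`SHOW SLAVE STATUS`). The only MySQL behaviour relied on is that `PURGE BINARY LOGS TO 'name'` removes every log older than `name` and keeps `name` itself.

```python
from typing import List, Optional


def binlog_index(name: str) -> int:
    """Numeric suffix of a binlog file name, e.g. 'bin.000007' -> 7."""
    return int(name.rsplit(".", 1)[1])


def purge_target(master_logs: List[str],
                 slave_logs: List[str],
                 keep: int = 2) -> Optional[str]:
    """Pick the log name to hand to PURGE BINARY LOGS TO, or None.

    master_logs: binlog files present on the master.
    slave_logs:  the master binlog each slave is currently reading.
    keep:        extra logs to retain before the oldest one still in use
                 (hypothetical safety margin).
    """
    if not master_logs or not slave_logs:
        return None  # without slave positions, purge nothing
    ordered = sorted(master_logs, key=binlog_index)
    oldest_needed = min(binlog_index(n) for n in slave_logs)
    cutoff = oldest_needed - keep
    candidates = [n for n in ordered if binlog_index(n) <= cutoff]
    if not candidates or candidates[-1] == ordered[0]:
        return None  # nothing older than the target exists
    return candidates[-1]


def purge_statement(target: str) -> str:
    """SQL that removes every binlog older than `target`."""
    return f"PURGE BINARY LOGS TO '{target}'"
```

Run from a scheduled job, this would only ever remove logs that every slave has already moved past. MySQL's `expire_logs_days` server variable is another possible route, though it purges by age and does not look at slave positions.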
Recommendation:
* automate cleanup of binlogs on the db masters.
* make low-disk warnings more reasonable and visible for the masters specifically (where it really, really matters)
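As an illustration of the second recommendation, here is a rough sketch of a free-space check with thresholds chosen per host role. The 20% and 10% values and the function names are assumptions made up for the example, not the thresholds in use; the return codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
import shutil
from typing import Tuple


def classify_free_space(free_bytes: int,
                        total_bytes: int,
                        warn_fraction: float = 0.20,
                        crit_fraction: float = 0.10) -> Tuple[int, str]:
    """Map free disk space to a Nagios-style (exit code, label) pair.

    warn_fraction and crit_fraction are hypothetical defaults; a master
    could be given stricter values than a slave.
    """
    free_fraction = free_bytes / total_bytes
    if free_fraction < crit_fraction:
        return (2, "CRITICAL")
    if free_fraction < warn_fraction:
        return (1, "WARNING")
    return (0, "OK")


def check_path(path: str = "/") -> Tuple[int, str]:
    """Classify the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return classify_free_space(usage.free, usage.total)
```

A wrapper script would print the label and exit with the code, so the result shows up in monitoring like any other check.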
-- brion vibber (brion @ pobox.com)
Is there a latent hardware solution necessary? That is, was the problem a function of size as well as cleanup, or just the cleanup?
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l
Brad Patrick wrote:
Is there a latent hardware solution necessary? That is, was the problem a function of size as well as cleanup, or just the cleanup?
Just cleanup. Binlogs had been accumulating since September 30 on samuel, totaling about 65 GB. Had they been cleaned up more regularly there would have been no disk shortage.
(Incidentally, I removed the earliest 10 GB during emergency cleanup to free up some work space on the drive.)
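For a sense of scale, the figures above give a rough binlog growth rate. The short calculation below assumes "September 30" means 2006 and takes the day of the outage as the end date; the function name is hypothetical.

```python
from datetime import date


def daily_growth_gb(total_gb: float, start: date, end: date) -> float:
    """Average growth in GB per day between two dates."""
    return total_gb / (end - start).days


start = date(2006, 9, 30)  # assumption: the year is not stated in the message
end = date(2007, 1, 19)    # day the disk filled up
rate = daily_growth_gb(65, start, end)
print(f"{(end - start).days} days, about {rate:.2f} GB/day")
# prints: 111 days, about 0.59 GB/day
```

At that rate, even a weekly cleanup would have kept the binlogs to a few GB.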
-- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
wikitech-l@lists.wikimedia.org