What about triggering MediaWiki's internal read-only mode as well? Then,
rather than visible cryptic errors about a locked SQL database, we'd
have MediaWiki's read-only mode warnings, and could show a nice ro-mode
message like "Disk space has reached an unsafe level...".
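For reference, MediaWiki already supports this via $wgReadOnly (or $wgReadOnlyFile, which a watchdog script could simply touch); a LocalSettings.php sketch, where the message text is illustrative:

```php
// LocalSettings.php sketch -- the message text is an assumption.
// Setting $wgReadOnly (or creating the file named by $wgReadOnlyFile)
// puts the wiki into read-only mode, showing this notice on edit attempts.
$wgReadOnly = 'Disk space has reached an unsafe level; editing is disabled until a sysadmin frees some space.';
```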
~Daniel Friesen (Dantman, Nadir-Seen-Fire)
~Profile/Portfolio:
Tim Starling wrote:
> The server was in Nagios and was reporting a critical disk-full
> status. I'm not sure exactly when it entered that state.
I'm inclined to think that the issue here is not the need for more
technology, but rather the need for procedures. There's no point in having
monitoring if nobody is watching the output.
There are roughly three major areas to fix:
1) Fail-safe
2) Emergency notifications
3) General monitoring procedure
Fail-safe:
Elevators have emergency brakes, water heaters have pressure-release
valves, and electric grids have circuit breakers.
No matter how good your procedures are, it's a simple fact that human
error or a surprise situation will lead to something being missed. A
good system is built so that if it fails, it is much more likely to fail
in a safe manner than a catastrophic one.
MySQL fails unsafely in the disk-full case: changes are still written
into InnoDB, but binlogs cannot be updated, so replication fails and
there's no way to re-sync slaves without things getting really ugly.
A fail-safe mode would be for it to switch into read-only mode when the
disk fills up. This would:
a) Make it impossible to corrupt the data and de-sync the replication stream
b) Make it impossible to miss -- the site will be rejecting edits very
visibly, triggering immediate requests for sysadmin intervention.
If we can't get this fixed in MySQL itself, it should be easy to rig up
a disk-space monitor that puts MySQL into read-only mode when free
space falls below a threshold.
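A minimal sketch of such a watchdog, in Python; the threshold, the data directory, and the bare `mysql -e` invocation (credentials assumed to come from ~/.my.cnf) are all assumptions. `SET GLOBAL read_only = 1` is the standard server-wide switch:

```python
# Hedged sketch: a cron-style watchdog that flips MySQL into read-only
# mode when free disk space falls below a threshold.
import shutil
import subprocess

THRESHOLD_BYTES = 5 * 1024**3  # assumed: lock down below 5 GiB free

def should_lock(free_bytes, threshold_bytes=THRESHOLD_BYTES):
    """True when free space has fallen below the threshold."""
    return free_bytes < threshold_bytes

def check_and_lock(data_dir="/var/lib/mysql"):
    free = shutil.disk_usage(data_dir).free
    if should_lock(free):
        # SET GLOBAL read_only is standard MySQL; it rejects writes from
        # ordinary clients while leaving the server up for reads.
        subprocess.run(["mysql", "-e", "SET GLOBAL read_only = 1;"],
                       check=True)
```

Run from cron every minute or so; pair it with the MediaWiki read-only message so users see a friendly notice instead of edit failures.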
We talked about this the last time this happened, but it never got
implemented. Time to fix it!
Once a failsafe is in, then we can worry about the details of ongoing
maintenance like, oh, making sure old binlogs are rotated automatically
instead of when someone remembers to do it.
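One low-tech option is to let MySQL expire old binlogs itself; a my.cnf sketch, where the retention period and size are assumed values:

```ini
# my.cnf sketch -- values are assumptions, tune to the actual disk budget
[mysqld]
expire_logs_days = 14    # purge binlogs older than two weeks
max_binlog_size  = 1G    # roll to a new binlog file at 1 GB
```

(`PURGE BINARY LOGS BEFORE ...` does the same on demand, but only run it once the slaves are known to be past the logs in question.)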
Emergency notifications -- the boy who cried wolf:
Our Nagios monitoring system spews lots of notifications around, but it
spews *far too many* notifications, most of which are utterly
unimportant. As a result, we keep dialing back how often we see them so
they don't annoy us, and nobody notices when something important finally
comes up.
I would thus recommend that only absolute emergency situations, such as
the triggering of a fail-safe shutdown, should do something like an SMS
blast to the sysadmins.
Notifications like "disk space will be used up in a month" should stick
with the general monitoring.
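One way to wire that up in Nagios is a dedicated on-call contact that is only notified on CRITICAL and host-DOWN, routed to a pager/SMS command; a sketch, where `notify-by-sms` and the pager number are assumed placeholders:

```
# Nagios object-config sketch -- command name and pager are assumptions
define contact {
    contact_name                   oncall-sms
    alias                          On-call sysadmin (SMS)
    service_notification_period    24x7
    host_notification_period       24x7
    service_notification_options   c    ; CRITICAL only -- no warnings
    host_notification_options      d    ; host DOWN only
    service_notification_commands  notify-by-sms
    host_notification_commands     notify-by-sms
    pager                          +15555550100
}
```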
Monitoring procedure:
We have lots of shiny graphs showing us how much CPU is in use, how much
IO is going on, how much disk space is free, etc. But indeed, we don't
have a solid habit of "check the free space every week and decide when
we need to get more."
These are issues we know about, but haven't yet built procedures around.
"Disk full" in this case isn't something we need to explicitly check for
every day; ongoing usage is predictable, and automated reports should be
able to tell us how long we have before we hit a "fix it now" threshold.
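That forecast is simple arithmetic; a sketch assuming roughly linear growth between two usage samples (e.g. from weekly `df` snapshots):

```python
# Hedged sketch: days until disk usage hits a "fix it now" threshold,
# extrapolated linearly from two samples. All parameter names are
# illustrative; units just need to be consistent (e.g. bytes or GB).
def days_until_threshold(used_then, used_now, days_between,
                         capacity, threshold_fraction=0.9):
    """Return estimated days until usage reaches threshold_fraction of
    capacity, or None if usage is flat or shrinking."""
    daily_growth = (used_now - used_then) / days_between
    if daily_growth <= 0:
        return None
    remaining = capacity * threshold_fraction - used_now
    return max(remaining / daily_growth, 0.0)
```

A weekly cron job comparing this estimate against, say, a 30-day lead time would turn "disk full" surprises into ordinary capacity-planning tickets.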
-- brion