What about triggering MediaWiki's internal read-only mode as well? Then, rather than visible but cryptic errors about a locked SQL database, we'd get MediaWiki's read-only mode warnings, and we could put up a nice read-only message like "Disk space has reached an unsafe level...".
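Something along these lines could do it -- just an untested sketch, assuming LocalSettings.php points $wgReadOnlyFile at the path below; the path, threshold, and message text are all made up:

#!/usr/bin/env python
"""Hypothetical sketch: put the wiki into MediaWiki's own read-only mode
when free disk space gets too low, assuming LocalSettings.php points
$wgReadOnlyFile at READONLY_FILE below."""

import os

READONLY_FILE = "/var/lock/mediawiki-readonly"   # assumed path
THRESHOLD_BYTES = 5 * 1024 ** 3                  # assumed 5 GB floor
MESSAGE = ("Disk space has reached an unsafe level; "
           "editing is disabled until a sysadmin frees up space.")

def free_bytes(path="/"):
    """Free disk space on the filesystem holding `path`."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def update_readonly_flag():
    if free_bytes() < THRESHOLD_BYTES:
        # MediaWiki treats a non-empty $wgReadOnlyFile as "go read-only"
        # and shows its contents as the reason.
        with open(READONLY_FILE, "w") as f:
            f.write(MESSAGE)
    elif os.path.exists(READONLY_FILE):
        os.remove(READONLY_FILE)   # space recovered; re-enable editing

if __name__ == "__main__":
    update_readonly_flag()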
~Daniel Friesen (Dantman, Nadir-Seen-Fire)
~Profile/Portfolio: http://nadir-seen-fire.com
-The Nadir-Point Group (http://nadir-point.com)
--Its Wiki-Tools subgroup (http://wiki-tools.com)
--The ElectronicMe project (http://electronic-me.org)
-Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG)
--Animepedia (http://anime.wikia.com)
--Narutopedia (http://naruto.wikia.com)
Brion Vibber wrote:
Tim Starling wrote:
The server was in Nagios and was reporting a critical disk-full status. I'm not sure exactly when it entered that state.
I'm inclined to think that the issue here is not the need for more technology, but rather the need for procedures. There's no point in having monitoring if nobody is watching the output.
There are roughly three major areas to fix:
Fail-safe
Emergency notifications
General monitoring procedure
Fail-safe:
Elevators have emergency brakes, water heaters have pressure release valves, and electric grids have circuit breakers.
No matter how good your procedures are, it's a simple fact that human error or a surprise situation will lead to something being missed. A good system is built so that if it fails, it is much more likely to fail in a safe manner than a catastrophic one.
MySQL fails unsafely in the disk-full case: changes are still written into InnoDB, but the binlogs cannot be updated, so replication fails and there's no way to re-sync the slaves without things getting really ugly.
A fail-safe mode would be for it to switch into read-only mode when the disk fills up. This would:
a) Make it impossible to corrupt the data and de-sync the replication stream
b) Make it impossible to miss -- the site will be rejecting edits very visibly, triggering immediate requests for sysadmin intervention.
If we can't get this fixed in MySQL itself, it should be easy to rig up a disk space monitor which puts MySQL into read-only mode if free disk space falls below a threshold.
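For example, something like this run from cron every minute or so -- an untested sketch; the data directory path, credentials file, and threshold are placeholders:

#!/usr/bin/env python
"""Rough sketch of a disk space watchdog: flip MySQL into read-only
mode when free space on the data partition drops below a floor."""

import os
import MySQLdb   # MySQL-python, assumed available on the db hosts

DATA_DIR = "/var/lib/mysql"          # assumed InnoDB/binlog partition
THRESHOLD_BYTES = 10 * 1024 ** 3     # assumed 10 GB floor

def free_bytes(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def main():
    if free_bytes(DATA_DIR) >= THRESHOLD_BYTES:
        return
    conn = MySQLdb.connect(read_default_file="/root/.my.cnf")
    try:
        cur = conn.cursor()
        # Note: read_only does not restrict accounts with SUPER,
        # so the app must not connect with a SUPER-privileged user.
        cur.execute("SET GLOBAL read_only = 1")
    finally:
        conn.close()
        # A real version would also page the sysadmins here.

if __name__ == "__main__":
    main()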
We talked about this the last time this happened, but it never got implemented. Time to fix it!
Once a fail-safe is in place, then we can worry about the details of ongoing maintenance like, oh, making sure old binlogs are rotated automatically instead of whenever someone remembers to do it.
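Again just a sketch -- the 14-day retention window is made up, and setting expire_logs_days in my.cnf would do the same job server-side without a script:

#!/usr/bin/env python
"""Sketch of automatic binlog rotation, e.g. run nightly from cron."""

import MySQLdb

def purge_old_binlogs(days=14):
    conn = MySQLdb.connect(read_default_file="/root/.my.cnf")
    try:
        cur = conn.cursor()
        # Drops binlogs older than the retention window; any slave that
        # still needs them must already have replicated past this point.
        cur.execute("PURGE BINARY LOGS BEFORE NOW() - INTERVAL %d DAY" % days)
    finally:
        conn.close()

if __name__ == "__main__":
    purge_old_binlogs()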
Emergency notifications -- the boy who cried wolf:
Our Nagios monitoring system spews lots of notifications around, but it spews *far too many*, most of which are utterly unimportant. As a result, we keep dialing down how often we see them so they don't annoy us, and nobody notices when something important finally comes up.
I would thus recommend that only absolute emergency situations, such as the tripping of a fail-safe shutdown, should trigger something like an SMS blast to the sysadmins.
Notifications like "disk space will be used up in a month" should stick with the general monitoring.
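The routing policy I mean is roughly this -- a toy sketch, where send_sms() and log_to_monitoring() are hypothetical placeholders, not anything we actually have:

EMERGENCY = "emergency"   # fail-safe tripped, site read-only, etc.
WARNING = "warning"       # "disk will fill in a month" and similar

def notify(severity, message, send_sms, log_to_monitoring):
    if severity == EMERGENCY:
        send_sms("sysadmins", message)       # wake somebody up
    log_to_monitoring(severity, message)     # everything is still recorded

# Example:
# notify(EMERGENCY, "MySQL switched to read-only: disk full on db1",
#        send_sms=my_sms_gateway, log_to_monitoring=my_nagios_log)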
Monitoring procedure:
We have lots of shiny graphs showing us how much CPU is in use, how much IO is going on, how much disk space is free, etc. But indeed, we don't have a solid habit of "check the free space every week and decide when we need to get more."
These are issues we know about, but haven't yet built up procedures for.
"Disk full" in this case isn't something we need to explicitly check for every day; ongoing usage is predictable, and automated reports should be able to tell us how long we have before we hit a "fix it now" threshold.
-- brion