Same old story: a full disk on a core master server (ixia) caused binlogs to stop 10 minutes before the issue was noticed and I switched it into read-only mode. Writes continued during those 10 minutes.
I'm resyncing from the master. The s2 wikis are in read-only mode while that happens; it seems to be taking about 1.5 hours in total.
The server was in nagios and was reporting a critical disk full status. I'm not sure exactly when it entered that state.
I'm inclined to think that the issue here is not the need for more technology, but rather the need for procedures. There's no point in having monitoring if nobody is watching the output.
If it had happened an hour later, I would have been in bed, and nobody else was around. The users in #wikimedia-tech tell me they would have waited for hours before trying to phone anyone. So we need out-of-hours response procedures as well.
I think we need:
* A systems checklist to be checked daily, independently by two different people, and cross-checked weekly;
* An SMS paging system for out-of-hours response, both automated and manual (user-driven).
-- Tim Starling
Tim Starling wrote:
The server was in nagios and was reporting a critical disk full status. I'm not sure exactly when it entered that state.
I'm inclined to think that the issue here is not the need for more technology, but rather the need for procedures. There's no point in having monitoring if nobody is watching the output.
There are roughly three major areas to fix:
1) Fail-safe
2) Emergency notifications
3) General monitoring procedure
Fail-safe:
Elevators have emergency brakes, water heaters have pressure release valves, and electric grids have circuit breakers.
No matter how good your procedures are, it's a simple fact that human error or a surprise situation will lead to something being missed. A good system is built so that if it fails, it is much more likely to fail in a safe manner than a catastrophic one.
MySQL fails unsafely in the disk-full case: changes are still written into InnoDB, but binlogs cannot be updated, so replication fails and there's no way to re-sync slaves without getting real ugly.
A fail-safe mode would be for it to switch into read-only mode when the disk fills up. This would:
a) Make it impossible to corrupt the data and de-sync the replication stream
b) Make it impossible to miss -- the site will be rejecting edits very visibly, triggering immediate requests for sysadmin intervention.
If we can't get this fixed in MySQL itself, it should be easy to rig up a disk space monitor which will put MySQL into read-only mode if disk free is allowed to fall below a threshold.
We talked about this the last time this happened, but it never got implemented. Time to fix it!
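A minimal sketch of the disk-space monitor described above, assuming it runs from cron on the master. The data directory, the 5% threshold, and the function names are hypothetical, and the `mysql` client is assumed to pick up credentials from an option file. Nothing here reflects how the cluster is actually configured.

```python
import shutil
import subprocess

# Hypothetical values for illustration; neither comes from the thread.
DATA_DIR = "/var/lib/mysql"
MIN_FREE_FRACTION = 0.05


def needs_read_only(free_bytes, total_bytes, min_free_fraction=MIN_FREE_FRACTION):
    """Return True when free space has dropped below the safety threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < min_free_fraction


def read_only_command():
    """Build the mysql client invocation that makes the server read-only."""
    return ["mysql", "-e", "SET GLOBAL read_only = ON"]


def check_once(data_dir=DATA_DIR, dry_run=True):
    """Check free space once; flip the server to read-only if it is too low."""
    usage = shutil.disk_usage(data_dir)
    tripped = needs_read_only(usage.free, usage.total)
    if tripped and not dry_run:
        # Assumption: credentials come from the client's option file.
        subprocess.run(read_only_command(), check=True)
    return tripped
```

Note that `read_only` does not stop accounts with the SUPER privilege from writing, so this only acts as a fail-safe if the application connects as an ordinary user.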
Once a failsafe is in, then we can worry about the details of ongoing maintenance like, oh, making sure old binlogs are rotated automatically instead of when someone remembers to do it.
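For the binlog rotation, one possible sketch is a scheduled job that builds a `PURGE BINARY LOGS BEFORE` statement from a retention window. The function name and the seven-day window in the test are hypothetical. Setting the server's `expire_logs_days` option is an alternative that needs no script at all.

```python
from datetime import datetime, timedelta


def purge_statement(now, keep_days):
    """Build a statement that drops binlogs older than keep_days days."""
    if keep_days < 1:
        raise ValueError("keep_days must be at least 1")
    cutoff = now - timedelta(days=keep_days)
    return "PURGE BINARY LOGS BEFORE '{}'".format(
        cutoff.strftime("%Y-%m-%d %H:%M:%S")
    )
```

The retention window has to be longer than the worst replication lag you expect, since a slave that still needs a purged log cannot catch up from it.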
Emergency notifications -- the boy who cried wolf:
Our Nagios monitoring system spews lots of notifications around, but it spews *far too many* notifications, most of which are utterly unimportant. As a result, we keep pruning out how often we see them so they don't annoy us, and nobody notices when something important finally comes up.
I would thus recommend that only absolute emergency situations, such as the triggering of a fail-safe shutdown, should do something like an SMS blast to the sysadmins.
Notifications like "disk space will be used up in a month" should stick with the general monitoring.
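The split between the two kinds of notification could be expressed as a small routing rule. This is only an illustration of the policy; the severity levels and channel names are made up and do not correspond to any Nagios configuration.

```python
from enum import Enum


class Severity(Enum):
    INFO = 1        # e.g. "disk space will be used up in a month"
    WARNING = 2
    EMERGENCY = 3   # e.g. a fail-safe has tripped


def channels_for(severity):
    """Only emergencies page people; everything else stays on the dashboard."""
    if severity is Severity.EMERGENCY:
        return ["sms", "dashboard"]
    return ["dashboard"]
```

Keeping the paging list this short is the point: if an SMS arrives, it should always mean someone has to act now.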
Monitoring procedure:
We have lots of shiny graphs showing us how much CPU is in use, how much IO is going on, how much disk space is free, etc. But indeed, we don't have a solid habit of "check the free space every week and decide when we need to get more."
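One way to turn the disk-space graphs into a "when do we need more" answer is a straight-line forecast over recent usage samples. The sketch below fits a least-squares line and reports how many days remain before a chosen threshold. The function name and the sample format are assumptions, and real usage may not be linear.

```python
def days_until_threshold(samples, threshold_bytes):
    """Estimate days from the last sample until usage reaches the threshold.

    samples is a list of (day, used_bytes) pairs covering at least two
    distinct days. Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover at least two distinct days")
    # Least-squares slope (bytes per day) and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_hit = (threshold_bytes - intercept) / slope
    last_day = samples[-1][0]
    return max(0.0, day_hit - last_day)
```

A weekly report could print this number per server and flag anything under, say, 30 days, which is again a hypothetical cutoff.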
These are issues we know about, but haven't yet built up the procedures.
"Disk full" in this case isn't something we need to explicitly check for every day; ongoing usage is predictable, and automated reports should be able to tell us how long we have before we hit a "fix it now" threshold.
-- brion
What about triggering MediaWiki's internal read-only mode as well? Then rather than visible cryptic errors about a locked sql database, we'd have MediaWiki's read-only mode warnings, and could have a nice ro-mode message like "Disk space has reached an unsafe level...".
~Daniel Friesen (Dantman, Nadir-Seen-Fire) ~Profile/Portfolio: http://nadir-seen-fire.com -The Nadir-Point Group (http://nadir-point.com) --It's Wiki-Tools subgroup (http://wiki-tools.com) --The ElectronicMe project (http://electronic-me.org) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) --Animepedia (http://anime.wikia.com) --Narutopedia (http://naruto.wikia.com)
On a similar topic, why is nagios.wikimedia.org behind a password now? Does it really need to be secured?
On Sun, Oct 12, 2008 at 5:46 PM, Daniel Friesen <dan_the_man@telus.net> wrote:
What about triggering MediaWiki's internal read-only mode as well? Then rather than visible cryptic errors about a locked sql database, we'd have MediaWiki's read-only mode warnings, and could have a nice ro-mode message like "Disk space has reached an unsafe level...".
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Dan Collins wrote:
On a similar topic, why is nagios.wikimedia.org behind a password now? Does it really need to be secured?
Apparently in order to be able to update things through the nagios UI, we needed to enable password protection. There's probably some sane way of still allowing read-only visitors without demanding a password, though.
-- brion
Apparently the nagios developers are so confident that nagios's command interface has arbitrary shell execution vulnerabilities that they go to extreme lengths to prevent you from enabling it in an environment without password protection.
I would chalk it up to paranoia, except that Nagios NRPE has a similar protection against enabling parameters to check commands, and it turns out that those parameters are indeed passed through to the shell without proper escaping.
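To illustrate the class of bug (this is not NRPE's actual code), the sketch below contrasts pasting a parameter straight into a shell command line with quoting it first. The plugin name and function names are hypothetical.

```python
import shlex


def unsafe_command(plugin, arg):
    """Paste the argument into the command line as-is (vulnerable)."""
    return "{} {}".format(plugin, arg)


def safe_command(plugin, arg):
    """Quote each piece so the shell treats it as a single literal word."""
    return "{} {}".format(shlex.quote(plugin), shlex.quote(arg))
```

With an argument such as `/; echo pwned`, the unsafe form hands the shell two commands, while the quoted form passes one literal argument. Avoiding the shell entirely by passing an argument list to the process launcher is safer still.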
-- Tim Starling
Ah, Nagios, how do I love/hate thee? Let me count the ways!
-- brion
Daniel Friesen wrote:
What about triggering MediaWiki's internal read-only mode as well? Then rather than visible cryptic errors about a locked sql database, we'd have MediaWiki's read-only mode warnings, and could have a nice ro-mode message like "Disk space has reached an unsafe level...".
That might be slightly trickier at the moment, since it could need a central place to read that from, which isn't on an NFS server. :)
A memcache key or something might not be too evil.
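A rough sketch of the memcache-key idea, using an in-memory stand-in so it runs without a memcached server. The key name, the `DictCache` class, and the helper functions are all hypothetical; a real client would only need to offer the same `get`, `set`, and `delete` methods.

```python
# Hypothetical key name; nothing in the thread specifies one.
READ_ONLY_KEY = "wiki:read-only-reason"


class DictCache:
    """In-memory stand-in for a memcached client (get/set/delete only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


def set_read_only(cache, reason):
    """Called by the disk monitor when the fail-safe trips."""
    cache.set(READ_ONLY_KEY, reason)


def clear_read_only(cache):
    """Called by a sysadmin once the problem is fixed."""
    cache.delete(READ_ONLY_KEY)


def read_only_reason(cache):
    """Return the message to show editors, or None if writes are allowed."""
    return cache.get(READ_ONLY_KEY)
```

One caveat with this approach: a cache eviction or restart silently clears the flag, so it can only complement the database-level read-only switch, not replace it.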
-- brion