Read-only on s2 wikis


Tim Starling

12 Oct 2008

6:52 p.m.

Same old story, disk full on a core master server (ixia) caused binlogs to stop 10 minutes before the issue was noticed and I switched it into read-only mode. Writes continued during those 10 minutes.

I'm resyncing from the master; the s2 wikis are in read-only mode while that happens. It seems to be taking about 1.5 hours in total.

The server was in nagios and was reporting a critical disk full status. I'm not sure exactly when it entered that state.

I'm inclined to think that the issue here is not the need for more technology, but rather the need for procedures. There's no point in having monitoring if nobody is watching the output.

If it had happened an hour later, I would have been in bed, and nobody else was around. The users in #wikimedia-tech tell me they would have waited for hours before trying to phone anyone. So we need out-of-hours response procedures as well.

I think we need:

* A systems checklist to be checked daily, independently by two different people and cross-checked weekly;
* An SMS paging system for out-of-hours response, both automated and manual (user-driven).

-- Tim Starling

Brion Vibber

13 Oct

12:13 a.m.

Tim Starling wrote:

...

The server was in nagios and was reporting a critical disk full status. I'm not sure exactly when it entered that state.

I'm inclined to think that the issue here is not the need for more technology, but rather the need for procedures. There's no point in having monitoring if nobody is watching the output.

There are roughly three major areas to fix:

1) Fail-safe

2) Emergency notifications

3) General monitoring procedure

Fail-safe:

Elevators have emergency brakes, water heaters have pressure release valves, and electric grids have circuit breakers.

No matter how good your procedures are, it's a simple fact that human error or a surprise situation will lead to something being missed. A good system is built so that if it fails, it is much more likely to fail in a safe manner than a catastrophic one.

MySQL fails unsafely in the disk-full case: changes are still written into InnoDB, but binlogs cannot be updated, so replication fails and there's no way to re-sync slaves without getting real ugly.

A fail-safe mode would be for it to switch into read-only mode when the disk fills up. This would:

a) Make it impossible to corrupt the data and de-sync the replication stream

b) Make it impossible to miss -- the site will be rejecting edits very visibly, triggering immediate requests for sysadmin intervention.

If we can't get this fixed in MySQL itself, it should be easy to rig up a disk space monitor which will put MySQL into read-only mode if disk free is allowed to fall below a threshold.

We talked about this the last time this happened, but it never got implemented. Time to fix it!
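As a rough illustration of the disk space monitor described above, here is a minimal Python sketch. The data directory path, the 5% threshold, and the function names are all assumptions made for the example, not details of the actual setup. It relies on the standard `SET GLOBAL read_only = ON` statement, which does not block accounts holding the SUPER privilege or the replication threads, so it only helps if the application connects without SUPER.

```python
import shutil
import subprocess

# Hypothetical values: the data directory and the 5% threshold are
# illustrative, not taken from the real server configuration.
DATA_DIR = "/var/lib/mysql"
MIN_FREE_FRACTION = 0.05


def free_fraction(path):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def set_mysql_read_only():
    """Ask the local MySQL server to reject writes from ordinary clients.

    Assumes the `mysql` command-line client is installed and can log in
    with sufficient privileges (for example via ~/.my.cnf).
    """
    subprocess.run(["mysql", "-e", "SET GLOBAL read_only = ON"], check=True)


def check_and_protect(path=DATA_DIR, threshold=MIN_FREE_FRACTION,
                      get_free=free_fraction,
                      make_read_only=set_mysql_read_only):
    """Switch to read-only mode if free space is below the threshold.

    Returns True if the fail-safe was triggered, False otherwise.
    """
    if get_free(path) < threshold:
        make_read_only()
        return True
    return False
```

Something like `check_and_protect()` could be run every minute from cron on the master. The `get_free` and `make_read_only` parameters exist only so the logic can be exercised without a real MySQL server.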

Once a failsafe is in, then we can worry about the details of ongoing maintenance like, oh, making sure old binlogs are rotated automatically instead of when someone remembers to do it.

Emergency notifications -- the boy who cried wolf:

Our Nagios monitoring system spews lots of notifications around, but it spews *far too many* notifications, most of which are utterly unimportant. As a result, we keep pruning out how often we see them so they don't annoy us, and nobody notices when something important finally comes up.

I would thus recommend that only absolute emergency situations, such as the triggering of a fail-safe shutdown, should do something like an SMS blast to the sysadmins.

Notifications like "disk space will be used up in a month" should stick with the general monitoring.
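The split between paging and general monitoring could be expressed as a small routing rule. This is only a sketch: the severity levels and channel names are invented for the example and do not correspond to any Nagios configuration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity levels, ordered from least to most urgent."""
    INFO = 0
    WARNING = 1
    CRITICAL = 2
    EMERGENCY = 3


def route(severity):
    """Return the channel a notification should go to.

    Only emergencies (such as a fail-safe being triggered) page people
    by SMS; everything else stays in the general monitoring view.
    """
    if severity >= Severity.EMERGENCY:
        return "sms"
    return "monitoring"
```

Under this rule a "disk space will be used up in a month" warning would be routed to "monitoring", while a triggered fail-safe would be routed to "sms".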

Monitoring procedure:

We have lots of shiny graphs showing us how much CPU is in use, how much IO is going on, how much disk space is free, etc. But indeed, we don't have a solid habit of "check the free space every week and decide when we need to get more."

These are issues we know about, but we haven't yet built up the procedures for them.

"Disk full" in this case isn't something we need to explicitly check for every day; ongoing usage is predictable, and automated reports should be able to tell us how long we have before we hit a "fix it now" threshold.
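One way such an automated report could estimate the remaining time is a straight-line fit over recent usage samples. The sketch below assumes roughly linear growth; the function name, the sample format, and the 90% "fix it now" threshold are assumptions for illustration.

```python
def days_until_threshold(samples, capacity_bytes, threshold_fraction=0.9):
    """Estimate the days left until disk usage crosses the threshold.

    `samples` is a list of (day, used_bytes) pairs, oldest first. A
    least-squares line is fitted through them and extended forward.
    Returns None if usage is not growing, and 0.0 if the fitted line
    has already crossed the threshold by the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")

    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n
    spread = sum((day - mean_day) ** 2 for day, _ in samples)
    if spread == 0:
        raise ValueError("samples must cover more than one day")

    # Slope is the estimated growth in bytes per day.
    slope = sum((day - mean_day) * (used - mean_used)
                for day, used in samples) / spread
    if slope <= 0:
        return None

    intercept = mean_used - slope * mean_day
    limit = capacity_bytes * threshold_fraction
    crossing_day = (limit - intercept) / slope
    last_day = samples[-1][0]
    return max(crossing_day - last_day, 0.0)
```

With samples of 100, 110 and 120 units used on days 0, 1 and 2, a capacity of 200 and a 90% threshold, the fitted growth is 10 units per day and the estimate is 6 days.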

-- brion


Daniel Friesen

12:46 a.m.

What about triggering MediaWiki's internal read-only mode as well? Then rather than visible cryptic errors about a locked SQL database, we'd have MediaWiki's read-only mode warnings, and could have a nice ro-mode message like "Disk space has reached an unsafe level...".

~Daniel Friesen (Dantman, Nadir-Seen-Fire)
~Profile/Portfolio: http://nadir-seen-fire.com
-The Nadir-Point Group (http://nadir-point.com)
--It's Wiki-Tools subgroup (http://wiki-tools.com)
--The ElectronicMe project (http://electronic-me.org)
-Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG)
--Animepedia (http://anime.wikia.com)
--Narutopedia (http://naruto.wikia.com)

Dan Collins

12:48 a.m.

On a similar topic, why is nagios.wikimedia.org behind a password now? Does it really need to be secured?

-- DCollins/ST47
Administrator, en.wikipedia.org
Channel Operator, irc.freenode.net/#wikipedia
Maintainer, Perlwikipedia module

Brion Vibber

12:50 a.m.

Dan Collins wrote:

...

On a similar topic, why is nagios.wikimedia.org behind a password now? Does it really need to be secured?

Apparently in order to be able to update things through the nagios UI, we needed to enable password protection. There's probably some sane way of still allowing read-only visitors without demanding a password, though.

-- brion


Tim Starling

5:55 a.m.

Brion Vibber wrote:

...

Dan Collins wrote:

...
On a similar topic, why is nagios.wikimedia.org behind a password now? Does it really need to be secured?

Apparently in order to be able to update things through the nagios UI, we needed to enable password protection. There's probably some sane way of still allowing read-only visitors without demanding a password, though.

Apparently the nagios developers are so confident that nagios's command interface has arbitrary shell execution vulnerabilities that they go to extreme lengths to prevent you from enabling it in an environment without password protection.

I would chalk it up to paranoia, except that Nagios NRPE has a similar protection against enabling parameters to check commands, and it turns out that those parameters are indeed passed through to the shell without proper escaping.
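To make the class of problem concrete, here is a small Python sketch of what goes wrong when a caller-supplied parameter is interpolated into a shell command line, and two common ways to avoid it. This is not NRPE's code (NRPE is written in C); the function names are invented and a harmless `echo` stands in for a check command. It assumes a POSIX system with `sh` and `echo` available.

```python
import shlex
import subprocess


def run_check_unsafe(argument):
    """Interpolate the argument straight into a shell command line.

    Shell metacharacters in `argument` are interpreted by the shell,
    so a caller can append commands of their own.
    """
    result = subprocess.run(f"echo checking {argument}", shell=True,
                            capture_output=True, text=True)
    return result.stdout


def run_check_quoted(argument):
    """Same command, but with the argument quoted for the shell."""
    result = subprocess.run(f"echo checking {shlex.quote(argument)}",
                            shell=True, capture_output=True, text=True)
    return result.stdout


def run_check_no_shell(argument):
    """Avoid the shell entirely by passing an argument list."""
    result = subprocess.run(["echo", "checking", argument],
                            capture_output=True, text=True)
    return result.stdout
```

Passing `"x; echo INJECTED"` to `run_check_unsafe` runs a second command, whereas the other two variants treat the whole string as a single literal argument.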

-- Tim Starling

Brion Vibber

8:03 a.m.

Tim Starling wrote:

...

Apparently the nagios developers are so confident that nagios's command interface has arbitrary shell execution vulnerabilities that they go to extreme lengths to prevent you from enabling it in an environment without password protection.

I would chalk it up to paranoia, except that Nagios NRPE has a similar protection against enabling parameters to check commands, and it turns out that those parameters are indeed passed through to the shell without proper escaping.

Ah, Nagios, how do I love/hate thee? Let me count the ways!

-- brion


Brion Vibber

12:49 a.m.

Daniel Friesen wrote:

...

What about triggering MediaWiki's internal read-only mode as well? Then rather than visible cryptic errors about a locked SQL database, we'd have MediaWiki's read-only mode warnings, and could have a nice ro-mode message like "Disk space has reached an unsafe level...".

That might be slightly trickier at the moment, since it could need a central place to read that from, which isn't on an NFS server. :)

A memcache key or something might not be too evil.
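A shared-cache flag along these lines might look like the following sketch. It is written in Python purely for illustration (MediaWiki itself is PHP); the key name, the helper functions, and the in-memory `FakeCache` stand-in for a memcached client are all hypothetical.

```python
# Hypothetical key name; a real deployment would pick its own scheme.
READ_ONLY_KEY = "s2:read-only-reason"


class FakeCache:
    """In-memory stand-in offering get/set/delete like a memcached client."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


def enter_read_only(cache, reason):
    """Publish the read-only flag, with a message to show to users."""
    cache.set(READ_ONLY_KEY, reason)


def leave_read_only(cache):
    """Clear the flag so that edits are accepted again."""
    cache.delete(READ_ONLY_KEY)


def try_edit(cache, save):
    """Run `save` unless the flag is set; return a status message."""
    reason = cache.get(READ_ONLY_KEY)
    if reason is not None:
        return f"This wiki is in read-only mode: {reason}"
    save()
    return "saved"
```

Because a cache can evict or lose keys, a flag like this would only improve the message shown to users; it would complement the database-level read-only switch rather than replace it.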

-- brion


wikitech-l@lists.wikimedia.org
