We've had our main fileserver (zwinger) accidentally rebooted a couple of times during configuration of new servers today, resulting in downtime.
Everybody, PLEEEEEEEASE be careful when rebooting that you're typing in the window you think you are.
And when the site *does* blow up in some new and inventive way, don't forget to log it on the admin log at http://wp.wikidev.net/Server_admin_log
That is all.
-- brion vibber (brion @ pobox.com)
I changed the bash prompt for root on zwinger to something distinctive. It's all very well to say "be careful", but unfortunately our simple mammalian brains aren't designed to detect when one familiar bit of English text is replaced by another. Colours, flashing lights, pretty pictures -- these work better. They allow faster recognition and require less concentration.
-- Tim Starling
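The actual prompt used on zwinger is not shown in this thread. As a rough sketch of the idea only, something like the following could go in root's ~/.bashrc; the host list, the colours and the `build_prompt` helper are all assumptions made for illustration.

```sh
# Sketch only: not the prompt actually installed on zwinger.
# critical_hosts is a hypothetical space-separated list of hosts
# that should get a loud prompt.
critical_hosts="zwinger"

# Print a PS1 string for the given short host name:
# bold white on red for critical hosts, a plain prompt otherwise.
build_prompt() {
    host=$1
    case " $critical_hosts " in
        *" $host "*)
            # \[ ... \] marks the colour codes as zero-width for bash;
            # \033[1;37;41m is bold white text on a red background.
            printf '%s' '\[\033[1;37;41m\] !!! \h !!! \[\033[0m\] \w \$ '
            ;;
        *)
            printf '%s' '\h:\w \$ '
            ;;
    esac
}

# Use the short host name (everything before the first dot).
this_host=$(uname -n)
this_host=${this_host%%.*}
PS1=$(build_prompt "$this_host")
```

Keeping the decision in one function means the same file can be copied to every server, and only the hosts named in the list get the red banner.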
Of course I fully agree. No human can really be blamed for this sort of error; our simple mammalian brains are simply not suitable for this type of repetitive work. (Jeronim typed reboot about 100 times that day, on purpose, on other servers.)
The one thing that came to my mind is: why does anyone log into zwinger in the first place? Since it's this horribly frightening SPOF, ought we not to avoid even _looking_ at it funny?
--Jimbo
Am I correct in thinking that the problem seems to be centred on Zwinger's being a central NFS server for a number of crucial read-only configuration files used by a large number of servers, and apps behaving in a peculiar (and usually disastrous) way when NFS dies?
How about just keeping local copies on each server, running rsync, rather than NFS, on the master, and using rsync to keep all the local copies on the slaves in sync? The files can even be kept "in the same place" as currently, using symbolic links. If the master falls over, all of its clients continue to work, and they will continue updating when the master is either brought back up, or replaced. A CNAME would probably be a good way of designating the master.
No new technology needed, which is rather better than my original idea of implementing reliable NFS failover at the client, which I won't go into further, other than to say that it seemed a good idea until I considered (a) the wrongness of re-inventing the wheel in a very complex way, and (b) the fact that the best way to keep the multiple redundant NFS servers in sync would be rsync...
-- Neil
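The rsync-plus-symlink scheme proposed above might look roughly like this on each client. This is a sketch under assumptions, not anything deployed: the CNAME `config-master.example.org`, the rsync module name `config`, both paths and both helper functions are hypothetical.

```sh
#!/bin/sh
# Sketch only: every name and path below is an assumption.
MASTER="${MASTER:-config-master.example.org}"         # hypothetical CNAME for the current master
LOCAL_COPY="${LOCAL_COPY:-/var/local/shared-config}"  # hypothetical local mirror on each client
LINK_PATH="${LINK_PATH:-/mnt/shared-config}"          # hypothetical path the apps read today

# Pull the config tree from a source (an rsync URL or a directory)
# into the local copy. If the master is unreachable, rsync fails and
# the existing local copy is left in place, so clients keep working.
pull_config() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    if rsync -a "$src" "$dest/"; then
        echo "synced from $src"
    else
        echo "sync from $src failed; keeping existing copy in $dest" >&2
        return 1
    fi
}

# Make the old path a symlink to the local copy, so the files stay
# "in the same place". Assumes the link path is not an existing
# real directory (e.g. the old NFS mount has been removed).
link_config() {
    ln -sfn "$1" "$2"
}
```

A cron job on each client could then run `pull_config "rsync://$MASTER/config/" "$LOCAL_COPY"` every few minutes, with `link_config "$LOCAL_COPY" "$LINK_PATH"` done once at setup. Repointing the CNAME at a replacement master would need no change on the clients.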
Neil Harris wrote:
Am I correct in thinking that the problem seems to be centred on Zwinger's being a central NFS server for a number of crucial read-only configuration files used by a large number of servers, and apps behaving in a peculiar (and usually disastrous) way when NFS dies?
How about just keeping local copies on each server, running rsync,
[snip]
Did you see my previous mail to this list on this topic, subject "SPOF notes"?
-- brion vibber (brion @ pobox.com)
Ah. I have now...
-- Neil
On 9/19/05, Jimmy Wales jwales@wikia.com wrote:
The one thing that came to my mind is: why does anyone log into zwinger in the first place? Since it's this horribly frightening SPOF, ought we to not avoid even _looking_ at it funny?
One reason is that some things are installed on it that aren't (afaik) installed anywhere else, like dsh(1) which is required for scap(1) and sync-file(1)
*nod*
This doesn't sound like a particularly hard technical challenge to solve. ;-)
--Jimbo
dsh is already installed on albert, and I've got a cron job set up to synchronise the node group files. I did this based on the theory that if zwinger fails, we'll use albert instead. Trouble is, albert is a SPOF too. We could set up another host as a bastion, but as Mark said on IRC, zwinger is fine as a bastion, but it's underpowered for an NFS server (no RAID). Once the next batch of DB servers arrives, we can move one of the old DB servers (e.g. bacon or suda) into NFS duty, and thereby relieve zwinger of that task.
-- Tim Starling
Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l
Jimmy Wales wrote:
The one thing that came to my mind is: why does anyone log into zwinger in the first place? Since it's this horribly frightening SPOF, ought we to not avoid even _looking_ at it funny?
Because nobody cares about implementing a secured bastion host and setting up proper groups. The easiest fix is probably to move the NFS service onto another couple of redundant servers.
On 19/09/05, Brion Vibber brion@pobox.com wrote:
We've had our main fileserver (zwinger) accidentally rebooted a couple of times during configuration of new servers today, resulting in downtime.
Everybody, PLEEEEEEEASE be careful when rebooting that you're typing in the window you think you are.
We had this problem a few times on the shared server (called 'museum') in our student digs, which was also acting as the broadband router. Eventually, my housemate replaced 'halt' and 'reboot' with symlinks to shell scripts, to try and stop it happening again. I guess if you wanted to be *really* sure, you could use some colourful and flashy text to highlight the machine name too...
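#!/bin/sh

# Remind the user which machine this is before halting it.
echo 'Hey!'
echo '(This is museum.)'
printf 'Is this really the computer you want to halt? '
read X

# Only an explicit "y" or "yes" goes ahead with the real halt.
if [ "$X" = "y" ] || [ "$X" = "yes" ]; then
    echo Very well, halting...
    /sbin/halt
else
    echo What were you thinking?
fi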
wikitech-l@lists.wikimedia.org