On Tue, Feb 24, 2009 at 6:49 AM, Andrew Garrett andrew@werdn.us wrote:
> On Tue, Feb 24, 2009 at 1:07 PM, Robert Ullmann rlullmann@gmail.com wrote:
> > Really? I mean is this for real?
> > The sequence ought to be something like: breaker trips, the monitor shows within a minute or two that 4 servers are offline and not scheduled to be. In the next 5 minutes someone looks at the server(s), notes that there is no AC power, walks directly to the panel and resets the breaker. How is this *not* done? I'm sorry, I just don't get it. I've run data centres, and it just is not possible to have servers down for lack of AC power for more than a few minutes unless there is a fault one can't locate. (Or grid down, and running a subset on the generators ;-)
> > Can someone explain all this? Is the whole thing just completely beyond the resources available to manage it?
> Constructive suggestions for improvement are far more welcome than complaints and outrage.
> If you have no suggestions for improvement, it is perhaps more prudent to express concern that dumps are not working and to wait for a response. This is admittedly less fun than piecing together information and "lining up" those responsible for something not being operational.
Andrew: this is NOT FUN AT ALL. Do you think it is "fun" to have to complain bitterly and repeatedly because simply reporting critical-down problems elicits little or no reply and no corrective action for days and weeks? Fun? Fun?
Okay, I'll put it this way: the following should be done:
- All servers should be monitored on several levels (ping, various queries, checking processes).
- Someone should be "watching" the monitor 24x7 (being right there, or by SMS, whatever ;).
- When a server is reported down (in this case hard; it won't reply to ping), it should be physically looked at within minutes.
- If it has no AC power, the circuit breaker is the first thing to check.
- When it is restarted, the things it was doing should be restarted too (this has not been done yet as of this writing).
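
To make the first few points concrete, here is a rough sketch of what "monitored on several levels" could look like. It is only an illustration, not a description of any monitoring actually in use: the function names, the host names, and the idea of a "scheduled down" set are all assumptions of mine, and the TCP connect is a stand-in for a real ping (ICMP usually needs extra privileges).

```python
import socket
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Crude reachability check: can we open a TCP connection to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


@dataclass
class HostReport:
    host: str
    failed: List[str]  # names of the checks that failed, in the order they ran
    alert: bool        # True if someone should go and look at this host


def check_host(host: str,
               checks: Dict[str, Callable[[str], bool]],
               scheduled_down: Set[str]) -> HostReport:
    """Run every check level against one host and decide whether to alert."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check(host)
        except Exception:
            ok = False  # a check that blows up counts as a failure
        if not ok:
            failed.append(name)
    # Only alert for hosts that are down and NOT scheduled to be down.
    alert = bool(failed) and host not in scheduled_down
    return HostReport(host, failed, alert)


def hosts_needing_attention(hosts: Iterable[str],
                            checks: Dict[str, Callable[[str], bool]],
                            scheduled_down: Set[str]) -> List[HostReport]:
    """Return reports only for the hosts that should trigger an alert."""
    reports = (check_host(h, checks, scheduled_down) for h in hosts)
    return [r for r in reports if r.alert]
```

In practice one would pass something like `{"ping": lambda h: tcp_reachable(h, 22), ...}` as `checks`, run it every minute or so, and send the resulting list to whoever is on call (SMS or otherwise). The point is only that the result is "unexpectedly down" rather than just "down", which is what tells someone to walk over and look.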
Now I can say these things as "constructive suggestions", but of course they are not: they are fundamental operational procedure for a data centre. Please explain to me why I should have to "suggest" them? Eh? I am confused (seriously! I am not being snarky here). What is going on?
best, Robert