On Tue, Feb 24, 2009 at 6:49 AM, Andrew Garrett <andrew(a)werdn.us> wrote:
> On Tue, Feb 24, 2009 at 1:07 PM, Robert Ullmann
> <rlullmann(a)gmail.com> wrote:
>> Really? I mean is this for real?
>>
>> The sequence ought to be something like: breaker trips, monitor shows
>> within a minute or two that 4 servers are offline, and not scheduled
>> to be. In the next 5 minutes someone looks at the server(s), notes
>> that there is no AC power, walks directly to the panel and resets the
>> breaker. How is this *not* done? I'm sorry, I just don't get it. I've
>> run data centres, and it just is not possible to have servers down for
>> AC power for more than a few minutes unless there is a fault one can't
>> locate. (Or grid down, and running a subset on the generators ;-)
>>
>> Can someone explain all this? Is the whole thing just completely
>> beyond the resource available to manage it?
>
> Constructive suggestions for improvement are far more welcome than
> complaints and outrage.
>
> If you have no suggestions for improvement, it is perhaps more prudent
> to express concern that dumps are not working and to wait for a
> response. This is admittedly less fun than piecing together
> information and "lining up" those responsible for something not being
> operational.

Andrew: this is NOT FUN AT ALL. Do you think it is "fun" to have to
complain bitterly and repeatedly because simply reporting
critical-down problems elicits little or no reply and no corrective
action for days and weeks? Fun? Fun?
Okay, I'll put it this way: the following should be done:

- All servers should be monitored, on several levels (ping, various
  queries, checking processes).
- Someone should be "watching" the monitor 24x7 (being right there, or
  by SMS, whatever ;)
- When a server is reported down (in this case hard; won't reply to
  ping) it should be physically looked at within minutes.
- If it has no AC power, the circuit breaker is the first thing to check.
- When restarted, the things it was doing should be restarted (this has
  not been done yet at this writing).

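
To make the first few points concrete, here is a rough sketch in Python of what multi-level checking could look like. It is only an illustration, not a description of any monitoring actually in place: the names (tcp_probe, HostReport, run_checks, alerts) are made up for this example, and a TCP connect stands in for ping because real ICMP ping normally needs raw-socket privileges.

```python
import socket
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time.

    Used here as a stand-in for ping (an assumption of this sketch).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


@dataclass
class HostReport:
    host: str
    results: Dict[str, bool]  # check name -> did it pass?

    @property
    def failed(self) -> List[str]:
        # Names of the checks that did not pass, in the order they ran.
        return [name for name, ok in self.results.items() if not ok]

    @property
    def hard_down(self) -> bool:
        # "Hard down" here means every single check failed.
        return bool(self.results) and not any(self.results.values())


def run_checks(host: str, checks: Dict[str, Callable[[str], bool]]) -> HostReport:
    """Run every check level against one host and collect the results."""
    return HostReport(host, {name: check(host) for name, check in checks.items()})


def alerts(reports: Iterable[HostReport],
           scheduled_down: frozenset = frozenset()) -> List[str]:
    """Build alert messages, skipping hosts that are scheduled to be offline."""
    messages = []
    for report in reports:
        if report.host in scheduled_down or not report.failed:
            continue
        if report.hard_down:
            messages.append(f"{report.host}: HARD DOWN, check power and breaker")
        else:
            messages.append(
                f"{report.host}: degraded, failed {', '.join(report.failed)}"
            )
    return messages
```

In use, one would build a dict of checks per level (for example "reach" mapped to a tcp_probe call on port 22, plus query and process checks), run run_checks over the host list every minute or so, and send whatever alerts() returns to whoever is on watch, by SMS or otherwise.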
Now I can say these things as "constructive suggestions", but they are
not, of course: they are fundamental operational procedure for a
data centre. Please explain to me why I should have to "suggest" them?
Eh? I am confused (seriously! I am not being snarky here). What is
going on?
best,
Robert