Hmm:
On Mon, Feb 23, 2009 at 9:04 PM, Russell Blau <russblau(a)hotmail.com> wrote:
> 2) Within the last hour, the server log at
> http://wikitech.wikimedia.org/wiki/Server_admin_log indicates that Rob found
> and fixed the cause of srv31 (and srv32-34) being down -- a circuit breaker
> was tripped in the data center.
So we conclude that:
Feb 12th: a breaker trips, taking four servers offline
(8 days go by, with a number of reports)
Feb 20th: it is noted that srv31 is down (was it noted that AC power was off?)
(3 days go by)
Feb 23rd: the tripped breaker is found, srv31 restarted (and 8+ hours
later, the dumps have not resumed)
Really? I mean, is this for real?
The sequence ought to be something like: breaker trips, monitor shows
within a minute or two that 4 servers are offline, and not scheduled
to be. In the next 5 minutes someone looks at the server(s), notes
that there is no AC power, walks directly to the panel and resets the
breaker. How is this *not* done? I'm sorry, I just don't get it. I've
run data centres, and it just is not possible to have servers down for
lack of AC power for more than a few minutes unless there is a fault
one can't locate. (Or the grid is down and you are running a subset on
the generators ;-)
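To make the "offline, and not scheduled to be" check concrete, here is a minimal Python sketch. The status map, the maintenance set and both function names are hypothetical, made up for illustration; I am not describing the monitoring actually in use on these servers.

```python
def unscheduled_outages(reachable, scheduled_down):
    """Return the hosts that are down but were not scheduled to be.

    reachable      -- dict mapping host name to True (up) or False (down)
    scheduled_down -- set of host names under planned maintenance
    """
    return sorted(
        host
        for host, up in reachable.items()
        if not up and host not in scheduled_down
    )


def suspect_shared_cause(outages, threshold=2):
    """Several hosts dropping at once hints at a shared cause, such as
    a tripped breaker, rather than independent hardware failures.
    The threshold of 2 is an arbitrary illustrative value."""
    return len(outages) >= threshold


# Hypothetical snapshot: srv35 is down for planned work, srv31-34 are not.
status = {
    "srv30": True,
    "srv31": False,
    "srv32": False,
    "srv33": False,
    "srv34": False,
    "srv35": False,
}
maintenance = {"srv35"}

down = unscheduled_outages(status, maintenance)
if suspect_shared_cause(down):
    print("Check power for:", ", ".join(down))
```

Run once a minute against real reachability data, something of this shape is enough to flag four hosts going dark together, which is the cue to go and look at the panel.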
Can someone explain all this? Is the whole thing just completely
beyond the resources available to manage it?
Best regards,
Robert