Hmm:
On Mon, Feb 23, 2009 at 9:04 PM, Russell Blau russblau@hotmail.com wrote:
- Within the last hour, the server log at
http://wikitech.wikimedia.org/wiki/Server_admin_log indicates that Rob found and fixed the cause of srv31 (and srv32-34) being down -- a circuit breaker was tripped in the data center.
So we conclude that
Feb 12th: a breaker trips, taking four servers offline
(8 days go by, with a number of reports)
Feb 20th: it is noted that srv31 is down (was it also noted that AC power is off?)
(3 days go by)
Feb 23rd: the tripped breaker is found, srv31 restarted (and 8+ hours later, the dumps have not resumed)
Really? I mean, is this for real?
The sequence ought to be something like: a breaker trips, and within a minute or two the monitor shows that four servers are offline and not scheduled to be. In the next 5 minutes someone looks at the server(s), notes that there is no AC power, walks directly to the panel and resets the breaker. How is this *not* done? I'm sorry, I just don't get it. I've run data centres, and it just is not possible to have servers down for lack of AC power for more than a few minutes unless there is a fault one can't locate. (Or the grid is down and you are running a subset on the generators ;-)
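To make the first step of that sequence concrete, here is a minimal sketch in Python of the kind of check I have in mind. It is an illustration only: the function names, the threshold, and the sample status table are my own assumptions, and it says nothing about how the actual monitoring is set up. It flags hosts that are down without being scheduled to be, and hints at a shared cause when several go down together.

```python
from typing import Dict, Iterable, List


def unscheduled_outages(is_up: Dict[str, bool],
                        scheduled_down: Iterable[str]) -> List[str]:
    """Return hosts that are down but were not scheduled to be, sorted by name."""
    planned = set(scheduled_down)
    return sorted(host for host, up in is_up.items()
                  if not up and host not in planned)


def alert_message(down: List[str], shared_cause_threshold: int = 3) -> str:
    """Build a one-line alert; the threshold is an arbitrary illustrative value."""
    if not down:
        return "OK: no unscheduled outages"
    msg = "ALERT: unscheduled outage: " + ", ".join(down)
    # Several hosts failing together suggests something they share,
    # such as a breaker or a switch, rather than separate faults.
    if len(down) >= shared_cause_threshold:
        msg += " (several hosts at once; check shared power or network first)"
    return msg


if __name__ == "__main__":
    # Hypothetical snapshot: srv35 is down for planned maintenance.
    status = {"srv30": True, "srv31": False, "srv32": False,
              "srv33": False, "srv34": False, "srv35": False}
    print(alert_message(unscheduled_outages(status, scheduled_down=["srv35"])))
```

Run every minute or so against real up/down data, a check like this would presumably have pointed at the four servers, and at a common cause, on the first day.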
Can someone explain all this? Is the whole thing just completely beyond the resources available to manage it?
Best regards, Robert