Re: [Wikitech-l] Dump processes seem to be dead

24 Feb 2009

Hmm:
On Mon, Feb 23, 2009 at 9:04 PM, Russell Blau russblau@hotmail.com wrote:
...

Within the last hour, the server log at
http://wikitech.wikimedia.org/wiki/Server_admin_log indicates that Rob found
and fixed the cause of srv31 (and srv32-34) being down -- a circuit breaker
was tripped in the data center.
So we conclude that:
Feb 12th: a breaker trips, taking four servers offline
(8 days go by, with a number of reports)
Feb 20th: it is noted that srv31 is down (noted that AC is off?)
(3 days go by)
Feb 23rd: the tripped breaker is found, srv31 restarted (and 8+ hours
later, the dumps have not resumed)
Really? I mean, is this for real?
The sequence ought to be something like: breaker trips, monitor shows
within a minute or two that 4 servers are offline, and not scheduled
to be. In the next 5 minutes someone looks at the server(s), notes
that there is no AC power, walks directly to the panel and resets the
breaker. How is this *not* done? I'm sorry, I just don't get it. I've
run data centres, and it just is not possible to have servers down for
lack of AC power for more than a few minutes unless there is a fault one
can't
locate. (Or grid down, and running a subset on the generators ;-)
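To make the monitoring step concrete, here is a minimal sketch of the check I have in mind: compare the hosts that should be up against the hosts that are answering, and ignore anything with scheduled downtime. The function names, the host inventory, and the threshold of three are my own assumptions for illustration (only srv31-srv34 come from this thread); this is not a description of how Wikimedia's monitoring actually works.

```python
def unscheduled_outages(expected, responding, scheduled_down):
    """Return hosts that should be up but are not answering (sorted)."""
    return sorted(set(expected) - set(responding) - set(scheduled_down))


def looks_like_shared_fault(down_hosts, threshold=3):
    """Several hosts dropping together hints at a shared cause (power, switch).

    The threshold is an arbitrary illustrative value.
    """
    return len(down_hosts) >= threshold


# Hypothetical inventory; srv31-srv34 are the hosts named in this thread,
# srv30 and srv35 are made up to give the example something healthy.
expected = ["srv30", "srv31", "srv32", "srv33", "srv34", "srv35"]
responding = ["srv30", "srv35"]
scheduled_down = []

down = unscheduled_outages(expected, responding, scheduled_down)
if down:
    print("ALERT: unscheduled outage:", ", ".join(down))
    if looks_like_shared_fault(down):
        print("Several hosts down together: check power and network first")
```

Run once a minute from a box on a different circuit, something this simple would flag four servers vanishing at once long before anyone had to report it by hand.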
Can someone explain all this? Is the whole thing just completely
beyond the resources available to manage it?
Best regards,
Robert
