[Labs-l] Outage of labs in progress (resolved)

Fri Nov 7 06:45:22 UTC 2014

That sounds good. This story provides a good illustration of the usefulness
of having spares of critical parts.

I've experienced my share of hardware mysteries. Today's mystery for me
involved a router.

Thank you again to those of you who were the hardware medics today.

Regards,

Pine
On Nov 6, 2014 7:15 PM, "Marc A. Pelletier" <marc at uberbox.org> wrote:

> On 11/06/2014 05:13 PM, Pine W wrote:
> > It will be interesting to see the post-action report and recommendations
> > for prevention, if possible.
>
> There is, in the end, very little that can be done to prevent freak
> failures of this sort; they are thankfully rare but basically impossible
> to predict.
>
> The disk shelves have a lot of redundancy, but the two channels can be
> used either to multipath to a single server, or to wire two distinct
> servers; we chose the latter because servers - as a whole - have a lot
> more moving parts and a much shorter MTBF.  This makes us more
> vulnerable to the rarer failure of the communication path, and much less
> vulnerable to the server /itself/ having a failure of some sort.
>
> This time, we were just extremely unlucky.  Cabling rarely fails if it
> worked at all, and the chances that one would suddenly stop working
> right after a year of use are ridiculously low.  This is why it took
> quite a bit of time to even /locate/ the fault: we tried pretty much
> everything /else/ first given how improbable a cable fault is.  The
> actual fix took less than 15 minutes all told; the roughly three hours
> prior were spent trying to find the fault everywhere else first.
>
> I'm not sure there's anything we could have done differently, or that we
> should do differently in the future.  We were able to diagnose the
> problem at all because we had pretty much all the hardware in duplicate at
> the DC, and had we not isolated the fault we could still have fired up
> the backup server (once we had eliminated the shelves themselves as
> being faulty).
>
> The only thing we're missing right now is a spare disk enclosure; if we
> had had a failed shelf we would have been stuck having to wait for a
> replacement from the vendor rather than being able to simply swap the
> hardware on the spot.  That's an issue that I will raise at the next
> operations meeting.
>
> -- Marc