[Labs-l] Outage of labs in progress (resolved)

John phoenixoverride at gmail.com
Fri Nov 7 13:21:01 UTC 2014


I must also congratulate the ops team on this; tracking down a failed cable
is almost worse than finding a needle in a haystack. Doing so in less than 4
hours is phenomenal. I've only seen a small handful of cases where a cable
was the cause of a failure.

On Fri, Nov 7, 2014 at 7:27 AM, Petr Bena <benapetr at gmail.com> wrote:

> Shit happens.
>
> Given that this outage was resolved so quickly, I nearly didn't even
> notice it. Good job, devops!
>
> On Fri, Nov 7, 2014 at 7:45 AM, Pine W <wiki.pine at gmail.com> wrote:
> > That sounds good. This story provides a good illustration of the
> usefulness
> > of having spares of critical parts.
> >
> > I've experienced my share of hardware mysteries. Today's mystery for me
> > involved a router.
> >
> > Thank you again to those of you who were the hardware medics today.
> >
> > Regards,
> >
> > Pine
> >
> > On Nov 6, 2014 7:15 PM, "Marc A. Pelletier" <marc at uberbox.org> wrote:
> >>
> >> On 11/06/2014 05:13 PM, Pine W wrote:
> >> > It will be interesting to see the post-action report and
> recommendations
> >> > for prevention, if possible.
> >>
> >> There is, in the end, very little that can be done to prevent freak
> >> failures of this sort; they are thankfully rare but basically impossible
> >> to predict.
> >>
> >> The disk shelves have a lot of redundancy, but the two channels can be
> >> used either to multipath to a single server, or to wire up two distinct
> >> servers; we chose the latter because servers - as a whole - have many
> >> more moving parts and a much shorter MTBF.  This makes us more
> >> vulnerable to the rarer failure of the communication path, and much less
> >> vulnerable to the server /itself/ having a failure of some sort.
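The trade-off Marc describes can be made concrete with some back-of-the-envelope reliability arithmetic. This is only a sketch: the failure probabilities below are invented for illustration and are not taken from the thread.

```python
# Hypothetical annual failure probabilities (illustrative assumptions only).
p_server = 0.02   # a whole server failing (many moving parts, short MTBF)
p_path   = 0.001  # a single cable/communication path failing

# Option A - multipath both shelf channels to one server:
# storage access is lost if the server fails, or if both paths fail.
p_loss_multipath = p_server + (1 - p_server) * p_path ** 2

# Option B - wire each channel to a distinct server:
# each server+path pair fails independently; access is lost only if
# both pairs are down at once.
p_pair = 1 - (1 - p_server) * (1 - p_path)
p_loss_two_servers = p_pair ** 2

print(f"multipath to one server: {p_loss_multipath:.6f}")
print(f"two distinct servers:    {p_loss_two_servers:.6f}")
```

With these assumed numbers, the two-server wiring loses access far less often, because a single whole-server failure no longer takes the storage down with it; the price, as Marc notes, is greater exposure to the rarer path failures.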
> >>
> >> This time, we were just extremely unlucky.  Cabling rarely fails if it
> >> worked at all, and the chance that a cable would suddenly stop working
> >> after a year of use is ridiculously low.  This is why it took
> >> quite a bit of time to even /locate/ the fault: we tried pretty much
> >> everything /else/ first given how improbable a cable fault is.  The
> >> actual fix took less than 15 minutes all told; the roughly three hours
> >> prior were spent trying to find the fault everywhere else first.
> >>
> >> I'm not sure there's anything we could have done differently, or that we
> >> should do differently in the future.  We were able to diagnose the
> >> problem at all because we had pretty much all of the hardware duplicated
> >> at the DC, and had we not isolated the fault we could still have fired up
> >> the backup server (once we had eliminated the shelves themselves as
> >> being faulty).
> >>
> >> The only thing we're missing right now is a spare disk enclosure; if we
> >> had had a failed shelf we would have been stuck having to wait for a
> >> replacement from the vendor rather than being able to simply swap the
> >> hardware on the spot.  That's an issue that I will raise at the next
> >> operations meeting.
> >>
> >> -- Marc
> >>
> >>
> >> _______________________________________________
> >> Labs-l mailing list
> >> Labs-l at lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/labs-l
> >
> >
>

