[Labs-l] Outage of labs in progress (resolved)

Petr Bena benapetr at gmail.com
Fri Nov 7 12:27:41 UTC 2014


Shit happens.

Given that this outage was resolved so quickly, I nearly didn't even
notice it. Good job, devops!

On Fri, Nov 7, 2014 at 7:45 AM, Pine W <wiki.pine at gmail.com> wrote:
> That sounds good. This story provides a good illustration of the usefulness
> of having spares of critical parts.
>
> I've experienced my share of hardware mysteries. Today's mystery for me
> involved a router.
>
> Thank you again to those of you who were the hardware medics today.
>
> Regards,
>
> Pine
>
> On Nov 6, 2014 7:15 PM, "Marc A. Pelletier" <marc at uberbox.org> wrote:
>>
>> On 11/06/2014 05:13 PM, Pine W wrote:
>> > It will be interesting to see the post-action report and recommendations
>> > for prevention, if possible.
>>
>> There is, in the end, very little that can be done to prevent freak
>> failures of this sort; they are thankfully rare but basically impossible
>> to predict.
>>
>> The disk shelves have a lot of redundancy, but the two channels can be
>> used either to multipath to a single server or to wire two distinct
>> servers; we chose the latter because servers, as a whole, have a lot
>> more moving parts and a much shorter MTBF.  This makes us more
>> vulnerable to the rarer failure of the communication path, and much less
>> vulnerable to the server /itself/ having a failure of some sort.
>>
>> This time, we were just extremely unlucky.  Cabling rarely fails if it
>> worked at all, and the chances that a cable would suddenly stop working
>> after a year of use are ridiculously low.  This is why it took
>> quite a bit of time to even /locate/ the fault: we tried pretty much
>> everything /else/ first, given how improbable a cable fault is.  The
>> actual fix took less than 15 minutes all told; the roughly three hours
>> prior were spent trying to find the fault everywhere else first.
>>
>> I'm not sure there's anything we could have done differently, or that we
>> should do differently in the future.  We were able to diagnose the
>> problem at all because we had pretty much all the hardware in duplicate at
>> the DC, and had we not isolated the fault we could still have fired up
>> the backup server (once we had eliminated the shelves themselves as
>> being faulty).
>>
>> The only thing we're missing right now is a spare disk enclosure; if we
>> had had a failed shelf we would have been stuck having to wait for a
>> replacement from the vendor rather than being able to simply swap the
>> hardware on the spot.  That's an issue that I will raise at the next
>> operations meeting.
>>
>> -- Marc
>>
>>
>
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
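
For the curious, here is a small back-of-the-envelope sketch (in Python) of
the MTBF tradeoff Marc describes above. The MTBF figures below are made-up
placeholders, not measurements from the actual Labs hardware; the point is
only that when servers fail far more often than cables, wiring the two shelf
channels to two distinct servers covers the more likely failure mode.

import math

HOURS_PER_YEAR = 24 * 365

def p_fail_within(mtbf_hours, period_hours=HOURS_PER_YEAR):
    # Chance of at least one failure during the period, assuming a
    # simple exponential failure model (constant rate = 1 / MTBF).
    return 1 - math.exp(-period_hours / mtbf_hours)

# Hypothetical MTBF figures (NOT numbers from the Labs hardware):
mtbf_server = 5 * HOURS_PER_YEAR    # whole server: many moving parts
mtbf_cable = 50 * HOURS_PER_YEAR    # shelf cabling: rarely fails once working

print("P(server fails within a year): %.1f%%" % (100 * p_fail_within(mtbf_server)))
print("P(cable fails within a year):  %.1f%%" % (100 * p_fail_within(mtbf_cable)))

# Multipathing both channels to one server masks a cable failure
# automatically, but not a server failure.  Wiring the channels to two
# distinct servers covers the (much more likely) server failure instead,
# at the cost of a manual recovery when the rarer path failure happens,
# which is exactly the unlucky case described above.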


