<p dir="ltr">That sounds good. This story provides a good illustration of the usefulness of having spares of critical parts.</p>

<p dir="ltr">I've experienced my share of hardware mysteries. Today's mystery for me involved a router.</p>

<p dir="ltr">Thank you again to those of you who were the hardware medics today.</p>

<p dir="ltr">Regards,</p>

<p dir="ltr">Pine</p>

<div class="gmail_quote">On Nov 6, 2014 7:15 PM, "Marc A. Pelletier" <<a href="mailto:marc@uberbox.org">marc@uberbox.org</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 11/06/2014 05:13 PM, Pine W wrote:<br>

> It will be interesting to see the post-action report and recommendations<br>

> for prevention, if possible.<br>

<br>

There is, in the end, very little that can be done to prevent freak<br>

failures of the sort; they are thankfully rare but basically impossible<br>

to predict.<br>

<br>

The disk shelves have a lot of redundancy, but the two channels can be<br>

used either to multipath to a single server, or to wire two distinct<br>

servers; we chose the latter because servers - as a whole - have a lot<br>

more moving parts and a much shorter MTBF.  This makes us more<br>

vulnerable to the rarer failure of the communication path, and much less<br>

vulnerable to the server /itself/ having a failure of some sort.<br>

<br>

This time, we were just extremely unlucky.  Cabling rarely fails if it<br>

worked at all, and the chances that one would suddenly stop working<br>

right after a year of use are ridiculously low.  This is why it took<br>

quite a bit of time to even /locate/ the fault: we tried pretty much<br>

everything /else/ first given how improbable a cable fault is.  The<br>

actual fix took less than 15 minutes all told; the roughly three hours<br>

prior were spent trying to find the fault everywhere else first.<br>

<br>

I'm not sure there's anything we could have done differently, or that we<br>

should do differently in the future.  We were able to diagnose the<br>

problem at all because we had pretty much all the hardware in double at<br>

the DC, and had we not isolated the fault we could still have fired up<br>

the backup server (once we had eliminated the shelves themselves as<br>

being faulty).<br>

<br>

The only thing we're missing right now is a spare disk enclosure; if we<br>

had had a failed shelf we would have been stuck having to wait for a<br>

replacement from the vendor rather than being able to simply swap the<br>

hardware on the spot.  That's an issue that I will raise at the next<br>

operations meeting.<br>

<br>

-- Marc<br>

<br>

<br>


</blockquote></div>