<div dir="ltr">Interesting how the failure of a $5 (maybe?) part can result in Labs going offline. It will be interesting to see the post-action report and recommendations for prevention, if possible.<br><br>Thanks to those who were scrambling to get this fixed! I'm familiar with what happens when there's a hardware failure and there's a big scramble, having been in that position myself more than once.<br></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div dir="ltr"><div><div><div><span style="background-color:rgb(217,234,211)"><span style="background-color:rgb(255,255,255)"><font style="white-space:nowrap;color:#000000"><font color="#01796F"><span style="color:rgb(0,0,0)">Pine</span><b><br><br></b></font></font></span><span style="color:rgb(102,102,102)"></span></span></div><span style="color:rgb(102,102,102)"></span><span style="color:rgb(102,102,102)"><span style="color:rgb(153,153,153)"><a href="https://www.wikipedia.org/" target="_blank"><u>This is an Encyclopedia</u></a></span><i><br>One gateway to the wide garden of knowledge, where lies <br>The deep rock of our past, in which we must delve <br>The well of our future,<br>The clear water we must leave untainted for those who come after us,<br>The fertile earth, in which truth may grow in bright places, tended by many hands,<br>And the broad fall of sunshine, warming our first steps toward knowing how much we do not know.<br></i></span><span style="color:rgb(102,102,102)"><i><span style="color:rgb(204,204,204)"><span style="color:rgb(102,102,102)"><i>—</i>Catherine Munro</span><br></span><br></i></span><br></div></div></div></div></div>

<br><div class="gmail_quote">On Thu, Nov 6, 2014 at 12:38 PM, Marc A. Pelletier <span dir="ltr">&lt;<a href="mailto:marc@uberbox.org" target="_blank">marc@uberbox.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 11/06/2014 02:38 PM, Andrew Bogott wrote:<br>

> Coren will follow up shortly with a full description of the problem and<br>

> advice about what (if anything) you may need to do to resurrect your jobs.<br>

<br>

</span>Hello again.<br>

<br>

NFS is back online.  The short story is that around 16:25, the NFS<br>

server went down instantly without explanation; it turns out that having<br>

nearly half the disks used by the backing filesystem disappear entirely<br>

is not something the OS can recover from or even react usefully to.<br>

<br>

Upon rebooting the system, two of the disk shelves (about 2/5 of the<br>

total storage) were invisible to the operating system, with it reporting<br>

hardware failures.<br>

<br>

After a long investigation with Chris (who was our hands in the<br>

datacenter), we finally managed to isolate the fault to - of all things<br>

- a cable connecting the disk shelves with each other.<br>

<br>

Around 19:30, after having replaced the faulty cabling, the NFS server<br>

came back online and returned to normal operation.<br>

<br>

The filesystem backing Labs being gone for over three hours has a number<br>

of effects of varying impact: anything that was trying to read from or<br>

write to the filesystems will have stalled for the entire period, possibly<br>

causing timeouts on things that depend on them.  Cron jobs (and other<br>

scheduled events) may have piled up, causing immense load and possibly<br>

memory outages on most instances.<br>

<br>

For non-tools labs users: Most of those effects will subside on their<br>

own as things catch up and recover, but you may need to reboot some<br>

instances that managed to cause OOMs on critical system daemons.<br>

<br>

For tool labs users: the infrastructure of tools /itself/ is back to<br>

normal, but there are impacts that may have affected jobs you had in<br>

progress:<br>

<br>

- cron jobs that were delayed during the outage will fire at most once<br>

- continuous jobs may have failed to restart properly, after running out<br>

of memory as things piled up; they may have ended up in error state as a<br>

consequence.<br>

<br>

If you are using bigbrother to monitor your jobs and/or webservices, it<br>

will reschedule your jobs as soon as the load allows, and you should have<br>

little or nothing to do yourself.<br>

<br>

Jobs and tasks that do not depend on the filesystem may have survived<br>

entirely unscathed.  In particular, the databases were not affected by<br>

the outage.<br>

<div class="HOEnZb"><div class="h5"><br>

-- Marc<br>

<br>

<br>

_______________________________________________<br>

Labs-l mailing list<br>

<a href="mailto:Labs-l@lists.wikimedia.org">Labs-l@lists.wikimedia.org</a><br>

<a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" target="_blank">https://lists.wikimedia.org/mailman/listinfo/labs-l</a><br>

</div></div></blockquote></div><br></div>