[Labs-l] Outage of labs in progress (resolved)

Thu Nov 6 20:38:26 UTC 2014

On 11/06/2014 02:38 PM, Andrew Bogott wrote:
> Coren will follow up shortly with a full description of the problem and
> advice about what (if anything) you may need to do to resurrect your jobs.

Hello again.

NFS is back online.  The short story is that around 16:25, the NFS
server went down instantly without explanation; it turns out that having
nearly half the disks used by the backing filesystem disappear entirely
is not something the OS can recover from or even react usefully to.

Upon rebooting the system, two of the disk shelves (about 2/5 of the
total storage) were invisible to the operating system, which reported
hardware failures.

After a long investigation with Chris (who was our hands in the
datacenter), we finally managed to isolate the fault to - of all things
- a cable connecting the disk shelves with each other.

Around 19:30, after we replaced the faulty cabling, the NFS server
came back online and returned to normal operation.

The filesystem backing Labs being gone for over three hours has a number
of effects of varying impact: anything that was trying to read from or
write to the filesystem will have stalled for the entire period, possibly
causing timeouts in things that depend on it.  Cron jobs (and other
scheduled events) may have piled up, causing immense load and possibly
out-of-memory conditions on most instances.

For non-tools labs users: Most of those effects will subside on their
own as things catch up and recover, but you may need to reboot some
instances that managed to cause OOMs on critical system daemons.

For tool labs users: the infrastructure of tools /itself/ is back to
normal, but there are impacts that may have affected jobs you had in
progress:

- cron jobs that were delayed during the outage will fire at most once;
- continuous jobs may have failed to restart properly after running out
  of memory as things piled up; they may have ended up in error state as
  a consequence.

If you are using bigbrother to monitor your jobs and/or webservices, it
will reschedule your jobs as soon as the load allows and you should have
little or nothing to do yourself.

Jobs and tasks that do not depend on the filesystem may have survived
entirely unscathed.  In particular, the databases were not affected by
the outage.

-- Marc