[Labs-l] Outage of labs in progress (resolved)

Pine W wiki.pine at gmail.com
Thu Nov 6 22:13:24 UTC 2014


Interesting how the failure of a $5 (maybe?) part can take Labs offline. I
look forward to the post-action report and, where possible, recommendations
for prevention.

Thanks to those who scrambled to get this fixed! I know firsthand what
happens when a hardware failure sets off a big scramble, having been in
that position myself more than once.

Pine

*This is an Encyclopedia* <https://www.wikipedia.org/>

*One gateway to the wide garden of knowledge, where lies
The deep rock of our past, in which we must delve
The well of our future,
The clear water we must leave untainted for those who come after us,
The fertile earth, in which truth may grow in bright places, tended by many hands,
And the broad fall of sunshine, warming our first steps toward knowing how much we do not know.*

*—Catherine Munro*

On Thu, Nov 6, 2014 at 12:38 PM, Marc A. Pelletier <marc at uberbox.org> wrote:

> On 11/06/2014 02:38 PM, Andrew Bogott wrote:
> > Coren will follow up shortly with a full description of the problem and
> > advice about what (if anything) you may need to do to resurrect your
> > jobs.
>
> Hello again.
>
> NFS is back online.  The short story is that around 16:25, the NFS
> server went down instantly without explanation; it turns out that having
> nearly half the disks used by the backing filesystem disappear entirely
> is not something the OS can recover from or even react usefully to.
>
> After the system was rebooted, two of the disk shelves (about 2/5 of the
> total storage) remained invisible to the operating system, which
> reported hardware failures for them.
>
> After a long investigation with Chris (who was our hands in the
> datacenter), we finally managed to isolate the fault to - of all things
> - a cable connecting the disk shelves with each other.
>
> Around 19:30, after the faulty cabling had been replaced, the NFS server
> went back online and returned to normal operation.
>
> The filesystem backing Labs being gone for over three hours has a number
> of effects of varying impact: anything that was trying to read from or
> write to the filesystems will have stalled for the entire period,
> possibly causing timeouts in things that depend on them.  Cron jobs (and
> other scheduled events) may have piled up, causing immense load and
> possibly memory exhaustion on most instances.
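
Processes that were blocked on the unavailable NFS mounts typically show up
in uninterruptible sleep ("D" state) until their I/O completes. As a rough
illustration only, and not something from the original report, here is a
minimal Python sketch that lists such processes by scanning /proc on a
Linux instance:

    import os

    def stuck_processes():
        """Return (pid, command) pairs for processes in uninterruptible sleep."""
        stuck = []
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/stat" % pid) as f:
                    stat = f.read()
            except (IOError, OSError):  # the process exited while we scanned
                continue
            # The command name sits in parentheses and may contain spaces; the
            # single-letter state is the first field after the closing ")".
            comm = stat[stat.index("(") + 1:stat.rindex(")")]
            state = stat[stat.rindex(")") + 2]
            if state == "D":
                stuck.append((int(pid), comm))
        return stuck

    if __name__ == "__main__":
        for pid, comm in stuck_processes():
            print("%d\t%s" % (pid, comm))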
>
> For non-tools labs users: most of those effects will subside on their
> own as things catch up and recover, but you may need to reboot instances
> where out-of-memory kills took down critical system daemons.
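
Whether an instance needs that reboot can often be confirmed from the
kernel log. The sketch below is an illustration under stated assumptions,
not part of the report: it looks for the kernel's OOM-killer messages, and
the log path is a Debian/Ubuntu convention that may differ on your
instance.

    # The log path is an assumption (Debian/Ubuntu keep kernel messages in
    # /var/log/kern.log); adjust it for your instance if needed.
    KERN_LOG = "/var/log/kern.log"

    def oom_events(path=KERN_LOG):
        """Return kernel log lines indicating that the OOM killer fired."""
        events = []
        with open(path) as log:
            for line in log:
                if "invoked oom-killer" in line or "Out of memory" in line:
                    events.append(line.rstrip())
        return events

    if __name__ == "__main__":
        for event in oom_events():
            print(event)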
>
> For tool labs users: the infrastructure of tools /itself/ is back to
> normal, but there are impacts that may have affected jobs you had in
> progress:
>
> - cron jobs that were delayed during the outage will fire at most once
> - continuous jobs may have failed to restart properly after running out
> of memory as things piled up, and may have ended up in an error state as
> a consequence (see the sketch below for spotting those).
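
For the grid side, the sketch below is a hedged illustration rather than
official guidance: assuming the standard Grid Engine tools (qstat, qmod)
are available, it lists your jobs whose state column contains "E". Whether
clearing the error flag with qmod -cj or resubmitting is appropriate
depends on the job, so check the error reason (qstat -j <id>) first.

    import getpass
    import subprocess

    def jobs_in_error_state(user):
        """Return job IDs from 'qstat -u <user>' whose state column has an E."""
        out = subprocess.check_output(["qstat", "-u", user]).decode()
        errored = []
        for line in out.splitlines()[2:]:   # skip the header and separator rows
            fields = line.split()
            if len(fields) >= 5 and "E" in fields[4]:
                errored.append(fields[0])   # job-ID is the first column
        return errored

    if __name__ == "__main__":
        for job_id in jobs_in_error_state(getpass.getuser()):
            # "qmod -cj <id>" asks the scheduler to clear the error flag and
            # retry; inspect the job first with "qstat -j <id>".
            print("job %s is in error state; consider 'qmod -cj %s'"
                  % (job_id, job_id))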
>
> If you are using bigbrother to monitor your jobs and/or webservices, it
> will reschedule your jobs as soon as the load allows, and you should
> have little or nothing to do yourself.
>
> Jobs and tasks that do not depend on the filesystem may have survived
> entirely unscathed.  In particular, the databases were not affected by
> the outage.
>
> -- Marc
>
>
>