[Labs-l] Another labs outage - curse of the accursed hardware failure continues

Fri Feb 27 06:12:23 UTC 2015

Hello!

This is a repeat of the failure that happened a few days ago. The
underlying hardware is flaky, and andrewbogott is looking into it now.

== Why is everything so terrible? ==

Labs instances are Virtual Machines that run on physical hardware.
When the underlying hardware dies, the virtual machines on it also
die. This is similar to AWS or other cloud providers. We had one spare
machine (virt1012) in case any of the machines currently in use died
and needed a lifeboat.

A week or so ago one of the machines (virt1005) died, and we migrated
things to virt1012. This week, the new machine, virt1012, has been
having issues, and that's why the outages are all so similar. So the
current instability is basically caused by *two* different
hardware-related issues happening to two different machines with
different configurations.

IT IS A CURSE!

== Making things better? ==

We're adding more hardware. https://phabricator.wikimedia.org/T90783
is the ticket for that.

And specifically for toollabs, it would be awesome for it to be able
to survive one virt* node being down. This is not an easy problem to
solve, but here's the tracking ticket for it:
https://phabricator.wikimedia.org/T90542

Andrew is working through his night (again) to diagnose / fix this
issue (thanks!) and we'll keep you updated as things progress. Thank
you for your patience.

-- 
Yuvi Panda T
http://yuvi.in/blog