[Labs-l] Another labs outage - curse of the accursed hardware failure continues
Andrew Bogott
abogott at wikimedia.org
Fri Feb 27 20:10:43 UTC 2015
Here's the complete writeup of this outage:
https://wikitech.wikimedia.org/wiki/Incident_documentation/2015027-LabsOutage
Beta admins: Please peruse the list of vulnerable instances at the
bottom of that report. If you identify failure points on that list,
please contact me today to migrate them to different hardware. Also,
I'd encourage you to identify SPOFs (single points of failure)
throughout the project and consider ways to increase redundancy.
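
One rough way to spot SPOFs is to group instances by role and flag any
role whose instances all live on the same physical host. The sketch below
assumes a hand-maintained inventory; the instance names, roles, and host
names are hypothetical placeholders, not taken from the report.

```python
from collections import defaultdict

# Hypothetical inventory: (instance name, role, physical host).
# All names here are illustrative placeholders.
INVENTORY = [
    ("web-01", "web", "host-a"),
    ("web-02", "web", "host-b"),
    ("db-01", "db", "host-c"),
    ("cache-01", "cache", "host-c"),
    ("cache-02", "cache", "host-c"),
]


def find_spofs(inventory):
    """Return {role: host} for roles whose instances all sit on one host."""
    hosts_by_role = defaultdict(set)
    for _name, role, host in inventory:
        hosts_by_role[role].add(host)
    # A role is a single point of failure if only one host backs it.
    return {
        role: next(iter(hosts))
        for role, hosts in hosts_by_role.items()
        if len(hosts) == 1
    }


if __name__ == "__main__":
    for role, host in sorted(find_spofs(INVENTORY).items()):
        print(f"{role}: every instance is on {host}")
```

A role with two instances on the same host (like "cache" here) is still
flagged, since losing that one host takes out both.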
-Andrew
On 2/26/15 10:12 PM, Yuvi Panda wrote:
> Hello!
>
> A repeat of the failure that happened a few days ago. Underlying flaky
> hardware, andrewbogott is looking into it atm.
>
> == Why is everything so terrible? ==
>
> Labs instances are Virtual Machines that run on physical hardware.
> When the underlying hardware dies, the virtual machines on it also
> die. This is similar to AWS or other cloud providers. We had one spare
> machine (virt1012) to serve as a lifeboat in case any of the machines
> currently in use died.
>
> A week or so ago one of the machines (virt1005) died, and we migrated
> things to virt1012. This week, the new machine, virt1012, has been
> having issues, and that's why the outages are all so similar. So the
> current instability is basically caused by *two* different
> hardware-related issues happening to two different machines with
> different configurations.
>
> IT IS A CURSE!
>
> == Making things better? ==
>
> We're adding more hardware. https://phabricator.wikimedia.org/T90783
> is the ticket for that.
>
> And specifically for toollabs, it would be awesome for it to be able
> to survive one virt* node being down. This is not an easy problem to
> solve, but here's the tracking ticket for it:
> https://phabricator.wikimedia.org/T90542
>
> Andrew is working through his night (again) to diagnose / fix this
> issue (thanks!) and we'll keep you updated as things progress. Thank
> you for your patience.
>