[Labs-l] IMPORTANT: Many instances slated for reboot and downtime this weekend

Andrew Bogott abogott at wikimedia.org
Tue Sep 16 15:45:20 UTC 2014


-- Executive Summary:

Many instances will be rebooted at some point this weekend or next 
week.  The full list of instances subject to reboot is here:

https://wikitech.wikimedia.org/wiki/Virt1006_rebuild

Tools and Beta users can ignore this email.


-- The full story:

Sorry about sending two different IMPORTANT emails this week; we 
generally try to keep labs crises to a minimum.  Indeed, this email is 
about avoiding a potential crisis.

The labs server known as 'virt1006' has been acting poorly lately. 
Several times in the last month we've seen instances that live on 
virt1006 get into inconsistent states during reboot... they reboot and 
never come back up, or they stay in a perpetual 'rebooting' state.

So far we've been able to rescue such instances, but the misbehavior of 
a Labs server is very disconcerting.  Rather than wait for a full 
collapse (and resulting sudden death of 50+ VMs) we've decided to 
migrate all instances off of virt1006 and then either rebuild 
the system or discard the hardware.  Moving an instance off of a server 
is fairly painless, but it does require a few minutes of downtime and a 
reboot.

I've spoken to a few of you directly about the reboots; the affected 
Tools and Deployment-prep instances have already been handled. There are 
a lot more to go, though.  If your instance is stable and has its init 
scripts set up properly and a reboot is no big deal, then, 
congratulations!  Otherwise, please take whatever steps you need to take 
to batten down the hatches and get ready for a reboot.

If you need the reboot to happen at a scheduled time while you are 
standing by, that's totally fine.  In that case please schedule a reboot 
window on this page:

https://wikitech.wikimedia.org/wiki/Virt1006_rebuild

Thanks for your cooperation.

-Andrew


