[Labs-l] IMPORTANT: Many instances slated for reboot and downtime this weekend (finished)

Andrew Bogott abogott at wikimedia.org
Sat Sep 20 20:39:03 UTC 2014


This is done now, and all the affected instances are up and running.

-Andrew


> On 9/16/14 10:45 AM, Andrew Bogott wrote:
>> -- Executive Summary:
>>
>> Many instances will be rebooted at some point this weekend or next 
>> week.  The total list of instances subject to reboot is here:
>>
>> https://wikitech.wikimedia.org/wiki/Virt1006_rebuild
>>
>> Tools and Beta users can ignore this email.
>>
>>
>> -- The full story:
>>
>> Sorry about sending two different IMPORTANT emails this week; we 
>> generally try to keep labs crises to a minimum.  Indeed, this email 
>> is about avoiding a potential crisis.
>>
>> The Labs server known as 'virt1006' has been behaving poorly lately. 
>> Several times in the last month we've seen instances that live on 
>> virt1006 get into inconsistent states during reboot: they reboot 
>> and never come back up, or they stay in a perpetual 'rebooting' state.
>>
>> So far we've been able to rescue such instances, but the misbehavior 
>> of a Labs server is very disconcerting.  Rather than wait for a full 
>> collapse (and resulting sudden death of 50+ VMs) we've decided to 
>> migrate all instances off of virt1006 and then either 
>> rebuild the system or discard the hardware. Moving an instance off of 
>> a server is fairly painless, but it does require a few minutes of 
>> downtime and a reboot.
>>
>> I've spoken to a few of you directly about the reboots; the affected 
>> Tools and Deployment-prep instances have already been handled. There 
>> are a lot more to go, though.  If your instance is stable and has its 
>> init scripts set up properly and a reboot is no big deal, then, 
>> congratulations!  Otherwise, please take whatever steps you need to 
>> take to batten down the hatches and get ready for a reboot.
>>
>> If you need the reboot to happen at a scheduled time while you are 
>> standing by, that's totally fine.  In that case please schedule a 
>> reboot window on this page:
>>
>> https://wikitech.wikimedia.org/wiki/Virt1006_rebuild
>>
>> Thanks for your cooperation.
>>
>> -Andrew
>



