[Labs-admin] instance creation outage (fixed)

Sun May 7 17:19:50 UTC 2017

     The fullstack test has been failing since last night, and I noticed
that I was unable to ping newly-created instances this morning.  All of
the designate services were working properly and new DNS records were
registered in the designate database properly, but pdns was reporting
lots of timeouts when updating records.  Additionally, there were other
unfamiliar messages in various logs on labservices1002 (e.g.
designate-central kept saying 'Deadlock detected. Retrying...').
     After restarting and poking many services, I tried a puppet run on
labservices1002 and discovered that it was SUPER slow: over 5 minutes
for a run that should have taken one minute or less.  This looks
similar to an issue that we saw on labvirt1001 a few months ago[1]
(basically IO just started to be really, really slow for no reason),
which was resolved with a reboot.

...so...

     Since there was a pending issue to switch the designate primary
back to labservices1001 anyway[2], I went ahead and did that just
now[3][4].  That fixed the designate/dns issues.  Then (after a
20-minute wait so that labs instances would know that 1001 is the dns
primary again) I rebooted 1002... but puppet runs are still incredibly
slow there.

So, in summary:

- All services are working normally now.
- Labservices1002 still seems ill in a way I have not yet diagnosed.
- Labservices1001 is back to being the primary dns and designate host.

I'm going to be out of the house for most of the afternoon but don't 
hesitate to text or call me if things go haywire again.  I'll also see 
about making a proper incident report out of this email at some point.

-Andrew

[1] https://phabricator.wikimedia.org/T159835
[2] https://phabricator.wikimedia.org/T164014
[3] https://wikitech.wikimedia.org/wiki/Labs_troubleshooting#instance_DNS_failure
[4] https://gerrit.wikimedia.org/r/#/c/352476/