[Labs-admin] instance creation outage (fixed)
Andrew Bogott
abogott at wikimedia.org
Sun May 7 17:19:50 UTC 2017
The fullstack test has been failing since last night, and I noticed
that I was unable to ping newly-created instances this morning. All of
the designate services were working properly and new DNS records were
registered in the designate database properly, but pdns was reporting
lots of timeouts when updating records. Additionally, there were other
unfamiliar messages in various logs on labservices1002 (e.g.
designate-central kept saying 'Deadlock detected. Retrying...').
After restarting and poking many services, I tried a puppet run on
labservices1002 and discovered that it was SUPER slow, over 5 minutes
for a run that should have taken one minute or less. This looks
similar to an issue that we saw on labvirt1001 a few months ago[1]
(basically IO just started to be really slow for no reason),
which was resolved with a reboot.
...so...
Since there was a pending issue to switch the designate primary
back to labservices1001 anyway[2], I went ahead and did that just
now[3][4]. That fixed the designate/dns issues. Then (after a 20
minute wait so that labs instances would know that 1001 is the dns
primary again) I rebooted 1002... but puppet runs are still incredibly
slow there.
So, in summary:
- All services are working normally now.
- Labservices1002 still seems ill in a way I have not yet diagnosed.
- Labservices1001 is back to being the primary dns and designate host.
I'm going to be out of the house for most of the afternoon but don't
hesitate to text or call me if things go haywire again. I'll also see
about making a proper incident report out of this email at some point.
-Andrew
[1] https://phabricator.wikimedia.org/T159835
[2] https://phabricator.wikimedia.org/T164014
[3] https://wikitech.wikimedia.org/wiki/Labs_troubleshooting#instance_DNS_failure
[4] https://gerrit.wikimedia.org/r/#/c/352476/