[Labs-admin] VM creation issues last night
Andrew Bogott
abogott at wikimedia.org
Thu Jul 20 06:31:40 UTC 2017
At some point tonight I noticed that a lot of new VMs (e.g. from
contintcloud) were in state ERROR. It turned out that instances were
being scheduled properly but never actually started running.
I have two theories for what was happening:
1) libvirt was upset about the old certs and refused to start new VMs.
This seems like the most likely explanation, as the libvirtd logs were
full of complaints about expired certs.
2) Maybe nova-network or some other part of the chain was still upset
about LDAP.
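For anyone who wants to rule theory 1 in or out on a given host, here is
a minimal sketch of an expiry check. It assumes you already have the
cert's notAfter string in the form that `openssl x509 -enddate -noout`
prints; the function name and the dates in the usage note are
hypothetical, not taken from the actual certs.

```python
import ssl
import time


def cert_is_expired(not_after, now=None):
    """Return True if the certificate's notAfter time is in the past.

    not_after is a date string such as "Jul 19 00:00:00 2017 GMT"
    (hypothetical example value). now is seconds since the epoch and
    defaults to the current time.
    """
    if now is None:
        now = time.time()
    # ssl.cert_time_to_seconds parses the "Mon DD HH:MM:SS YYYY GMT" form.
    return ssl.cert_time_to_seconds(not_after) <= now
```

Calling cert_is_expired("Jul 19 00:00:00 2017 GMT") on any later date
returns True, which would be consistent with the complaints in the
libvirtd logs; it says nothing about whether libvirt had actually picked
up the replacement certs.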
The only reason I'm not certain about #1 is that I built new certs,
installed them, and spent a long time poking and prodding at things
without any good result... finally I just decided to restart all of the
nova services (scheduler, conductor, network, api, all computes) and
then the system perked up. So maybe the certs were a red herring.
In any case, things seem fine now -- contintcloud is happy and the
fullstack tests are running. As best I can tell no one noticed this
outage -- Jenkins wasn't running tests for a while but I didn't hear any
complaints.
-A