[Labs-admin] VM creation issues last night

Andrew Bogott abogott at wikimedia.org
Thu Jul 20 06:31:40 UTC 2017


At some point tonight I noticed that a lot of new VMs (e.g. from 
contintcloud) were in state ERROR.  It turned out that instances were 
being scheduled properly but never actually started running.

I have two theories for what was happening:

1) libvirt was upset about the old certs and refused to start new VMs.  
This seems like the most likely explanation, as the libvirtd logs were 
full of complaints about expired certs.

2) Maybe nova-network or some other part of the chain was still upset 
about LDAP.
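
A quick way to test theory 1 is to compare a cert's notAfter date 
against the current time. The sketch below is illustrative only: it 
assumes the expiry line comes from `openssl x509 -enddate -noout`, and 
the function name and the sample date are hypothetical, not taken from 
the actual hosts.

```python
import calendar
import ssl
import time


def cert_is_expired(enddate_line, now=None):
    """Return True if the certificate's notAfter time is in the past.

    enddate_line is the text printed by `openssl x509 -enddate -noout`,
    e.g. "notAfter=Jul 19 12:00:00 2017 GMT" (a made-up sample date).
    """
    # Drop the "notAfter=" prefix if it is present.
    not_after = enddate_line.strip().split("=", 1)[-1]
    # Parse "Mon DD HH:MM:SS YYYY GMT" into seconds since the epoch.
    expiry = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return expiry < now


if __name__ == "__main__":
    # Hypothetical check time: 2017-07-20 06:31:40 UTC.
    check_time = calendar.timegm((2017, 7, 20, 6, 31, 40, 0, 0, 0))
    print(cert_is_expired("notAfter=Jul 19 12:00:00 2017 GMT", now=check_time))
```

A check like this only shows whether a cert on disk has expired. It 
does not show whether a running libvirtd has picked up a replacement.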

The only reason I'm not certain about #1 is that I built new certs, 
installed them, and spent a long time poking and prodding at things 
without any good result. Finally I just decided to restart all of the 
nova services (scheduler, conductor, network, api, and all computes), 
and then the system perked up.  So maybe the certs were a red herring.

In any case, things seem fine now -- contintcloud is happy and the 
fullstack tests are running.  As far as I can tell, no one noticed this 
outage -- Jenkins wasn't running tests for a while, but I didn't hear 
any complaints.

-A


