I recently noticed that some of our standard kvm/nova monitoring never got copied over from the labvirt puppet code to the cloudvirt puppet code. Tomorrow I will merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/478113/ to fix that.
Once that patch is merged, icinga will be a bit touchier on the cloudvirts. In particular, it will alert for any cloudvirt that has 0 VMs running on it. (This turns out to be a useful thing to watch for because we've had cases where every single kvm process died at once.)
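To make the "0 VMs" alert concrete, here is a rough sketch of what a Nagios-style check of this kind could look like. This is not the check from the puppet patch; the function name check_vm_count and the 'qemu-system' process pattern are assumptions made for illustration only.

```shell
# Hypothetical sketch, not the actual icinga check from the patch.
# Maps a count of running VM processes to a Nagios-style status:
# exit 0 = OK, exit 2 = CRITICAL.
check_vm_count() {
  count="$1"
  if [ "$count" -gt 0 ]; then
    echo "OK: $count VM process(es) running"
    return 0
  fi
  echo "CRITICAL: no VM processes running"
  return 2
}

# Possible wiring on a virt host (assumes VM processes match 'qemu-system'):
# check_vm_count "$(pgrep -c -f qemu-system)"
```

A host with only a canary running would report OK, which is the whole point of keeping one around.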
So, all 'idle' cloudvirts should nonetheless have a canary instance. For example, on the new analytics cloudvirts I created canaries like this:
$ OS_PROJECT_ID=testlabs openstack server create --image 7c6371d1-8411-48c7-bf73-2ef6d6ff2a15 --flavor m1.small --nic net-id=7425e328-560c-4f00-8e99-706f3fb90bb4 --availability-zone host:cloudvirtan1004 canary-an1004-01
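If several idle hosts need canaries, something like the following could generate the commands. It is only a sketch: the helper names (canary_name, canary_cmd, print_canary_cmds) are made up, and the naming rule (strip the "cloudvirt" prefix, append "-01") is inferred from the single example above rather than from any agreed convention. It prints the commands instead of running them so they can be reviewed first.

```shell
# Sketch only: helper names and the naming rule are assumptions.
# Image and network IDs are the ones used in the example above.
IMAGE=7c6371d1-8411-48c7-bf73-2ef6d6ff2a15
NET=7425e328-560c-4f00-8e99-706f3fb90bb4

# cloudvirtan1004 -> canary-an1004-01 (assumed naming rule)
canary_name() {
  printf 'canary-%s-01\n' "${1#cloudvirt}"
}

# Print (do not run) the create command for one host.
canary_cmd() {
  host="$1"
  printf 'OS_PROJECT_ID=testlabs openstack server create --image %s --flavor m1.small --nic net-id=%s --availability-zone host:%s %s\n' \
    "$IMAGE" "$NET" "$host" "$(canary_name "$host")"
}

# Print one command per host given as an argument.
print_canary_cmds() {
  for host in "$@"; do
    canary_cmd "$host"
  done
}

# Usage: print_canary_cmds cloudvirtan1004
```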
Once a virt host is in full service we can leave the canaries there or delete them -- we haven't had a consistent policy on that.
In related news, I'm attempting to silence alerts for cloudvirt1019 and cloudvirt1020 altogether with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/478115/ because we reboot them twice a day and a reboot always kills any running VMs.
-Andrew
_______________________________________________
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-announce@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce
Sorry, all, this was meant for a different list. Feel free to ignore!
-A
On 12/6/18 5:16 PM, Andrew Bogott wrote: