I just rebooted labservices1001 (our primary DNS server, among other things) from the console after it stopped responding to ssh. As far as I know, ordinary cloud services were unaffected, since DNS resolution should have fallen back to labservices1002 gracefully. Puppet was upset because it explicitly names labservices1001 as its resolver.
I suspect this is a bad fan or heat sink installation -- the failure is tracked at https://phabricator.wikimedia.org/T196252.
As best I can tell, the issue is resolved for now. You can expect a flood of recovery emails from shinken over the next 30 minutes or so.
My real concern, though, is that we didn't get paged when this box locked up. I remain baffled by what pages and what doesn't; does anyone out there know how I can turn on paging for a host and/or subscribe to its alerts?
-A
cloud-admin@lists.wikimedia.org