Toolforge Kubernetes disrupted from 2019-09-10T18:54 to 2019-09-11T01:30 - Cloud-announce

11 Sep 2019

We need to do a proper incident report, but I wanted to send out a
(late) notice that the Toolforge Kubernetes cluster was at best
degraded and at worst completely broken from 2019-09-10T18:54 to
2019-09-11T01:30.

The TL;DR is that some change, likely part of T171188: Move the main
WMCS puppetmaster into the Labs realm, tricked Puppet into installing
an old version of the x509 signing cert used to secure communication
between the etcd cluster and kube-apiserver. This surfaced as an
alert from our monitoring system that the Kubernetes API was broken.
When investigating that alert, we found that the kube-apiserver was
unable to connect to its paired etcd cluster. The etcd cluster seemed
to be flapping internally (status showing good, then failed, then good
again). Our attempts to diagnose this flapping resulted in a complete
failure of the etcd cluster. Restoring the etcd cluster was a long and
difficult task. Once etcd was recovered, it took about 1.5 more hours
to find the cause of, and the fix for, the initial communication
errors (the wrong x509 signing certificate). It is currently unclear
whether the x509 misconfiguration also caused the etcd cluster
failure, or whether that was an unrelated and unfortunate coincidence.

See https://phabricator.wikimedia.org/T232536 for follow-up
documentation (when we write it during the coming US business day).

Bryan - on behalf of the Toolforge admin team
-- 
Bryan Davis              Technical Engagement      Wikimedia Foundation
Principal Software Engineer                               Boise, ID USA
[[m:User:BDavis_(WMF)]]                                      irc: bd808