I wanted to share an interesting failure I just saw on the Toolforge cluster. The order of events went roughly like this:
1. builds-api had a change merged that was never deployed to the live clusters. That change only affected local development environments; I assume that's why it was never deployed, although it's also possible that the person merging the change simply forgot. This published builds-api 0.0.131.
2. The Harbor expiration policy noticed that builds-api 0.0.131 exists and pruned the images for 0.0.130.
3. The certificates used for communication between the API gateway and builds-api were renewed by cert-manager, which triggered an automatic restart of the builds-api deployment.
4. The new builds-api pods failed to start, as the image they referenced no longer existed.
Now, in this case, Kubernetes worked as expected: it noticed that the new deployment did not come up, stopped restarting any further pods, and did not send any traffic to the single restarted pod. However, the ticking time bomb of the expiring certificates remained: the API would have gone down once the old certs expired, and any node restarts would have risked taking the entire thing down.
I filed a few tasks, mostly about noticing these kinds of issues automatically:

* https://phabricator.wikimedia.org/T358908 Alert when toolforge-deploy changes are not deployed
* https://phabricator.wikimedia.org/T358909 Alert when admin managed pods are having issues

In addition, we should consider setting up explicit PodDisruptionBudgets for the admin services we manage.
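To make the PodDisruptionBudget idea concrete, here is a minimal sketch using the kubernetes Python client. The namespace, PDB name, and pod labels are assumptions on my part, and in practice this would presumably go into the relevant chart in toolforge-deploy rather than being created imperatively:

    # Sketch: keep at least one builds-api pod available during voluntary
    # disruptions such as node drains. Namespace/labels are placeholders.
    from kubernetes import client, config

    config.load_kube_config()

    pdb = client.V1PodDisruptionBudget(
        metadata=client.V1ObjectMeta(name="builds-api", namespace="builds-api"),
        spec=client.V1PodDisruptionBudgetSpec(
            min_available=1,
            selector=client.V1LabelSelector(
                match_labels={"app.kubernetes.io/name": "builds-api"}
            ),
        ),
    )

    client.PolicyV1Api().create_namespaced_pod_disruption_budget(
        namespace="builds-api", body=pdb
    )

With minAvailable set to 1, a node drain would wait instead of evicting the last healthy pod.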
However, what I'm less certain about is how to prevent that missing image in the first place:

* Can we store all release tagged images indefinitely? How much storage space would that take?
* If not, how can we prevent images still in use from just disappearing like that? How do we ensure that rollbacks will always work as expected?
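For the storage question, something like the sketch below could give us a number by summing the artifact sizes that the Harbor v2 API reports for a repository. The project and repository names are placeholders, and since Harbor deduplicates layers shared between images, the real on-disk usage would be lower than this sum:

    # Rough upper bound on the space needed to keep every tagged builds-api
    # image, by summing the artifact sizes Harbor reports.
    import requests

    HARBOR = "https://tools-harbor.wmcloud.org/api/v2.0"
    PROJECT = "toolforge"      # placeholder project name
    REPOSITORY = "builds-api"  # placeholder repository name

    total_bytes = 0
    page = 1
    while True:
        resp = requests.get(
            f"{HARBOR}/projects/{PROJECT}/repositories/{REPOSITORY}/artifacts",
            params={"page": page, "page_size": 100, "with_tag": "true"},
            timeout=30,
        )
        resp.raise_for_status()
        artifacts = resp.json()
        if not artifacts:
            break
        for artifact in artifacts:
            if artifact.get("tags"):  # only count artifacts that still have a tag
                total_bytes += artifact.get("size", 0)
        page += 1

    print(f"{total_bytes / 1024 ** 3:.1f} GiB across tagged artifacts")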
Taavi
On 03/01 23:30, Taavi Väänänen wrote:
> I wanted to share an interesting failure I just saw on the Toolforge
> cluster. The order of events went roughly like this:
>
> 1. builds-api had a change merged that was never deployed to the live
>    clusters. That change only affected local development environments;
>    I assume that's why it was never deployed, although it's also
>    possible that the person merging the change simply forgot. This
>    published builds-api 0.0.131.
> 2. The Harbor expiration policy noticed that builds-api 0.0.131 exists
>    and pruned the images for 0.0.130.
> 3. The certificates used for communication between the API gateway and
>    builds-api were renewed by cert-manager, which triggered an
>    automatic restart of the builds-api deployment.
> 4. The new builds-api pods failed to start, as the image they
>    referenced no longer existed.
>
> Now, in this case, Kubernetes worked as expected: it noticed that the
> new deployment did not come up, stopped restarting any further pods,
> and did not send any traffic to the single restarted pod. However, the
> ticking time bomb of the expiring certificates remained: the API would
> have gone down once the old certs expired, and any node restarts would
> have risked taking the entire thing down.
>
> I filed a few tasks, mostly about noticing these kinds of issues
> automatically:
>
> * https://phabricator.wikimedia.org/T358908 Alert when toolforge-deploy
>   changes are not deployed
> * https://phabricator.wikimedia.org/T358909 Alert when admin managed
>   pods are having issues
>
> In addition, we should consider setting up explicit
> PodDisruptionBudgets for the admin services we manage.
>
> However, what I'm less certain about is how to prevent that missing
> image in the first place:
>
> * Can we store all release tagged images indefinitely? How much
>   storage space would that take?
We should already be keeping the last 10 tags; I suspect some name-parsing bug, since it jumps from 99 to 131 here: https://tools-harbor.wmcloud.org/harbor/projects/1454/repositories/builds-ap...
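As a quick illustration of the kind of bug I mean (purely hypothetical, assuming the retention rule sorts tag names as plain strings rather than by push time or numeric version):

    # Hypothetical: if "retain the most recent N tags" sorted tag names as
    # strings, 0.0.99 would look "newer" than 0.0.131, because "9" > "1".
    tags = ["0.0.99", "0.0.100", "0.0.130", "0.0.131"]

    print(sorted(tags))
    # ['0.0.100', '0.0.130', '0.0.131', '0.0.99']   <- lexicographic order

    print(sorted(tags, key=lambda t: tuple(int(p) for p in t.split("."))))
    # ['0.0.99', '0.0.100', '0.0.130', '0.0.131']   <- numeric order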
Note that toolsbeta does clear storage more aggressively (and does not have immutable tags, iirc).
> * If not, how can we prevent images still in use from just disappearing
>   like that? How do we ensure that rollbacks will always work as
>   expected?
> Taavi
>
> --
> Taavi Väänänen (he/him)
> Site Reliability Engineer, Cloud Services
> Wikimedia Foundation