I wanted to share an interesting failure I just saw on the Toolforge cluster. The order of events went roughly like this:
1. builds-api had a change merged that was never deployed to the live clusters. That change only affected local development environments; I assume that's why it was never deployed, although it's also possible that the person merging the change simply forgot. This published builds-api 0.0.131.
2. The Harbor expiration policy noticed that builds-api 0.0.131 exists and pruned the images for 0.0.130.
3. The certificates used for communication between the API gateway and builds-api were renewed by cert-manager, which triggered an automatic restart of the builds-api deployment.
4. The new builds-api pods failed to start, as the image they referenced no longer existed.
Now, in this case, Kubernetes worked as expected: it noticed that the new deployment did not come up, stopped restarting any further pods, and did not send any traffic to the single restarted pod. However, the ticking time bomb of the expiring certificates remained: the API would have gone down once the old certs expired, and any node restarts would have risked taking the entire thing down.
I filed a few tasks, mostly about noticing these kinds of issues automatically:

* https://phabricator.wikimedia.org/T358908 Alert when toolforge-deploy changes are not deployed
* https://phabricator.wikimedia.org/T358909 Alert when admin managed pods are having issues

In addition, we should consider setting up explicit PodDisruptionBudgets for the admin services we manage.
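To make the PodDisruptionBudget idea concrete, here is a minimal sketch using the kubernetes Python client. The namespace, PDB name, and pod labels are assumptions on my part, and in practice this would presumably go into the relevant chart in toolforge-deploy rather than being created imperatively:

    # Sketch: keep at least one builds-api pod available during voluntary
    # disruptions such as node drains. Namespace/labels are placeholders.
    from kubernetes import client, config

    config.load_kube_config()

    pdb = client.V1PodDisruptionBudget(
        metadata=client.V1ObjectMeta(name="builds-api", namespace="builds-api"),
        spec=client.V1PodDisruptionBudgetSpec(
            min_available=1,
            selector=client.V1LabelSelector(
                match_labels={"app.kubernetes.io/name": "builds-api"}
            ),
        ),
    )

    client.PolicyV1Api().create_namespaced_pod_disruption_budget(
        namespace="builds-api", body=pdb
    )

With minAvailable set to 1, a node drain would wait instead of evicting the last healthy pod.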
However, what I'm less certain about is how to prevent that missing image in the first place:

* Can we store all release tagged images indefinitely? How much storage space would that take?
* If not, how can we prevent images still in use from just disappearing like that? How do we ensure that rollbacks will always work as expected?
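For the storage question, something like the sketch below could give us a number by summing the artifact sizes that the Harbor v2 API reports for a repository. The project and repository names are placeholders, and since Harbor deduplicates layers shared between images, the real on-disk usage would be lower than this sum:

    # Rough upper bound on the space needed to keep every tagged builds-api
    # image, by summing the artifact sizes Harbor reports.
    import requests

    HARBOR = "https://tools-harbor.wmcloud.org/api/v2.0"
    PROJECT = "toolforge"      # placeholder project name
    REPOSITORY = "builds-api"  # placeholder repository name

    total_bytes = 0
    page = 1
    while True:
        resp = requests.get(
            f"{HARBOR}/projects/{PROJECT}/repositories/{REPOSITORY}/artifacts",
            params={"page": page, "page_size": 100, "with_tag": "true"},
            timeout=30,
        )
        resp.raise_for_status()
        artifacts = resp.json()
        if not artifacts:
            break
        for artifact in artifacts:
            if artifact.get("tags"):  # only count artifacts that still have a tag
                total_bytes += artifact.get("size", 0)
        page += 1

    print(f"{total_bytes / 1024 ** 3:.1f} GiB across tagged artifacts")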
Taavi
On 03/01 23:30, Taavi Väänänen wrote:
> I wanted to share an interesting failure I just saw on the Toolforge
> cluster. The order of events went roughly like this:
>
> 1. builds-api had a change merged that was never deployed to the live
>    clusters. That change only affected local development environments;
>    I assume that's why it was never deployed, although it's also
>    possible that the person merging the change simply forgot. This
>    published builds-api 0.0.131.
> 2. The Harbor expiration policy noticed that builds-api 0.0.131 exists
>    and pruned the images for 0.0.130.
> 3. The certificates used for communication between the API gateway and
>    builds-api were renewed by cert-manager, which triggered an
>    automatic restart of the builds-api deployment.
> 4. The new builds-api pods failed to start, as the image they
>    referenced no longer existed.
>
> Now, in this case, Kubernetes worked as expected: it noticed that the
> new deployment did not come up, stopped restarting any further pods,
> and did not send any traffic to the single restarted pod. However, the
> ticking time bomb of the expiring certificates remained: the API would
> have gone down once the old certs expired, and any node restarts would
> have risked taking the entire thing down.
>
> I filed a few tasks, mostly about noticing these kinds of issues
> automatically:
>
> * https://phabricator.wikimedia.org/T358908 Alert when toolforge-deploy
>   changes are not deployed
> * https://phabricator.wikimedia.org/T358909 Alert when admin managed
>   pods are having issues
>
> In addition, we should consider setting up explicit
> PodDisruptionBudgets for the admin services we manage.
>
> However, what I'm less certain about is how to prevent that missing
> image in the first place:
>
> * Can we store all release tagged images indefinitely? How much
>   storage space would that take?
We should already be keeping the last 10 tags; I suspect some name-parsing bug, since it jumps from 99 to 131 here: https://tools-harbor.wmcloud.org/harbor/projects/1454/repositories/builds-ap...
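As a quick illustration of the kind of bug I mean (purely hypothetical, assuming the retention rule sorts tag names as plain strings rather than by push time or numeric version):

    # Hypothetical: if "retain the most recent N tags" sorted tag names as
    # strings, 0.0.99 would look "newer" than 0.0.131, because "9" > "1".
    tags = ["0.0.99", "0.0.100", "0.0.130", "0.0.131"]

    print(sorted(tags))
    # ['0.0.100', '0.0.130', '0.0.131', '0.0.99']   <- lexicographic order

    print(sorted(tags, key=lambda t: tuple(int(p) for p in t.split("."))))
    # ['0.0.99', '0.0.100', '0.0.130', '0.0.131']   <- numeric order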
Note that toolsbeta does clear storage more aggressively (and does not have immutable tags, iirc).
> * If not, how can we prevent images still in use from just disappearing
>   like that? How do we ensure that rollbacks will always work as
>   expected?
> Taavi
>
> --
> Taavi Väänänen (he/him)
> Site Reliability Engineer, Cloud Services
> Wikimedia Foundation