On 03/01 23:30, Taavi Väänänen wrote:
I wanted to share an interesting failure I just saw on the Toolforge
cluster. The order of events went roughly like this:
1. builds-api had a change merged that was never deployed to the live
clusters. That change only affected local development environments. I
assume that's the reason it was never deployed, although an
alternative is that the person merging the change forgot. This
published builds-api 0.0.131.
2. The harbor expiration policy noticed that builds-api 0.0.131
exists, and pruned the images for 0.0.130.
3. The certificates used for communications between the API gateway
and builds-api got renewed by cert-manager, and this triggered an
automatic restart for the builds-api deployment.
4. The new builds-api pods failed to start because the image they
needed no longer existed.
Now, in this case, Kubernetes worked as expected, and noticed that the
new deployment did not come up, and it stopped restarts of any further
pods and did not send any traffic to the single restarted pod.
However, the expiring certificates remained a ticking time bomb: the
API would go down once the old certs expired, and any node restart
would have risked taking the entire thing down.
I filed a few tasks, mostly about noticing these kinds of issues automatically:
* https://phabricator.wikimedia.org/T358908 Alert when
toolforge-deploy changes are not deployed
* https://phabricator.wikimedia.org/T358909 Alert when admin managed
pods are having issues
In addition we should consider setting up explicit
PodDisruptionBudgets for the admin services we manage.
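A minimal sketch of what such a PodDisruptionBudget could look like, built as a Python dict and printed as JSON (kubectl accepts JSON manifests as well as YAML). The name, namespace, label selector and minAvailable value below are assumptions for illustration, not the real builds-api settings:

```python
import json


def pod_disruption_budget(name, namespace, match_labels, min_available=1):
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Voluntary disruptions (e.g. node drains) must leave at least
            # this many matching pods available.
            "minAvailable": min_available,
            "selector": {"matchLabels": dict(match_labels)},
        },
    }


if __name__ == "__main__":
    # Hypothetical values: the real namespace and labels may differ.
    manifest = pod_disruption_budget(
        name="builds-api",
        namespace="builds-api",
        match_labels={"app.kubernetes.io/name": "builds-api"},
    )
    print(json.dumps(manifest, indent=2))
```

Note that a PodDisruptionBudget only limits voluntary evictions such as node drains; it would not bring back a missing image.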
However, what I'm less certain about is how to prevent the image from
going missing in the first place:
* Can we store all release tagged images indefinitely? How much
storage space would that take?
We should already be keeping the last 10 tags. I suspect a name-parsing
bug, since the tag list jumps from 99 to 131 here:
https://tools-harbor.wmcloud.org/harbor/projects/1454/repositories/builds-a…
Note that toolsbeta does clear storage more aggressively (and does not have
immutable tags iirc).
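To illustrate how a name-parsing problem could in principle produce this, here is a small sketch. It is not a diagnosis of what Harbor actually does; it assumes tags are plain dotted numbers and shows how a "keep the latest N" rule picks different tags depending on whether they are compared as strings or as integers. The tag list and helper names are made up for the example:

```python
def version_key(tag):
    """Parse '0.0.131' into (0, 0, 131) so components compare as integers."""
    return tuple(int(part) for part in tag.split("."))


def keep_latest(tags, count, key=None):
    """Return the `count` highest tags under the given sort key."""
    return sorted(tags, key=key, reverse=True)[:count]


if __name__ == "__main__":
    # Hypothetical tag list, not the real repository contents.
    tags = ["0.0.98", "0.0.99", "0.0.130", "0.0.131"]

    # Plain string comparison ranks "0.0.99" above "0.0.131",
    # because "9" sorts after "1" character by character.
    print(keep_latest(tags, 2))

    # Numeric comparison keeps the two releases that are actually newest.
    print(keep_latest(tags, 2, key=version_key))
```

With string ordering the two retained tags are 0.0.99 and 0.0.98, while numeric ordering retains 0.0.131 and 0.0.130.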
* If not, how can we prevent images that are still in use from just
disappearing like that? How do we ensure that rollbacks will always
work as expected?
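One possible shape for such a safeguard, as a rough sketch only: compare the image references that workloads currently use against what the registry still holds, and exclude in-use images from pruning. How the two lists are collected (for example from the Kubernetes and Harbor APIs) is left out here, and the function names and image references are hypothetical:

```python
def missing_images(in_use, available):
    """Return image references that workloads use but the registry lacks."""
    return sorted(set(in_use) - set(available))


def safe_to_prune(candidates, in_use):
    """Return prune candidates that no running workload references."""
    return sorted(set(candidates) - set(in_use))


if __name__ == "__main__":
    # Hypothetical data standing in for real cluster and registry queries.
    in_use = ["builds-api:0.0.130"]
    available = ["builds-api:0.0.131"]
    candidates = ["builds-api:0.0.129", "builds-api:0.0.130"]

    # Anything printed here would fail to pull on the next pod restart.
    print(missing_images(in_use, available))

    # Only images that nothing references are offered for deletion.
    print(safe_to_prune(candidates, in_use))
```

The first check could feed an alert, and the second could gate the expiration policy; neither covers rollbacks to versions that are not currently running.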
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
_______________________________________________
Cloud-admin mailing list -- cloud-admin(a)lists.wikimedia.org
List information:
https://lists.wikimedia.org/postorius/lists/cloud-admin.lists.wikimedia.org/
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."