I wanted to share an interesting failure I just saw on the Toolforge
cluster. The order of events went roughly like this:
1. builds-api had a change merged that was never deployed to the live
clusters. The change only affected local development environments, which
I assume is the reason it was never deployed, although it's also
possible that the person merging the change simply forgot. Merging it
published builds-api 0.0.131.
2. The Harbor expiration policy noticed that builds-api 0.0.131
existed, and pruned the images for 0.0.130.
3. The certificates used for communication between the API gateway
and builds-api were renewed by cert-manager, which triggered an
automatic restart of the builds-api deployment (see the sketch right
after this list).
4. The new builds-api pods failed to start, as the image they were
based on no longer existed.
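For context on step 3: cert-manager itself only renews the Secret
holding the certificate; something else has to roll the pods. I'm not
describing our exact wiring here, just sketching one common pattern, an
annotation-driven controller such as stakater's Reloader. The image
reference below also shows where the pruned tag bites (the registry,
names and port are all invented for illustration):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: builds-api                # hypothetical manifest, not our real one
      annotations:
        # Reloader rolls this Deployment whenever the named Secret changes,
        # e.g. when cert-manager renews the certificate stored in it.
        secret.reloader.stakater.com/reload: "builds-api-tls"
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: builds-api
      template:
        metadata:
          labels:
            app: builds-api
        spec:
          containers:
            - name: builds-api
              # Pinned tag: once Harbor prunes it, any pod (re)creation fails.
              image: harbor.example.org/toolforge/builds-api:0.0.130
              ports:
                - containerPort: 8000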
Now, in this case, Kubernetes worked as expected: it noticed that the
new pod did not come up, stopped restarting any further pods, and did
not send any traffic to the single restarted pod.
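For anyone wondering why the blast radius stayed at one pod: a standard
rolling update combined with a readiness probe gives exactly this
containment. A minimal sketch of the relevant Deployment fragment (the
probe path, port and surge values are assumptions, not our actual
config):

    # Fragment of a Deployment spec; values are illustrative.
    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 1         # bring up one replacement pod at a time
        maxUnavailable: 0   # never remove a healthy pod before its
                            # replacement is Ready
    template:
      spec:
        containers:
          - name: builds-api
            image: harbor.example.org/toolforge/builds-api:0.0.130  # hypothetical
            readinessProbe:      # a pod stuck in ImagePullBackOff never becomes
              httpGet:           # Ready, so it gets no Service traffic and the
                path: /healthz   # rollout stalls instead of cascading
                port: 8000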
However, the ticking time bomb of the expiring certificates remained:
the API would have gone down once the old certs expired, and any node
restart would have risked taking the whole service down.
I filed a few tasks, mostly about noticing these kinds of issues automatically:
* https://phabricator.wikimedia.org/T358908 Alert when
toolforge-deploy changes are not deployed
* https://phabricator.wikimedia.org/T358909 Alert when admin managed
pods are having issues
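For the second task, something along these lines might work, assuming
kube-state-metrics and Prometheus are available in the cluster (the
rule name, threshold and duration are made up):

    groups:
      - name: toolforge-admin-pods
        rules:
          - alert: AdminPodNotStarting
            # kube-state-metrics exposes the waiting reason per container;
            # this catches pods stuck pulling a pruned or missing image.
            expr: |
              sum by (namespace, pod) (
                kube_pod_container_status_waiting_reason{reason=~"ImagePullBackOff|ErrImagePull"}
              ) > 0
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} cannot pull its image"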
In addition, we should consider setting up explicit
PodDisruptionBudgets for the admin services we manage.
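Roughly one of these per service, with names and labels adjusted to
whatever the real manifests use:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: builds-api
      namespace: builds-api    # hypothetical namespace
    spec:
      minAvailable: 1          # voluntary evictions (e.g. node drains) must
      selector:                # always leave at least one pod running
        matchLabels:
          app: builds-api      # assumed label; must match the Deployment's pods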
However, what I'm less certain about is how to prevent the missing
image in the first place:
* Can we store all release tagged images indefinitely? How much
storage space would that take?
* If not, how can we prevent images that are still in use from just
disappearing like that? How do we ensure that rollbacks will always
work as expected?
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
Hi there,
Last year, we started evaluating how we could refresh the way we relate
to (deploy, maintain, upgrade) our Openstack deployment for Cloud VPS [0].
One of the most compelling options we found was to run Openstack inside
Kubernetes, using an upstream project called openstack-helm.
But... what if we stopped running Openstack altogether?
To clarify, the base idea I had is:
* deploy Kubernetes to a bunch of hosts in one of our Wikimedia datacenters
** we know how to do it!
** this would be the base, undercloud, or bedrock, whatever.
* deploy ceph next to k8s (or maybe even inside it?)
** ceph would remain the preferred network storage solution
* deploy some kind of k8s multiplexing tech
** example: https://www.vcluster.com/ but there could be others
** using this, create a dedicated k8s cluster for each project, for example:
toolforge/toolsbeta/etc
* Inside this new VM-less toolforge, we can retain pretty much the same
functionality as today:
** a container listening on 22/tcp with kubectl & the toolforge cli installed
can be the login bastion (see the sketch after this list)
** the NFS server can run in a container, backed by ceph
** toolsDB can run in a container too. Can't it? Or maybe we replace it with
some k8s-native solution
* If we need any of the native Openstack components, for example Keystone or
Swift, we may run them in a standalone fashion inside k8s.
* We already have some base infrastructure (and knowledge) that would support
this model. We have cloudlbs, cloudgw, we know how to do ceph, etc.
* And finally, most importantly: the community. The main question could be:
** Is there any software running on Cloud VPS virtual machines that cannot run
in a container on Kubernetes?
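To make the login-bastion idea above a bit more concrete, here is a rough
sketch of what it could look like; every name, image and port here is
invented for illustration:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: login-bastion
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: login-bastion
      template:
        metadata:
          labels:
            app: login-bastion
        spec:
          containers:
            - name: sshd
              # Image would ship sshd plus kubectl and the toolforge cli.
              image: example.org/toolforge/bastion:latest
              ports:
                - containerPort: 22
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: login-bastion
    spec:
      type: LoadBalancer    # e.g. fronted by the existing cloudlb layer
      selector:
        app: login-bastion
      ports:
        - port: 22
          targetPort: 22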
I started this email hoping to collect a list of use cases, blockers,
and strong opinions about why running Openstack is important (or not).
I'm pretty sure I'm overlooking some important thing.
I plan to document all this on wikitech, and/or maybe phabricator.
You may ask: why stop doing Openstack? I will answer that in a separate
email to keep this one as short as possible.
Looking forward to your counter-arguments.
Thanks!
regards.
[0]
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…