I wanted to share an interesting failure I just saw on the Toolforge
cluster. The order of events went roughly like this:
1. builds-api had a change merged that was never deployed to the live
clusters. The change only affected local development environments, which
I assume is the reason it was never deployed, although it's also
possible that the person merging the change simply forgot. Merging it
published builds-api 0.0.131.
2. The Harbor expiration policy noticed that builds-api 0.0.131
existed, and pruned the images for 0.0.130.
3. The certificates used for communication between the API gateway
and builds-api were renewed by cert-manager, which triggered an
automatic restart of the builds-api deployment (see the sketch right
after this list).
4. The new builds-api pods failed to start, as the image they were
based on no longer existed.
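For context on step 3: cert-manager itself only renews the Secret
holding the certificate; something else has to roll the pods. I'm not
describing our exact wiring here, just sketching one common pattern, an
annotation-driven controller such as stakater's Reloader. The image
reference below also shows where the pruned tag bites (the registry,
names and port are all invented for illustration):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: builds-api                # hypothetical manifest, not our real one
      annotations:
        # Reloader rolls this Deployment whenever the named Secret changes,
        # e.g. when cert-manager renews the certificate stored in it.
        secret.reloader.stakater.com/reload: "builds-api-tls"
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: builds-api
      template:
        metadata:
          labels:
            app: builds-api
        spec:
          containers:
            - name: builds-api
              # Pinned tag: once Harbor prunes it, any pod (re)creation fails.
              image: harbor.example.org/toolforge/builds-api:0.0.130
              ports:
                - containerPort: 8000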
Now, in this case, Kubernetes worked as expected: it noticed that the
new pod did not come up, stopped restarting any further pods, and did
not send any traffic to the single restarted pod.
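For anyone wondering why the blast radius stayed at one pod: a standard
rolling update combined with a readiness probe gives exactly this
containment. A minimal sketch of the relevant Deployment fragment (the
probe path, port and surge values are assumptions, not our actual
config):

    # Fragment of a Deployment spec; values are illustrative.
    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 1         # bring up one replacement pod at a time
        maxUnavailable: 0   # never remove a healthy pod before its
                            # replacement is Ready
    template:
      spec:
        containers:
          - name: builds-api
            image: harbor.example.org/toolforge/builds-api:0.0.130  # hypothetical
            readinessProbe:      # a pod stuck in ImagePullBackOff never becomes
              httpGet:           # Ready, so it gets no Service traffic and the
                path: /healthz   # rollout stalls instead of cascading
                port: 8000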
However, the ticking time bomb of the expiring certificates remained:
the API would have gone down once the old certs expired, and any node
restart would have risked taking the whole service down.
I filed a few tasks, mostly about noticing these kinds of issues automatically:
* https://phabricator.wikimedia.org/T358908 Alert when
toolforge-deploy changes are not deployed
* https://phabricator.wikimedia.org/T358909 Alert when admin managed
pods are having issues
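For the second task, something along these lines might work, assuming
kube-state-metrics and Prometheus are available in the cluster (the
rule name, threshold and duration are made up):

    groups:
      - name: toolforge-admin-pods
        rules:
          - alert: AdminPodNotStarting
            # kube-state-metrics exposes the waiting reason per container;
            # this catches pods stuck pulling a pruned or missing image.
            expr: |
              sum by (namespace, pod) (
                kube_pod_container_status_waiting_reason{reason=~"ImagePullBackOff|ErrImagePull"}
              ) > 0
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} cannot pull its image"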
In addition, we should consider setting up explicit
PodDisruptionBudgets for the admin services we manage.
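Roughly one of these per service, with names and labels adjusted to
whatever the real manifests use:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: builds-api
      namespace: builds-api    # hypothetical namespace
    spec:
      minAvailable: 1          # voluntary evictions (e.g. node drains) must
      selector:                # always leave at least one pod running
        matchLabels:
          app: builds-api      # assumed label; must match the Deployment's pods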
However, what I'm less certain about is how to prevent the missing
image in the first place:
* Can we store all release tagged images indefinitely? How much
storage space would that take?
* If not, how can we prevent images that are still in use from just
disappearing like that? How do we ensure that rollbacks will always
work as expected?
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
Hi there,
Last year, we started evaluating how we could refresh the way we relate
to (deploy, maintain, upgrade) our Openstack deployment for Cloud VPS [0].
One of the most compelling options we found was to run Openstack inside
Kubernetes, using an upstream project called openstack-helm.
But... what if we stopped running Openstack altogether?
To clarify, the base idea I had is:
* deploy Kubernetes to a bunch of hosts in one of our Wikimedia datacenters
** we know how to do it!
** this would be the base, undercloud, or bedrock, whatever.
* deploy ceph next to k8s (or maybe even inside it?)
** ceph would remain the preferred network storage solution
* deploy some kind of k8s multiplexing tech
** example: https://www.vcluster.com/ but there could be others
** using this, create a dedicated k8s cluster for each project, for example:
toolforge/toolsbeta/etc
* Inside this new VM-less toolforge, we can retain pretty much the same
functionality as today:
** a container listening on 22/tcp with kubectl & the toolforge cli installed
can be the login bastion (see the sketch after this list)
** the NFS server can run in a container, backed by ceph
** toolsDB can run in a container too. Can't it? Or maybe we replace it with
some k8s-native solution
* If we need any of the native Openstack components, for example Keystone or
Swift, we may run them in a standalone fashion inside k8s.
* We already have some base infrastructure (and knowledge) that would support
this model. We have cloudlbs, cloudgw, we know how to do ceph, etc.
* And finally, most importantly: the community. The main question could be:
** Is there any software running on Cloud VPS virtual machines that cannot run
in a container on Kubernetes?
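To make the login-bastion idea above a bit more concrete, here is a rough
sketch of what it could look like; every name, image and port here is
invented for illustration:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: login-bastion
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: login-bastion
      template:
        metadata:
          labels:
            app: login-bastion
        spec:
          containers:
            - name: sshd
              # Image would ship sshd plus kubectl and the toolforge cli.
              image: example.org/toolforge/bastion:latest
              ports:
                - containerPort: 22
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: login-bastion
    spec:
      type: LoadBalancer    # e.g. fronted by the existing cloudlb layer
      selector:
        app: login-bastion
      ports:
        - port: 22
          targetPort: 22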
I started this email hoping to collect a list of use cases, blockers,
and strong opinions about why running Openstack is important (or not).
I'm pretty sure I'm overlooking some important thing.
I plan to document all this on wikitech, and/or maybe phabricator.
You may ask: why stop doing Openstack? I will answer that in a separate
email to keep this one as short as possible.
Looking forward to your counter-arguments.
Thanks!
regards.
[0]
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…