I wanted to share an interesting failure I just saw on the Toolforge
cluster. The order of events went roughly like this:
1. builds-api had a change merged that was never deployed to the live
clusters. The change only affected local development environments, which
I assume is why it was never deployed, although it's also possible that
the person merging it simply forgot. Either way, the merge published
builds-api 0.0.131.
2. The Harbor expiration policy noticed that builds-api 0.0.131
existed, and pruned the images for 0.0.130.
3. The certificates used for communications between the API gateway
and builds-api got renewed by cert-manager, and this triggered an
automatic restart for the builds-api deployment.
4. The new builds-api pods failed to start, as the image they were
based on no longer existed.
Now, in this case Kubernetes worked as expected: it noticed that the
new deployment did not come up, stopped restarting any further pods,
and did not send any traffic to the single restarted pod.
However, the ticking time bomb of the expiring certificates remained,
as the API would have gone down once the old certs expired, and any
node restart would have risked taking the whole service down.
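For the record, a stuck rollout like this is easy to spot by hand with
something like the following (the deployment and namespace names here
are assumptions on my part):

$ kubectl sudo rollout status deployment/builds-api -n builds-api

The problem is that nothing was watching for it automatically, hence
the tasks below.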
I filed a few tasks, mostly about noticing these kinds of issues automatically:
* https://phabricator.wikimedia.org/T358908 Alert when
toolforge-deploy changes are not deployed
* https://phabricator.wikimedia.org/T358909 Alert when admin managed
pods are having issues
In addition, we should consider setting up explicit
PodDisruptionBudgets for the admin services we manage.
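For example, a minimal PodDisruptionBudget could be created with
something like this (the namespace and label selector here are
assumptions, not necessarily what builds-api actually uses):

$ kubectl create poddisruptionbudget builds-api \
      --namespace=builds-api \
      --selector=app.kubernetes.io/name=builds-api \
      --min-available=1

That would keep at least one builds-api pod up during voluntary
disruptions such as node drains.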
However, what I'm less certain about is how to prevent the image from
going missing in the first place:
* Can we store all release-tagged images indefinitely? How much
storage space would that take?
* If not, how can we prevent images that are still in use from just
disappearing like that? How do we ensure that rollbacks will always
work as expected?
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
Hi there,
Last year, we started evaluating how we could refresh the way we relate to
(deploy, maintain, upgrade) our Openstack deployment for Cloud VPS [0].
One of the most compelling options we found was to run Openstack inside
Kubernetes, using an upstream project called openstack-helm.
But... what if we stopped doing Openstack altogether?
To clarify, the base idea I had is:
* deploy Kubernetes to a bunch of hosts in one of our Wikimedia datacenters
** we know how to do it!
** this would be the base, undercloud, or bedrock, whatever.
* deploy ceph next to k8s (or maybe even inside it?)
** ceph would remain the preferred network storage solution
* deploy some kind of k8s multiplexing tech (see the sketch after this list)
** example: https://www.vcluster.com/ but there could be others
** using this, create a dedicated k8s cluster for each project, for example:
toolforge/toolsbeta/etc
* Inside this new VM-less toolforge, we can retain pretty much the same
functionalities as today:
** a container listening on 22/tcp with kubectl & toolforge cli installed can be
the login bastion
** NFS server can be run on a container, using ceph
** toolsDB can be run on a container, can't it? Or maybe replace it with some
other k8s-native solution
* If we need any of the native Openstack components, for example Keystone or
Swift, we could run them in a standalone fashion inside k8s.
* We already have some base infrastructure (and knowledge) that would support
this model. We have cloudlbs, cloudgw, we know how to do ceph, etc.
* And finally, and most importantly: the community. The main question could be:
** Is there any software running on Cloud VPS virtual machines that cannot run
on a container in kubernetes?
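To make the multiplexing idea a bit more concrete, here is a rough sketch
of what creating a per-project cluster could look like, assuming vcluster
were the tool we picked (the names are just examples and the exact flags
may vary between versions):

$ vcluster create toolsbeta --namespace toolsbeta
$ vcluster connect toolsbeta --namespace toolsbeta

Each project would get its own virtual control plane while sharing the
node pool of the underlying 'bedrock' cluster.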
I'm starting this email thread hoping to collect a list of use cases,
blockers, and strong opinions on why running Openstack is important (or not).
I'm pretty sure I'm overlooking some important thing.
I plan to document all this on wikitech, and/or maybe phabricator.
You may ask: why stop doing Openstack? I will answer that in a different
email to keep this one as short as possible.
Looking forward to your counter-arguments.
Thanks!
regards.
[0]
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
There are now two new Toolforge Kubernetes workers in service,
tools-k8s-worker-nfs-1 and tools-k8s-worker-nfs-2. In addition to the new
naming scheme that will allow non-NFS workers in the future, these hosts
are also running Debian 12 (as opposed to Debian 10 on the existing nodes)
and are using Containerd as the container runtime (the current nodes are
using Docker).
If you see or hear about any strange issues with pods running on these new
nodes, please depool the affected node (`kubectl sudo drain $WORKER` on a
Toolforge bastion) and ping me on IRC or on the task (
https://phabricator.wikimedia.org/T284656).
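For reference, a full depool/repool cycle might look something like the
following; the extra drain flags are standard kubectl options and are an
assumption on my part, they may or may not be needed depending on what is
running on the node:

$ kubectl sudo drain tools-k8s-worker-nfs-1 --ignore-daemonsets --delete-emptydir-data
$ kubectl sudo uncordon tools-k8s-worker-nfs-1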
If there are no major issues I will start replacing more of the older nodes
with these new nodes next week.
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
Hi all!
We recently released the public beta of the build service, which allows users
to build images from their language of choice's package manager settings (for
supported languages), with minimal configuration.
While we are still fixing bugs and corner cases, we should start thinking
about the next steps for Toolforge.
I have started a document with some ideas on how Toolforge could look in
the future for users, and I want to gather other ideas and feedback so
we can write down a high-level view to help define the next round of
improvements to focus on.
The document is here:
https://docs.google.com/document/d/1sqo6YGRn9u-S7V0y9m07cYKA84vQlKa-7_F8p7e…
And it already takes some of the ideas from previous documents/meetings/etc.
The timeline is to brainstorm there until Tuesday next week, and then in
the Toolforge Workgroup check-in we can start trying to coalesce the main
ideas to a workable plan.
Note that this is not meant to be set in stone; instead, we will review it a
couple of times a year to make sure that we are doing what we want to do,
and change course if needed.
Thanks!
Komla has started to disable the grid for tools that seem abandoned.
The workboard for this is at
https://phabricator.wikimedia.org/project/view/6135/ I believe that
tools are moving from 'Unreached Tool' to 'Disabled' as they are disabled.
== How to disable (or re-enable) a tool? ==
There are two scripts, each run in a different place. BOTH scripts
should be run for any tool. It should be safe to run any of these
commands multiple times without additional effect.
To disable the grid for a tool:
On tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/disable_grid_for_tool.py <toolname>
On tools-sgecron-2.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/stop_grid_for_tool.py <toolname>
To re-enable the grid for a tool:
On tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/disable_grid_for_tool.py --enable <toolname>
On tools-sgecron-2.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/stop_grid_for_tool.py --enable <toolname>
== Who can re-enable a tool, and when? ==
This shut-down phase has two goals:
1) Stop grid jobs that no one cares about
2) Provide a 'warning shot' to get attention from users or admins of a
tool who are relying on the tool but not responding to Komla's
correspondence.
Anyone with the necessary logins is encouraged to re-enable tools as
needed. Specifically:
- If you are contacted by a tool admin requesting restoration, feel free
to restore the tool according to the steps above. First, though, please
make sure the admin in question is aware that the grid is going away, and
make sure you (or, better yet, the admin) update the workboard task
associated with the tool, explaining how they plan to deal with the
coming shut-down and how they can be contacted in the future.
- If you are contacted by users of a tool requesting restoration, please
encourage them to reach out to the admin and have the admin request
restoration directly. If it's clear that a tool is needed but has no
reachable admin, add notes to the phab task accordingly, then move the
task into the 'Help wanted' column and add 'Abandoned:' to the task title.
== What is disabling/enabling? ==
The disable scripts do the following:
- set a grid quota that prevents future jobs from being scheduled
- move grid-specific service.manifest files to 'service.disabledmanifest'
- add a 'TOOL_DISABLED' file to the tool's home directory
- archive the crontab
- qdel all existing grid jobs
Enable scripts do this:
- remove restrictive grid quota, permitting jobs to be scheduled
- move 'service.disabledmanifest' back to service.manifest if no
service.manifest is currently present
- remove 'TOOL_DISABLED' file
- restore crontab
Note that the enable scripts do not actively start anything, so
non-webservice tools will likely require a manual start after enabling.
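For reference, bringing a tool back up after re-enabling might look
roughly like this; the exact invocations depend on the tool, these are
only the typical cases:

$ become <toolname>
$ webservice --backend=gridengine start   # typical for grid web tools
$ jstart -N <jobname> <command>           # typical for continuous jobs

When in doubt, ask the maintainer to use whatever start procedure they
normally use.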
We ran out of capacity on the Toolforge Kubernetes cluster yesterday,
seemingly due to a large number of tools migrating from the grid engine to
Kubernetes and a temporary decrease in capacity during a cluster-wide
reboot to recover from an NFS blip. I've provisioned some extra nodes to fix
the immediate issue, but the total CPU requests are still around 90% of the
total cluster capacity. (Note that this does not mean that we're using 90%
of CPU power available there, I'll come back to this in a bit.)
*In case the cluster starts acting up again*: follow
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolfor…
to provision more capacity. That runbook also has a link to the Grafana
dashboard for cluster capacity and instructions on what specific metrics to
worry about there, given that there are no alerts for it yet
<https://phabricator.wikimedia.org/T352581>.
As I said, we seem to be overprovisioning CPUs by a lot compared to actual
usage: `kubectl sudo top node` shows a majority of nodes being below 10% of
actual CPU utilization. So in the near term we should look at tweaking the
resource allocation logic, especially for web services.
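For a quick look at how far apart the requests and the actual usage are,
something like this from a bastion works (the 'Allocated resources'
section is part of the standard kubectl describe output):

$ kubectl sudo top node
$ kubectl sudo describe nodes | grep -A 5 'Allocated resources'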
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
I just restricted the Debian 10 Buster Cloud VPS image to the Toolforge
project. Puppet 7 packages aren't available for Buster, and it won't be
possible to enroll new Puppet 5 clients into the Puppet 7 based infrastructure.
I did, however, allow the Toolforge project to keep access to the image, as I
suspect we might need to provision new K8s nodes (and decommission some grid
nodes) using Buster, since running them on newer OS versions still needs some
more setup and testing.
Further details are on the task <https://phabricator.wikimedia.org/T351499>.
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
During our team meeting today we settled on the following timeline for
shutting down the grid. I'm emailing here for posterity; Komla will take
care of notifying our users.
* This week: prepare pywikibot migration solution
https://phabricator.wikimedia.org/T249787 (Taavi + Raymond)
* Next week: media blitz (Komla)
** emails
** wikitech site notice
** wikitech list email
** cloud-announce email
** wikitech talk pages
* Late Nov/Early Dec: command-line notification
** jsub and company are wrapped in some way to notify users about the
incoming shutoff (see the sketch below the timeline)
* Dec 14th: warning shot
** Tools owned by unresponsive admins or unreachable admins are stopped
** Tools run by unresponsive admins get their crontab entries commented
out, and warnings inserted
* Feb 14th: grid is stopped entirely
** All tools stopped, new submissions prevented
* Mar 14th: grid infra deleted
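As a very rough sketch of what the command-line notification could look
like (all paths, names and wording here are hypothetical, just to
illustrate the wrapping idea):

#!/bin/bash
# hypothetical wrapper installed in place of jsub, with the real binary moved aside
echo "WARNING: the Toolforge grid engine is being shut down; please migrate this tool to Kubernetes." >&2
exec /usr/bin/jsub.real "$@"

The same approach would apply to the rest of the grid submission commands.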
Hello Admins,
We are getting ready to announce to the community that the Build Service [0]
is now in open beta.
This follows the previous rounds of testing we carried out with selected
users.
Kindly take a look at the draft of the announcement email here[1].
Please review and provide any feedback you may have.
The announcement will be made early next week.
Thank you!
[0]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service
[1]: https://etherpad.wikimedia.org/p/build-service-open-beta
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
Hi,
we plan on moving the Cloud VPS restricted bastion [1] to a new VM
based on Bookworm. The hostname will remain the same
(restricted.bastion.wmcloud.org) but it will point to a new VM running
Bookworm [2].
This will happen later today. If you SSH to a Cloud VPS instance through
this bastion after the change, you will get a host key error and will have
to update the bastion's fingerprint in your "known_hosts" file.
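The quickest way to clear the old key is usually something like this
(assuming the default known_hosts location):

$ ssh-keygen -R restricted.bastion.wmcloud.org

The next connection will then prompt you to accept the new fingerprint.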
When the new server is live, I will update the fingerprints listed in
wikitech [3], so please verify they match what you see in your
terminal before accepting them. (Ideally this would be handled by
wmf-sre-laptop, see T329322.)
Thanks,
Francesco
[1] https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances#Setup
[2] https://phabricator.wikimedia.org/T340241#9202859
[3] https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/restricted.bastio…
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation