Cloud-admin December 2023

cloud-admin@lists.wikimedia.org

2 participants
2 discussions

de-gridding abandoned tools
by Andrew Bogott 19 Dec '23


Komla has started to disable the grid for tools that seem abandoned. The workboard for this is at https://phabricator.wikimedia.org/project/view/6135/

I believe that tools are moving from 'Unreached Tool' to 'Disabled' as they are disabled.

== How to disable (or re-enable) a tool? ==

There are two scripts, each run on a different host. BOTH scripts should be run for any tool. It should be safe to run any of these commands multiple times without additional effect.

To disable the grid for a tool:

On tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/disable_grid_for_tool.py <toolname>

On tools-sgecron-2.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/stop_grid_for_tool.py <toolname>

To re-enable the grid for a tool:

On tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/disable_grid_for_tool.py --enable <toolname>

On tools-sgecron-2.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/stop_grid_for_tool.py --enable <toolname>

== Who can re-enable a tool, and when? ==

This shut-down phase has two goals:

1) Stop grid jobs that no one cares about
2) Provide a 'warning shot' to get attention from users or admins of a tool who are relying on the tool but not responding to Komla's correspondence.

Anyone with the necessary logins is encouraged to re-enable tools as needed. Specifically:

- If you are contacted by a tool admin requesting restoration, feel free to restore the tool according to the steps above. First, though, please make sure the admin is aware that the grid is going away, and make sure you (or better yet the admin) update the workboard task associated with the tool, explaining how they plan to deal with the coming shut-down and how they can be contacted in the future.
- If you are contacted by users of a tool requesting restoration, please encourage them to reach out to the admin and have the admin request restoration directly.
- If it's clear that a tool is needed but has no reachable admin, add notes to the phab task accordingly, then move the task into the 'Help wanted' column and add 'Abandoned:' to the task title.

== What is disabling/enabling? ==

The disable scripts do the following:

- set a grid quota that prevents future jobs from being scheduled
- move grid-specific service.manifest files to 'service.disabledmanifest'
- add a 'TOOL_DISABLED' file to the tool home
- archive crontab
- qdel all existing grid jobs

The enable scripts do the following:

- remove the restrictive grid quota, permitting jobs to be scheduled
- move 'service.disabledmanifest' back to service.manifest if no service.manifest is currently present
- remove the 'TOOL_DISABLED' file
- restore crontab

Note that the enable scripts do not actively start anything, so non-webservice tools will likely require a manual start after enabling.
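The file-level part of the steps above can be sketched as follows. This is a minimal illustration and not the code in /srv/disable-tool/: the function names `disable_tool_files` and `enable_tool_files` are hypothetical, the tool home is just a directory passed in by the caller, every service.manifest is treated as grid-specific, and the quota, crontab and qdel steps are left out. It mainly shows why repeating either operation has no additional effect.

```python
from pathlib import Path

MANIFEST = "service.manifest"
DISABLED_MANIFEST = "service.disabledmanifest"
MARKER = "TOOL_DISABLED"


def disable_tool_files(tool_home: Path) -> None:
    """Hypothetical sketch of the file steps performed when disabling."""
    manifest = tool_home / MANIFEST
    if manifest.exists():
        # Move the manifest aside; nothing happens if it was already moved.
        manifest.replace(tool_home / DISABLED_MANIFEST)
    # touch() succeeds whether or not the marker file already exists.
    (tool_home / MARKER).touch()


def enable_tool_files(tool_home: Path) -> None:
    """Hypothetical sketch of the file steps performed when enabling."""
    manifest = tool_home / MANIFEST
    disabled = tool_home / DISABLED_MANIFEST
    # Restore only if no service.manifest is currently present.
    if disabled.exists() and not manifest.exists():
        disabled.replace(manifest)
    # missing_ok makes a second run a no-op (Python 3.8+).
    (tool_home / MARKER).unlink(missing_ok=True)
```

Each step checks the current state before acting, so calling either function twice in a row leaves the directory exactly as it was after the first call.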


Toolforge Kubernetes cluster capacity issues
by Taavi Väänänen 02 Dec '23


We ran out of capacity on the Toolforge Kubernetes cluster yesterday, seemingly due to a large number of tools migrating from the grid engine to Kubernetes and a temporary decrease in capacity during a cluster-wide reboot to recover from an NFS blip. I've provisioned some extra nodes to fix the immediate issue, but the total CPU requests are still around 90% of the total cluster capacity. (Note that this does not mean that we're using 90% of the CPU power available there; I'll come back to this in a bit.)

*In case the cluster starts acting up again*: follow https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolfor… to provision more capacity. That runbook also has a link to the Grafana dashboard for cluster capacity and instructions on which specific metrics to worry about there, given that there are no alerts for it yet <https://phabricator.wikimedia.org/T352581>.

As I said, we seem to be overprovisioning CPUs by a lot compared to actual usage: `kubectl sudo top node` shows a majority of nodes being below 10% actual CPU utilization. So in the near term we should look at tweaking the resource allocation logic, especially for web services.

Taavi

--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
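To make the difference between requested and actually used CPU concrete, here is a small arithmetic sketch. The node names and numbers are invented for illustration and are not measurements from the Toolforge cluster; `Node`, `request_ratio` and `idle_nodes` are hypothetical names as well.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity_cpu: float   # CPU cores the node can offer
    requested_cpu: float  # sum of CPU requests of pods scheduled on the node
    used_cpu: float       # CPU cores actually in use


def request_ratio(nodes):
    """Fraction of total capacity that is reserved by CPU requests."""
    total_requested = sum(n.requested_cpu for n in nodes)
    total_capacity = sum(n.capacity_cpu for n in nodes)
    return total_requested / total_capacity


def idle_nodes(nodes, threshold=0.10):
    """Names of nodes whose actual utilization is below the threshold."""
    return [n.name for n in nodes if n.used_cpu / n.capacity_cpu < threshold]


# Made-up example: nearly full by requests, mostly idle by usage.
nodes = [
    Node("worker-a", capacity_cpu=8.0, requested_cpu=7.5, used_cpu=0.5),
    Node("worker-b", capacity_cpu=8.0, requested_cpu=7.0, used_cpu=0.6),
    Node("worker-c", capacity_cpu=8.0, requested_cpu=7.1, used_cpu=2.0),
]

print(f"requested share of capacity: {request_ratio(nodes):.0%}")
print("nodes below 10% utilization:", idle_nodes(nodes))
```

The scheduler places pods based on requests, so a cluster like this one cannot accept much more work even though most of its CPU time goes unused.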

