We ran out of capacity on the Toolforge Kubernetes cluster yesterday, seemingly due to a large number of tools migrating from the grid engine to Kubernetes combined with a temporary decrease in capacity during a cluster-wide reboot to recover from an NFS blip. I've provisioned some extra nodes to fix the immediate issue, but the total CPU requests are still around 90% of the total cluster capacity. (Note that this does not mean we're using 90% of the CPU power available there; I'll come back to this in a bit.)
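To make the requests-vs-capacity distinction concrete, here is a minimal sketch of the arithmetic behind that 90% figure. The numbers are made up for illustration; the real values come from the capacity dashboard (or from summing the "Allocated resources" sections of `kubectl sudo describe nodes`):

```shell
# Hypothetical totals: allocatable CPU cores across all workers, and the
# sum of all pod CPU requests. Requests are what the scheduler reserves,
# regardless of whether the pods actually use that CPU.
allocatable_cores=160
requested_cores=144

awk -v a="$allocatable_cores" -v r="$requested_cores" \
    'BEGIN { printf "CPU requested: %.0f%% of capacity\n", 100 * r / a }'
```

The scheduler refuses to place new pods once requests approach allocatable capacity, so the cluster can "fill up" even while actual CPU usage stays low.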

In case the cluster starts acting up again: follow https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity to provision more capacity. That runbook also links to the Grafana dashboard for cluster capacity and explains which specific metrics to watch there, since there are no alerts for them yet.

As I said, we seem to be overprovisioning CPUs by a lot compared to actual usage: `kubectl sudo top node` shows a majority of nodes below 10% actual CPU utilization. So in the near term we should look at tweaking the resource allocation logic, especially for web services.
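For anyone wanting to eyeball this themselves, here is one way to pick the mostly-idle nodes out of `kubectl sudo top node` output. Since the command needs cluster access, the sketch below runs against hypothetical sample output (node names and numbers are invented); in practice you'd pipe the real command into the same awk filter:

```shell
# Stand-in for `kubectl sudo top node --no-headers`; columns are
# NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%. Values are fabricated.
sample_top_node() {
  cat <<'EOF'
tools-k8s-worker-1   210m   5%    3012Mi   40%
tools-k8s-worker-2   180m   4%    2890Mi   38%
tools-k8s-worker-3   950m   23%   4100Mi   55%
EOF
}

# $3+0 coerces "5%" to the number 5; print nodes under 10% CPU use.
sample_top_node | awk '$3+0 < 10 { print $1, "idle at", $3 }'
```

With the sample data this flags worker-1 and worker-2, which matches the pattern described above: most nodes reserved far more CPU than they actually burn.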

Taavi

--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation