We ran out of capacity on the Toolforge Kubernetes cluster yesterday, seemingly due to a large number of tools migrating from the grid engine to Kubernetes and a temporary decrease in capacity during a cluster-wide reboot to recover from an NFS blip. I've provisioned some extra nodes to fix the immediate issue, but the total CPU requests are still around 90% of the total cluster capacity. (Note that this does not mean that we're using 90% of the CPU power available there; I'll come back to this in a bit.)
As I said, we seem to be reserving far more CPU than tools actually use: `kubectl sudo top node` shows a majority of nodes below 10% actual CPU utilization. So in the near term we should look at tweaking the resource allocation logic, especially for web services.
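To put a rough number on that gap, here is a minimal Python sketch that counts how many nodes sit below a utilization threshold. It assumes the usual `kubectl top node` column layout (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%). The helper names and the sample output are made up for illustration and are not real data from the cluster.

```python
def parse_cpu_percent(top_output):
    """Return {node_name: cpu_percent} from `kubectl top node`-style text."""
    usage = {}
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Nodes without metrics show "<unknown>" instead of a percentage.
        if len(fields) < 3 or not fields[2].endswith("%"):
            continue
        usage[fields[0]] = int(fields[2].rstrip("%"))
    return usage


def share_below(usage, threshold=10):
    """Fraction of nodes whose CPU utilization is below `threshold` percent."""
    if not usage:
        return 0.0
    return sum(1 for pct in usage.values() if pct < threshold) / len(usage)


# Hypothetical sample output; node names and numbers are illustrative only.
SAMPLE = """\
NAME            CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
example-node-1  312m         7%     5120Mi          32%
example-node-2  198m         4%     4300Mi          27%
example-node-3  1450m        36%    9100Mi          57%
example-node-4  402m         9%     6000Mi          38%
"""

usage = parse_cpu_percent(SAMPLE)
print(f"{share_below(usage):.0%} of nodes are below 10% CPU")
```

In practice the text would come from piping the real command's output into the script rather than from a hard-coded sample. Comparing that share against the roughly 90% requested figure gives a sense of how much headroom the current requests leave unused.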
Taavi
-- Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation