Komla has started to disable the grid for tools that seem abandoned.
The workboard for this is at
https://phabricator.wikimedia.org/project/view/6135/
I believe that tools are moved from 'Unreached Tool' to 'Disabled' as
they are disabled.
== How to disable (or re-enable) a tool? ==
There are two scripts, each run in a different place. BOTH scripts
should be run for any tool. It should be safe to run any of these
commands multiple times without additional effect.
To disable the grid for a tool:
On tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/disable_grid_for_tool.py <toolname>
On tools-sgecron-2.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/stop_grid_for_tool.py <toolname>
To re-enable the grid for a tool:
On tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/disable_grid_for_tool.py --enable <toolname>
On tools-sgecron-2.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/stop_grid_for_tool.py --enable <toolname>
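Because both scripts have to be run, on different hosts, it is easy to forget one of them. The sketch below generates the matching pair of commands for a tool. It is only an illustration: the function name grid_tool_commands is hypothetical, it assumes you reach both hosts over ssh, and it prints the commands rather than executing anything.

```shell
# Hypothetical helper (not part of /srv/disable-tool): print the two
# commands needed to disable or re-enable the grid for one tool.
# Nothing is executed; the output is meant to be reviewed first.
grid_tool_commands() {
    action="$1"   # "disable" or "enable"
    tool="$2"     # tool name
    case "$action" in
        disable) flag="" ;;
        enable)  flag="--enable " ;;
        *) echo "usage: grid_tool_commands disable|enable <toolname>" >&2
           return 1 ;;
    esac
    if [ -z "$tool" ]; then
        echo "usage: grid_tool_commands disable|enable <toolname>" >&2
        return 1
    fi
    # Assumption: plain ssh access to both hosts.
    echo "ssh tools-sgegrid-master.tools.eqiad1.wikimedia.cloud sudo /srv/disable-tool/disable_grid_for_tool.py ${flag}${tool}"
    echo "ssh tools-sgecron-2.tools.eqiad1.wikimedia.cloud sudo /srv/disable-tool/stop_grid_for_tool.py ${flag}${tool}"
}
```

For example, `grid_tool_commands enable mytool` prints one line per host, each ending in `--enable mytool`.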
== Who can re-enable a tool, and when? ==
This shut-down phase has two goals:
1) Stop grid jobs that no one cares about
2) Provide a 'warning shot' to get attention from users or admins of a
tool who are relying on the tool but not responding to Komla's
correspondence.
Anyone with the necessary logins is encouraged to re-enable tools as
needed. Specifically:
- If you are contacted by a tool admin requesting restoration, feel free
to restore the tool according to the steps above. First, though, please
make sure the concerned admin is aware that the grid is going away, and
make sure you (or better yet the admin) update the workboard task
associated with the tool explaining how they plan to deal with the
coming shut-down and how they can be contacted in the future.
- If you are contacted by users of a tool requesting restoration, please
encourage them to reach out to the admin and have the admin request
restoration directly. If it's clear that a tool is needed but has no
reachable admin, add notes to the phab task accordingly, then move the
task into the 'Help wanted' column and add 'Abandoned:' to the task title.
== What is disabling/enabling? ==
The disable scripts do the following:
- set a grid quota that prevents future jobs from being scheduled
- move grid-specific service.manifest files to 'service.disabledmanifest'
- add a 'TOOL_DISABLED' file to the tool home
- archive crontab
- qdel all existing grid jobs
Enable scripts do this:
- remove restrictive grid quota, permitting jobs to be scheduled
- move 'service.disabledmanifest' back to service.manifest if no
service.manifest is currently present
- remove 'TOOL_DISABLED' file
- restore crontab
Note that the enable scripts do not actively start anything. So
non-webservice tools will likely require a manual start after enabling.
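Since the disable scripts leave a 'TOOL_DISABLED' file in the tool home and the enable scripts remove it, looking for that file is a quick way to see which state a tool is in. A minimal sketch, assuming the tool home directory is passed in explicitly; the function name tool_grid_state is illustrative and not part of the scripts:

```shell
# Hypothetical check: report whether a tool home carries the
# TOOL_DISABLED marker file described above.
tool_grid_state() {
    home="$1"   # path to the tool's home directory
    if [ -e "$home/TOOL_DISABLED" ]; then
        echo "disabled"
    else
        # Absence of the marker only means the disable script has not
        # left its file here; it says nothing about running jobs.
        echo "not disabled"
    fi
}
```

This only inspects the marker file. It does not check the grid quota, the crontab, or whether any job is actually running.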
We ran out of capacity on the Toolforge Kubernetes cluster yesterday,
seemingly due to a large number of tools migrating from the grid engine to
Kubernetes and a temporary decrease in capacity during a cluster-wide
reboot to recover from an NFS blip. I've provisioned some extra nodes to fix
the immediate issue, but the total CPU requests are still around 90% of the
total cluster capacity. (Note that this does not mean that we're using 90%
of the CPU power available there; I'll come back to this in a bit.)
*In case the cluster starts acting up again*: follow
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolfor…
to provision more capacity. That runbook also has a link to the Grafana
dashboard for cluster capacity and instructions on what specific metrics to
worry about there, given that there are no alerts for it yet
<https://phabricator.wikimedia.org/T352581>.
As I said, we seem to be overprovisioning CPUs by a lot compared to actual
usage: `kubectl sudo top node` shows most nodes at below 10% actual CPU
utilization. So in the near term we should look at tweaking the resource
allocation logic, especially for web services.
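To make the gap between requests and usage concrete: Kubernetes schedules pods against their CPU requests, not against measured usage, so a cluster can be nearly full from the scheduler's point of view while its nodes sit mostly idle. The sketch below uses made-up numbers chosen only to mirror the rough proportions above. None of them are measurements from Toolforge, and the percent helper is hypothetical.

```shell
# percent PART WHOLE -> integer percentage, rounded down
percent() {
    echo $(( $1 * 100 / $2 ))
}

# Illustrative values only, in CPU cores (not real Toolforge figures).
capacity=400    # total allocatable CPU across the cluster
requested=360   # sum of CPU requests of all scheduled pods
used=32         # CPU actually being consumed

echo "requested: $(percent "$requested" "$capacity")% of capacity"
echo "used: $(percent "$used" "$capacity")% of capacity"
```

With these numbers the scheduler sees 90% of capacity as taken while only 8% is in use, which is why lowering requests for lightly loaded workloads frees up schedulable capacity without adding nodes.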
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation