TL;DR:
* https://toolsadmin.wikimedia.org now allows marking a tool as "disabled".
* Disabling a tool will immediately stop any running jobs including webservices and prevent maintainers from logging in as the tool.
* Disabled tools are archived and deleted after 40 days.
* Disabled tools can be re-enabled at any time prior to being archived and deleted.

"How can I delete a tool that I no longer want?" is a question that folks have been asking for a very long time. I know of Phabricator tasks going back to at least April 2016 [0] tracking such requests. A bit over 5 years ago I created a Phabricator task to track figuring out how to delete an unused tool [1]. Nearly 18 months ago Andrew Bogott started to look into how we could automate the checklist of cleanup steps that had been developed. By January 2022 Andrew had implemented all of the pieces needed to complete the checklist. This came with a command line tool that Toolforge admins have been able to use to delete a tool. Today we have released updates to Striker (https://toolsadmin.wikimedia.org) which finally expose a "disable tool" button to a tool's maintainers [2].
When a tool is marked as disabled, any running jobs it has on the Grid Engine or Kubernetes backends are stopped. Changes are also made so that new jobs cannot be started, any crontab file is archived, and maintainers are prevented from using `become <tool>`. Normally things stay in this state for 40 days to give everyone a chance to change their minds and re-enable the tool. Once the 40-day timer expires, the system will proceed with cleanup tasks that are more difficult to reverse, including archiving and deleting the tool's $HOME and ToolsDB databases. Ultimately the tool's group and user are deleted from the LDAP directory, which functionally completes the process.
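The timeline above can be sketched in a few lines of Python. This is only an illustration of the 40 day grace period, not the code that Striker or the disable-tool services actually run; the function name, state names, and the idea of storing a single "disabled at" timestamp are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Grace period taken from the announcement. The constant name is hypothetical.
GRACE_PERIOD = timedelta(days=40)


def tool_state(disabled_at, now):
    """Classify a tool by how long ago it was disabled.

    disabled_at is None for a tool that was never disabled (or that was
    re-enabled). The state names are made up for this sketch.
    """
    if disabled_at is None:
        return "enabled"
    if now - disabled_at < GRACE_PERIOD:
        # Jobs are stopped and logins blocked, but re-enabling is still possible.
        return "disabled"
    # Timer expired: archiving and deletion may proceed.
    return "pending_deletion"


if __name__ == "__main__":
    disabled = datetime(2022, 1, 1, tzinfo=timezone.utc)
    print(tool_state(disabled, disabled + timedelta(days=10)))
    print(tool_state(disabled, disabled + timedelta(days=41)))
```

Keeping the decision a pure function of two timestamps makes it easy for several independent services to reach the same answer without coordinating with each other.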
A lot of system administration tasks are kind of boring, but this work turned out to be actually pretty interesting. A Toolforge tool can include quite a number of different parts. There can be jobs running on the Grid Engine and/or Kubernetes, a crontab to start jobs periodically, a database in ToolsDB, credentials for accessing the Wiki Replicas, credentials for accessing the Toolforge Elasticsearch cluster, a $HOME directory on the Toolforge NFS server, and account information in the LDAP directory that powers Developer accounts and Cloud VPS credentials. All of these things would ideally be removed when a tool is deleted. Some of them are things that we would like to create historical archives of in case someone wants to recreate the tool's functionality. And in a perfect world we would also be able to change our minds and start the tool back up if things had not progressed to fully deleting the tool.
Andrew came up with a fairly elegant system to deal with this complexity. He designed a series of processes which are each responsible for a slice of the overall complexity. A process running on the Grid controller is responsible for stopping running Grid Engine jobs and changing the tool's quota so that no new jobs can be started. A process running on the Crontab server archives the tool's crontab configuration. A process running on the Kubernetes controller deletes the tool's credentials for accessing the Kubernetes cluster, the tool's namespace, and by extension removes all processes running in the namespace. A process running on the NFS controller archives the tool's $HOME directory contents and deletes the directory. It also removes the tool from other LDAP membership lists (a tool can be a co-maintainer of another tool) and deletes the tool's user and group from the LDAP directory. A process archives ToolsDB tables. Another process removes the tool's database credentials across the ToolsDB and Wiki Replicas server pools. Many of these processes are implemented in cloud/toolforge/disable-tool on Gerrit [3]. Others were added to existing management controllers for creating Kubernetes and database credentials. The processes all take cues from the LDAP directory and tracking files in the tool's $HOME to create an eventually consistent, decoupled collection of cleanup actions.
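To make the "tracking files" idea a bit more concrete, here is a minimal Python sketch of how independent processes could coordinate through marker files in a tool's $HOME. It is not taken from the disable-tool repository: the step names, the marker file naming scheme, and the helper functions are all hypothetical, and the real services may track their progress quite differently.

```python
from pathlib import Path

# Hypothetical step names, one per independent cleanup process.
STEPS = ("grid", "crontab", "kubernetes", "toolsdb", "replica_credentials")


def marker(home: Path, step: str) -> Path:
    """Path of the (hypothetical) tracking file recording that a step finished."""
    return home / f".disabled.{step}.done"


def run_step(home: Path, step: str, action) -> bool:
    """Run one cleanup action unless its marker shows it already ran.

    Returns True if the action was executed on this call. Because the marker
    is checked first, each process can be re-run safely on every pass.
    """
    done = marker(home, step)
    if done.exists():
        return False
    action()
    done.touch()
    return True


def ready_for_final_delete(home: Path) -> bool:
    """True once every step has left its marker behind.

    Assumption for this sketch: the last process (archiving $HOME and removing
    LDAP entries) waits for all other steps before it acts.
    """
    return all(marker(home, step).exists() for step in STEPS)
```

Each process only needs to know about its own step and the shared directory, so the processes can run on different hosts and on different schedules and still converge on the same end state.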
We still have some work to do to update documentation on wikitech and Phabricator so that folks know where to find the new buttons. If you find documentation that needs to be updated before someone else gets to it, please feel empowered to be [[WP:BOLD]] and update it.

[0]: https://phabricator.wikimedia.org/T133777
[1]: https://phabricator.wikimedia.org/T170355
[2]: https://phabricator.wikimedia.org/T285403
[3]: https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/toolforge/disable-tool/
[[WP:BOLD]]: https://en.wikipedia.org/wiki/Wikipedia:Be_bold

Bryan, on behalf of the Toolforge administration team