TL;DR:
* https://toolsadmin.wikimedia.org now allows marking a tool as "disabled".
* Disabling a tool will immediately stop any running jobs, including
webservices, and prevent maintainers from logging in as the tool.
* Disabled tools are archived and deleted after 40 days.
* Disabled tools can be re-enabled at any time prior to being archived
and deleted.
"How can I delete a tool that I no longer want?" is a question that
folks have been asking for a very long time. I know of Phabricator
tasks going back to at least April 2016 [0] tracking such requests. A
bit over 5 years ago I created a Phabricator task to track figuring
out how to delete an unused tool [1]. Nearly 18 months ago Andrew
Bogott started to look into how we could automate the checklist of
cleanup steps that had been developed. By January 2022 Andrew had
implemented all of the pieces needed to complete the checklist. This came
with a command line tool that Toolforge admins have been able to use
to delete a tool. Today we have released updates to Striker
(<https://toolsadmin.wikimedia.org>) which finally expose a "disable
tool" button to a tool's maintainers [2].
When a tool is marked as disabled any running jobs it has on the Grid
Engine or Kubernetes backends are stopped. Changes are also made so
that new jobs cannot be started, any crontab file is archived, and
maintainers are prevented from using `become <tool>`. Normally things
stay in this state for 40 days to give everyone a chance to change
their minds and re-enable the tool. Once the 40 day timer expires, the
system will proceed with cleanup tasks that are more difficult to
reverse including archiving and deleting the tool's $HOME and ToolsDB
databases. Ultimately the tool's group and user are deleted from the
LDAP directory which functionally completes the process.
A lot of system administration tasks are kind of boring, but this work
turned out to be actually pretty interesting. A Toolforge tool can
include quite a number of different parts. There can be jobs running
on the Grid Engine and/or Kubernetes, a crontab to start jobs
periodically, a database in ToolsDB, credentials for accessing the
Wiki Replicas, credentials for accessing the Toolforge Elasticsearch
cluster, a $HOME directory on the Toolforge NFS server, and account
information in the LDAP directory that powers Developer accounts and
Cloud VPS credentials. All of these things would ideally be removed
when a tool was successfully deleted. Some of them are things that we
would like to create historical archives of in case someone wanted to
recreate the tool's functionality. And in a perfect world we would
also be able to change our minds and start the tool back up if things
had not progressed to fully deleting the tool.
Andrew came up with a fairly elegant system to deal with this
complexity. He designed a series of processes which are each
responsible for a slice of the overall complexity. A process running
on the Grid controller is responsible for stopping running Grid Engine
jobs and changing the tool's quota so that no new jobs can be started.
A process running on the Crontab server archives the tool's crontab
configuration. A process running on the Kubernetes controller deletes
the tool's credentials for accessing the Kubernetes cluster, the
tool's namespace, and by extension removes all processes running in
the namespace. A process running on the NFS controller archives the
tool's $HOME directory contents and deletes the directory. It also
removes the tool from other LDAP membership lists (a tool can be a
co-maintainer of another tool) and deletes the tool's user and group
from the LDAP directory. A process archives ToolsDB tables. Another
process removes the tool's database credentials across the ToolsDB and
Wiki Replicas server pools. Many of these processes are implemented in
cloud/toolforge/disable-tool on Gerrit [3]. Others were added to
existing management controllers for creating Kubernetes and database
credentials. The processes all take cues from the LDAP directory and
tracking files in the tool's $HOME to create an eventually consistent,
decoupled collection of cleanup actions.
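As a rough illustration of that eventually consistent pattern, each cleanup process can follow the same loop: check a marker, do its own slice of the work, and record completion in the tool's $HOME. This is only a sketch, not the actual implementation (which lives in cloud/toolforge/disable-tool); the function name and marker file names here are invented:

```shell
# Hypothetical sketch of one decoupled cleanup step. The marker files
# (.disabled, .crontab-archived) and the function name are invented.
archive_crontab_step() {
  local tool_home="$1"
  # Only act on tools that have been flagged as disabled.
  [ -e "$tool_home/.disabled" ] || return 0
  # Skip work this process has already completed (idempotent, safe to re-run).
  [ -e "$tool_home/.crontab-archived" ] && return 0
  # Do this process's slice of the cleanup: save the crontab to $HOME.
  # (In production this would run with the tool's credentials and also
  # clear the live crontab.)
  crontab -l > "$tool_home/crontab.save" 2>/dev/null || true
  # Record completion so re-runs and other processes can observe progress.
  touch "$tool_home/.crontab-archived"
}
```

Because every step is gated on markers like these, the processes can run independently and repeatedly until the whole cleanup converges.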
We still have some work to do to update documentation on wikitech and
Phabricator so that folks know where to find the new buttons. If you
find documentation that needs to be updated before someone else gets
to it, please feel empowered to be [[WP:BOLD]] and update them.
[0]: https://phabricator.wikimedia.org/T133777
[1]: https://phabricator.wikimedia.org/T170355
[2]: https://phabricator.wikimedia.org/T285403
[3]: https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/toolforge/disable-tool/
[[WP:BOLD]]: https://en.wikipedia.org/wiki/Wikipedia:Be_bold
Bryan, on behalf of the Toolforge administration team
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
The shared NFS servers that back Toolforge have been running close to
full for a while. We are going to free up space by taking the following
steps:
- Remove all files ending with .log and .err that have not been modified
since November 1st, 2021 (e.g. find -name '*.log' -not -newermt "Nov 1,
2021" -exec rm {} \;)
- Truncate all files ending with .log and .err that are larger than 1GB
down to 1GB (e.g. find -name '*.log' -size +1G -exec truncate --size=1G
{} \;)
We'll be running those commands on Friday of this week. If you have any
log or err files of that form that need to NOT be truncated and/or
deleted, rename them now!
Also, please take a moment to run 'du' in your home and tool dirs and
delete any other files that you can live without.
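If you're not sure where the space is going, something like the following can help (a sketch using GNU du/find options; run it in your home or tool directory):

```shell
# Summarize disk usage of each top-level directory here, largest first.
du -h --max-depth=1 . 2>/dev/null | sort -rh | head -n 10

# List individual files larger than 100 MB under the current directory.
find . -type f -size +100M -exec ls -lh {} + 2>/dev/null
```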
Thank you!
-Andrew
As part of routine security maintenance, all Debian Bullseye VMs are due
for a reboot and kernel upgrade. I will be performing these reboots
early next week, either on Monday or Tuesday.
If you want to reboot hosts on your own time (rather than at a random
Andrew-selected time), feel free to reboot your own hosts before then.
-Andrew + the WMCS team
Debian Stretch's security support ends in mid 2022, and the Foundation's
OS policy already discourages use of existing Stretch machines. That
means that it's time for all project admins to start rebuilding your VMs
with Bullseye (or, if you must, Buster).
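If you're not sure which release a VM is on, /etc/os-release will tell you (VERSION_CODENAME is present on Debian 9 and later):

```shell
# Print this machine's Debian codename (e.g. stretch, buster, bullseye).
# VERSION_CODENAME is defined in /etc/os-release on Debian 9+.
[ -r /etc/os-release ] && . /etc/os-release
echo "${VERSION_CODENAME:-unknown}"
```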
Any webservices running in Kubernetes that were created in the last year
or two are most likely using Buster images already, so no action is needed
for those. Older Kubernetes jobs should be refreshed to use more modern
images whenever possible.
If you are still using the grid engine for webservices, we strongly
encourage you to migrate your jobs to Kubernetes. For other grid uses,
watch this space for future announcements about grid engine migration;
we don't yet have a solution prepared for that.
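For a simple webservice, the move is often just a restart on the other backend. This is a sketch, not a universal recipe: the runtime type (python3.7 below) is only an example, and the right type depends on your tool, so check the webservice documentation first. The guard makes the snippet a no-op outside Toolforge:

```shell
# The webservice CLI only exists on Toolforge bastions; guard so this
# sketch is a no-op elsewhere. python3.7 is an example runtime type.
if command -v webservice >/dev/null 2>&1; then
  # Stop the grid-hosted webservice for the current tool...
  webservice --backend=gridengine stop
  # ...and start it again on the Kubernetes backend.
  webservice --backend=kubernetes python3.7 start
fi
```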
Details about the what and why for this process can be found here:
https://wikitech.wikimedia.org/wiki/News/Stretch_deprecation
Here is the deprecation timeline:
March 2021: Stretch VM creation disabled in most projects
July 6, 2021: Active support of Stretch ends, Stretch moves into LTS
<- You are Here ->
January 1st, 2022: Stretch VM creation disabled in all projects,
deprecation nagging begins in earnest. Stretch alternatives will be
available for tool migration in Toolforge
May 1, 2022: All active Stretch VMs will be shut down (but not deleted)
by WMCS admins. This includes Toolforge grid exec nodes.
June 30, 2022: LTS support for Debian Stretch ends, all Stretch VMs will
be deleted by WMCS admins
Hello!
Earlier this year, WMCS initiated the process to migrate tools off the
grid[0].
We also published a series of blog posts explaining in more detail the
reasoning behind this action[1].
We encouraged maintainers to move to Kubernetes if they could but also made
available Debian Buster GridEngine for those tools who were blocked or
otherwise unable to migrate to Kubernetes at that time.
We are aware that not all workloads can easily move from the grid to
Kubernetes.[2]
For some of the current grid workflows, there may be no 1:1 functionality
match on Kubernetes.
Work is underway to address most of these issues[3].
We’re putting together a use case continuity table showing GridEngine
workloads and their equivalent Kubernetes workloads[4].
[image: case continuity.PNG]
To help track the specific migration work, we created a Phabricator
ticket (project tag: grid-engine-to-k8s-migration[5]) for each tool that is
currently running on GridEngine. With a ticket for each tool on GridEngine,
we hope to collect specific blocking issues and have the team work on
addressing them.
We encourage maintainers to reach out if you need help or find you are
blocked by missing features.
We noticed that, after receiving notifications for these tickets, some of
you wondered whether the grid is being shut down immediately.
This is not the case. We will work with tool maintainers to ensure all
tools safely move off the grid (or are safely shut down); only then will we
start looking at decommissioning the grid.
Apologies to those who felt spammed by the ticket creation process and got
worried about the future of their projects. We should have communicated
better around this process.
=== Way Forward ===
The working draft for GridEngine plans and timeline can be found here[6].
If you need further clarifications, reach out to us on the ticket for your
specific tool on Phabricator, or via any of our communication
channels[7].
Thanks!
----------
[0]:
https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/…
[1]: https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/
[2]:
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
GridEngine_plans_and_timeline#Use_case_continuity
[3]: https://phabricator.wikimedia.org/T194332
[4]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
[5]: https://phabricator.wikimedia.org/project/profile/6135/
[6]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation
[7]:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/About_Toolforge#Commun…
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
Hi there,
We are currently working on replacing older hardware servers with newer
ones, in particular those dedicated to cloud networking [0].
We have discovered a few shortcomings, mostly related to network
interface naming on the newer servers, the latest OpenStack version
behaving differently than earlier releases, and some base operating
system (Debian) bugs [1]. Some of these are hardware-dependent and
difficult to reproduce/anticipate in our staging environment.
The result is that we are having a more challenging and noisy migration
than we would like. We already had a few (brief) network outages trying
to introduce the new servers into service.
We'll try to keep things as stable as possible in the next few days
until the migration is completed, but we can't rule out further
(brief) network outages until we are safely on the other side of the
transition.
I'll send another note when this network maintenance is over.
regards.
[0] https://phabricator.wikimedia.org/T316284
[1] https://bugs.debian.org/989162
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation