TL;DR:
* https://toolsadmin.wikimedia.org now allows marking a tool as "disabled".
* Disabling a tool will immediately stop any running jobs, including
webservices, and prevent maintainers from logging in as the tool.
* Disabled tools are archived and deleted after 40 days.
* Disabled tools can be re-enabled at any time prior to being archived
and deleted.
"How can I delete a tool that I no longer want?" is a question that
folks have been asking for a very long time. I know of Phabricator
tasks going back to at least April 2016 [0] tracking such requests. A
bit over 5 years ago I created a Phabricator task to track figuring
out how to delete an unused tool [1]. Nearly 18 months ago Andrew
Bogott started to look into how we could automate the checklist of
cleanup steps that had been developed. By January 2022 Andrew had
implemented all of the pieces needed to complete the checklist. This came
with a command line tool that Toolforge admins have been able to use
to delete a tool. Today we have released updates to Striker
(<https://toolsadmin.wikimedia.org>) which finally expose a "disable
tool" button to a tool's maintainers [2].
When a tool is marked as disabled any running jobs it has on the Grid
Engine or Kubernetes backends are stopped. Changes are also made so
that new jobs cannot be started, any crontab file is archived, and
maintainers are prevented from using `become <tool>`. Normally things
stay in this state for 40 days to give everyone a chance to change
their minds and re-enable the tool. Once the 40 day timer expires, the
system will proceed with cleanup tasks that are more difficult to
reverse including archiving and deleting the tool's $HOME and ToolsDB
databases. Ultimately the tool's group and user are deleted from the
LDAP directory which functionally completes the process.
A lot of system administration tasks are kind of boring, but this work
turned out to be actually pretty interesting. A Toolforge tool can
include quite a number of different parts. There can be jobs running
on the Grid Engine and/or Kubernetes, a crontab to start jobs
periodically, a database in ToolsDB, credentials for accessing the
Wiki Replicas, credentials for accessing the Toolforge Elasticsearch
cluster, a $HOME directory on the Toolforge NFS server, and account
information in the LDAP directory that powers Developer accounts and
Cloud VPS credentials. All of these things would ideally be removed
when a tool was successfully deleted. Some of them are things that we
would like to create historical archives of in case someone wanted to
recreate the tool's functionality. And in a perfect world we would
also be able to change our minds and start the tool back up if things
had not progressed to fully deleting the tool.
Andrew came up with a fairly elegant system to deal with this
complexity. He designed a series of processes which are each
responsible for a slice of the overall complexity. A process running
on the Grid controller is responsible for stopping running Grid Engine
jobs and changing the tool's quota so that no new jobs can be started.
A process running on the Crontab server archives the tool's crontab
configuration. A process running on the Kubernetes controller deletes
the tool's credentials for accessing the Kubernetes cluster, the
tool's namespace, and by extension removes all processes running in
the namespace. A process running on the NFS controller archives the
tool's $HOME directory contents and deletes the directory. It also
removes the tool from other LDAP membership lists (a tool can be a
co-maintainer of another tool) and deletes the tool's user and group
from the LDAP directory. A process archives ToolsDB tables. Another
process removes the tool's database credentials across the ToolsDB and
Wiki Replicas server pools. Many of these processes are implemented in
cloud/toolforge/disable-tool on Gerrit [3]. Others were added to
existing management controllers for creating Kubernetes and database
credentials. The processes all take cues from the LDAP directory and
tracking files in the tool's $HOME to create an eventually consistent,
decoupled collection of cleanup actions.
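As a rough illustration of that eventually consistent pattern, each cleanup process can follow the same loop: check a marker, do its own slice of the work, and record completion in the tool's $HOME. This is only a sketch, not the actual implementation (which lives in cloud/toolforge/disable-tool); the function name and marker file names here are invented:

```shell
# Hypothetical sketch of one decoupled cleanup step. The marker files
# (.disabled, .crontab-archived) and the function name are invented.
archive_crontab_step() {
  local tool_home="$1"
  # Only act on tools that have been flagged as disabled.
  [ -e "$tool_home/.disabled" ] || return 0
  # Skip work this process has already completed (idempotent, safe to re-run).
  [ -e "$tool_home/.crontab-archived" ] && return 0
  # Do this process's slice of the cleanup: save the crontab to $HOME.
  # (In production this would run with the tool's credentials and also
  # clear the live crontab.)
  crontab -l > "$tool_home/crontab.save" 2>/dev/null || true
  # Record completion so re-runs and other processes can observe progress.
  touch "$tool_home/.crontab-archived"
}
```

Because every step is gated on markers like these, the processes can run independently and repeatedly until the whole cleanup converges.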
We still have some work to do to update documentation on wikitech and
Phabricator so that folks know where to find the new buttons. If you
find documentation that needs to be updated before someone else gets
to it, please feel empowered to be [[WP:BOLD]] and update them.
[0]: https://phabricator.wikimedia.org/T133777
[1]: https://phabricator.wikimedia.org/T170355
[2]: https://phabricator.wikimedia.org/T285403
[3]: https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/toolforge/disable-tool/
[[WP:BOLD]]: https://en.wikipedia.org/wiki/Wikipedia:Be_bold
Bryan, on behalf of the Toolforge administration team
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
The shared NFS servers that back Toolforge have been running close to
full for a while. We are going to free up space by taking the following
steps:
- Remove all files ending with .log and .err that have not been modified
since November 1st, 2021 (e.g. find -name '*.log' -not -newermt "Nov 1,
2021" -exec rm {} \;)
- Truncate all files ending with .log and .err that are larger than 1GB
down to 1GB (e.g. find -name '*.log' -size +1G -exec truncate --size=1G
{} \;)
We'll be running those commands on Friday of this week. If you have any
log or err files of that form that need to NOT be truncated and/or
deleted, rename them now!
Also, please take a moment to run 'du' in your home and tool dirs and
delete any other files that you can live without.
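If you're not sure where the space is going, something like the following can help (a sketch using GNU du/find options; run it in your home or tool directory):

```shell
# Summarize disk usage of each top-level directory here, largest first.
du -h --max-depth=1 . 2>/dev/null | sort -rh | head -n 10

# List individual files larger than 100 MB under the current directory.
find . -type f -size +100M -exec ls -lh {} + 2>/dev/null
```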
Thank you!
-Andrew
As part of routine security maintenance, all Debian Bullseye VMs are due
for a reboot and kernel upgrade. I will be performing these reboots
early next week, either on Monday or Tuesday.
If you want to reboot hosts on your own time (rather than at a random
Andrew-selected time), feel free to reboot your own hosts before then.
-Andrew + the WMCS team
Debian Stretch's security support ends in mid 2022, and the Foundation's
OS policy already discourages use of existing Stretch machines. That
means that it's time for all project admins to start rebuilding your VMs
with Bullseye (or, if you must, Buster).
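If you're not sure which release a VM is on, /etc/os-release will tell you (VERSION_CODENAME is present on Debian 9 and later):

```shell
# Print this machine's Debian codename (e.g. stretch, buster, bullseye).
# VERSION_CODENAME is defined in /etc/os-release on Debian 9+.
[ -r /etc/os-release ] && . /etc/os-release
echo "${VERSION_CODENAME:-unknown}"
```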
Any webservices running in Kubernetes that were created in the last year
or two are most likely using Buster images already, so no action is needed
for those. Older Kubernetes jobs should be refreshed to use more modern
images whenever possible.
If you are still using the grid engine for webservices, we strongly
encourage you to migrate your jobs to Kubernetes. For other grid uses,
watch this space for future announcements about grid engine migration;
we don't yet have a solution prepared for that.
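For a simple webservice, the move is often just a restart on the other backend. This is a sketch, not a universal recipe: the runtime type (python3.7 below) is only an example, and the right type depends on your tool, so check the webservice documentation first. The guard makes the snippet a no-op outside Toolforge:

```shell
# The webservice CLI only exists on Toolforge bastions; guard so this
# sketch is a no-op elsewhere. python3.7 is an example runtime type.
if command -v webservice >/dev/null 2>&1; then
  # Stop the grid-hosted webservice for the current tool...
  webservice --backend=gridengine stop
  # ...and start it again on the Kubernetes backend.
  webservice --backend=kubernetes python3.7 start
fi
```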
Details about the what and why for this process can be found here:
https://wikitech.wikimedia.org/wiki/News/Stretch_deprecation
Here is the deprecation timeline:
March 2021: Stretch VM creation disabled in most projects
July 6, 2021: Active support of Stretch ends, Stretch moves into LTS
<- You are Here ->
January 1st, 2022: Stretch VM creation disabled in all projects,
deprecation nagging begins in earnest. Stretch alternatives will be
available for tool migration in Toolforge
May 1, 2022: All active Stretch VMs will be shut down (but not deleted)
by WMCS admins. This includes Toolforge grid exec nodes.
June 30, 2022: LTS support for Debian Stretch ends, all Stretch VMs will
be deleted by WMCS admins
Hello!
Earlier this year, WMCS initiated the process to migrate tools off the
grid[0].
We also published a series of blog posts explaining in more detail the
reasoning behind this action[1].
We encouraged maintainers to move to Kubernetes if they could but also made
available Debian Buster GridEngine for those tools who were blocked or
otherwise unable to migrate to Kubernetes at that time.
We are aware that not all workloads can easily move from the grid to
Kubernetes.[2]
For some of the current grid workflows, there may be no 1:1 functionality
match on Kubernetes.
Work is underway to address most of these issues[3].
We’re putting together a use case continuity table showing GridEngine
workloads and their equivalent Kubernetes workloads[4].
[image: case continuity.PNG]
To help track the specific migration work, we created a Phabricator
ticket (project tag: grid-engine-to-k8s-migration[5]) for each tool that is
currently running on GridEngine. With a ticket for each tool on GridEngine,
we hope to collect specific blocking issues and have the team work on
addressing them.
We encourage maintainers to reach out if you need help or find you are
blocked by missing features.
We noticed that, after receiving notifications for these tickets, some of
you wondered whether the grid is being shut down immediately.
This is not the case. We will work with tool maintainers to ensure all
tools safely move off the grid (or are safely shut down); only then will we
start looking at decommissioning the grid.
Apologies to those who felt spammed by the ticket creation process and got
worried about the future of their projects. We should have communicated
better around this process.
=== Way Forward ===
The working draft for GridEngine plans and timeline can be found here[6].
If you need further clarifications, reach out to us on the ticket for your
specific tool on Phabricator, or via any of our communication
channels[7].
Thanks!
----------
[0]:
https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/…
[1]: https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/
[2]:
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
GridEngine_plans_and_timeline#Use_case_continuity
[3]: https://phabricator.wikimedia.org/T194332
[4]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
[5]: https://phabricator.wikimedia.org/project/profile/6135/
[6]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation
[7]:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/About_Toolforge#Commun…
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
Hi there,
We are currently working on replacing older hardware servers with newer
ones, in particular those dedicated to cloud networking [0].
We have discovered a few shortcomings, mostly related to network
interface naming on the newer servers, the latest OpenStack version
behaving differently than earlier releases, and some base operating
system (Debian) bugs [1]. Some of these are hardware-dependent and
difficult to reproduce/anticipate in our staging environment.
The result is that we are having a more challenging and noisy migration
than we would like. We already had a few (brief) network outages trying
to introduce the new servers into service.
We'll try to keep things as stable as possible in the next few days
until the migration is completed, but we can't rule out further
(brief) network outages until we are safely on the other side of the
transition.
I'll send another note when this network maintenance is over.
regards.
[0] https://phabricator.wikimedia.org/T316284
[1] https://bugs.debian.org/989162
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation