This meeting coalesced around a few major topics:
* Why not just bring-your-own-container?
** We turn out to have fairly different ideas about what user
experiences we want to support.
** General agreement that we could use more research and/or
documentation about current and future workflows (although some of that
already exists at
https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Our_audiences)
** This question hinges on how sophisticated our users are or aren't.
** Komla suggests that many public cloud platforms provide simple
deployment options that don't require users to understand container or
k8s details; there's general agreement that we would like to offer
something similar
** Andrew thinks that push-to-deploy should be our 'main priority' and
byo doesn't really address that.
* How/When will we kill off the grid engine?
** We tend to think of this as blocked by push-to-deploy, but perhaps we
should be open to other non-blocking options (e.g. the 'one big
container' migration path)
* What to do about the Stretch->Buster migration?
** Nicholas isn't convinced that we should migrate to Buster if we're
just going to kill the grid eventually anyway. Andrew and Arturo mostly
disagree.
** Probably the migration to Buster isn't a lot of hard mental work,
just building things and throwing pre-existing switches.
** Main blocker for this is allocating time and tasking someone with
doing the work
------ RAW ETHERPAD NOTES ------
== Toolforge next steps meeting 2021-12-14 ==
The approaching deadline (from an announcement email on 2021-09-27) is:
January 1st, 2022:
* Stretch alternatives will be available for tool migration in Toolforge
The proposed agenda follows.
=== Goals (Grid) ===
* Grid engine deprecation is blocked until users can customize container
builds. Buildpacks are intended to address this need.
* The blocker is providing equivalent runtime support for tools on the
Kubernetes cluster as the current grid engine cluster has. Kubernetes
containers are "thin" and tools will need the ability to add libraries
and binaries that are custom to them.
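For illustration of the kind of customization meant here: with buildpacks, a
tool would declare its extra dependencies in its own source tree and the
per-tool image build installs them (requirements.txt is the usual Python
convention; the library and version are just an example):
# The tool declares an extra library it needs; the image build (e.g. via
# buildpacks) installs it, rather than the library having to be
# preinstalled in a shared "fat" image.
echo "requests==2.26.0" >> requirements.txt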
=== Goals (Buildpacks) ===
* Why are we doing this?
** Allow users to customize k8s images; makes it easier for users to
migrate off of grid engine
* Arturo asks: why not just bring-your-own-container?
** Bryan answers: because bring-your-own-container means containers
without any Toolforge integration (e.g. no LDAP)
* What about putting that in a base layer?
** That's a build-your-own-container approach, which is what buildpacks
bring
** But allowing build-your-own today would be simpler for SREs than
buildpacks
** It would, however, add complexity for end users
* Building your own docker image adds more complexity for end users
** Buildpacks also limit what you can put in a docker container, so
potentially better security
* How do buildpacks improve security?
** If you let someone else build a docker image, it could run as a
different user and open security holes
** How could that be limited in k8s? The container runs as root on the host
** By building the container with buildpacks, we limit what goes into it
** k8s has full control over the runc runtime, so k8s could prevent user
spoofing (sketch below)
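For illustration, one way k8s can refuse to run a container as root no
matter how the image was built (a minimal sketch: the pod and image names
are made up, but runAsNonRoot and allowPrivilegeEscalation are standard
Kubernetes securityContext fields):
# Sketch only: reject a container that tries to start as UID 0, and block
# privilege escalation inside it.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-tool
spec:
  containers:
  - name: tool
    image: registry.example.org/example-tool:latest
    securityContext:
      runAsNonRoot: true
      allowPrivilegeEscalation: false
EOF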
* Public clouds have bring your own container, so it must be possible, right?
* Would like to see a list of prioritized user workflows
** What workflows are we enabling with buildpacks?
** Consider looking at
https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Our_audiences
* What's the long-term vision for TF?
** Push to a git repo, and it "just works" (sketch below)
** The Heroku workflow is built on buildpacks
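For illustration, the kind of user-facing flow being described; this is
entirely hypothetical -- the 'toolforge' git remote does not exist today,
and the comments describe what a build service would do behind the scenes:
# Hypothetical push-to-deploy flow (nothing here exists in Toolforge yet)
git add app.py requirements.txt
git commit -m "Update my tool"
git push toolforge main
# On push, the service would detect the runtime with buildpacks, build an
# image, and deploy it to Kubernetes with no further steps from the user.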
* What's the concern / fear about buildpacks?
** Complexity, can we find a way to simplify things?
** Lack of adoption (what if we end up the only ones using it)
* Bringing your own container isn't a regression, but it's not a
replacement for existing workflows (i.e., we can't kill the grid by
simply adding bring-your-own-container)
** Why?
** Workflow and brain state: users would no longer understand how to run
their workflows. Running a job is much simpler than building and
maintaining a container to run a job. Real risk of losing tools
* Can we assume complex tool authors are technically capable of building
a container?
** Complexity can be easily introduced in the grid; that doesn't mean
they could build and maintain a container
** `webservice --backend=gridengine start` supports PHP, Python, and
Perl fcgi out of the box.
* Google cloud comparison -- the industry seems to be moving towards
containers, but users don't need to build a container or even know that
they are running in a container. Build a flask app, run a command, GCE
builds a docker container and runs it
** all you need is for the buildpack to 'detect' what runtimes you need
(it can be a file, or checking for a packages.lock or whatever; sketch
below)
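For illustration, a minimal sketch of that 'detect' step in the style of a
classic Heroku-ish buildpack detect script; the file names checked and the
runtime labels are just examples, and the real Cloud Native Buildpacks API
differs in detail:
#!/usr/bin/env bash
# bin/detect <build-dir> -- decide whether this buildpack applies by
# looking for a well-known file in the tool's source tree.
BUILD_DIR="$1"
if [ -f "$BUILD_DIR/requirements.txt" ]; then
    echo "python"   # a Python buildpack would claim this tool
    exit 0
elif [ -f "$BUILD_DIR/package-lock.json" ]; then
    echo "nodejs"   # a Node.js buildpack would claim this tool
    exit 0
fi
exit 1              # not ours; let another buildpack try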
* So, David et al. -- do buildpacks seem complex?
** Buildpacks are easy-ish. Complexity is introduced by putting
buildpacks into Toolforge
** Tekton / buildpacks PoC is easy
* Build service
** admission controller, some custom resource definitions, hardware. Any
docker registry could be used? Harbor presented some issues when running
outside of k8s (example of the underlying build step below)
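For illustration, the same kind of buildpack build the service would run
in-cluster (via Tekton) can be tried locally with the standalone pack CLI;
the image, path, and builder names below are placeholders, not Toolforge
infrastructure:
# Example only: build a tool's source into an OCI image with buildpacks,
# no Dockerfile required; the image could then be pushed to any registry.
pack build registry.example.org/tool-mytool:latest \
    --path ./mytool \
    --builder paketobuildpacks/builder:base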
* How much engineering effort to bring that into production? Is it
possible to bring it online this year?
** Depends on resources, and on how the org handles things
** Yes, if we work on it? :-)
* Priority-wise, a push-to-deploy solution is the most important /
seamless thing we could work on
** Push to deploy in Toolforge, before grid deprecation even?
** Can't get rid of the grid until there's a reasonable replacement
** Why?
** Because we need to support people today
** What we actually need is a mixed runtime environment deployable on k8s
** For example, one giant container that contains everything a grid exec
node has
** So the grid migration isn't dependent on push to deploy, exactly
=== Goals (Buster) ===
* Have a plan for Buster on grid engine
* Decide whether or not to have a plan for Buster on k8s (and,
optionally, have a plan)
* Decide what timeline adjustment is realistic; pick someone to
communicate this delay to the users
==== Grid Engine ====
* Why do we hate the grid?
** When jobs are being run, there's little isolation. Uses
watcher-spawned processes + runtime hacks.
** No longer developed or supported by any upstreams
** In ~2018? we looked at modern "grids" that spawn things in a similar
way but can be better managed (e.g., Slurm)
** At the time, we decided k8s was the future and that Slurm or similar
wasn't a good idea
* Grid is important -- what could we do this year?
** Find someone to build a buster grid
** Make a mad dash at killing the grid asap.
* Giant container is unlocked now; image size was previously limited to 2G
** 3.1G container, needed on each k8s exec node
** Only need 1 copy on each node; shared between jobs
** When building new containers, be wary of variants when deploying new
containers. Could be N x 3.1G
** Large containers aren't performant on k8s
* Buster migration
** Most of the pieces are in place
** Build out nodes, switchover
* Why the timeline?
** Organizational timeline
** Grid is sensitive to DNS changes
<discuss>
<decide>
<who>
==== k8s ====
<discuss>
<decide>
<who>
==== timeline ====
<discuss>
<decide>
<who>
* dcaro: I propose delaying any non-urgent decision until we finish with
the decision-making / tech-discussion email thread
** I would argue that the Stretch deprecation is starting to become
urgent (although I don't have context about that email thread)
=== Current status, open projects ===
This may be the list of open projects:
* stretch-to-buster migration final push
** including but not limited to the grid
* grid engine deprecation timeline & plans
** draft here
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
* pending kubernetes upgrades (if any)
** yes, current clusters are at 1.20 and 1.23 is the latest release.
1.21 is simple (and I think it would be a good opportunity for someone
to learn the process) but 1.22 is complicated because it removes tons of
deprecated things
** 1.20 is still supported; the support policy covers the last 4 releases
* toolforge build service (buildpacks)
** currently we don't really have any visibility into package versions
and available (security) upgrades in our images, or into deployed image
versions - can we improve this with buildpack images or otherwise?
https://phabricator.wikimedia.org/T291908
* toolforge jobs framework
=== Next steps, prioritization ===
* what to do next, and who
=== Long term future ===
* Share your ideas of how Toolforge should look in 5 years from now
Howdy!
In response to the general consensus that we are doing too many things,
I'm trying to compile a list of things that we are doing (for various
definitions of 'doing').
I'd appreciate people reviewing this and adding things that I've
forgotten. Please treat this as more of a brainstorming exercise than an
official document -- add anything you can think of, and feel free to
rename my categories or re-categorize things according to your whim.
https://docs.google.com/document/d/1IKtYLYRvNOQraATWTNsAD2mF14XhxrOuzAxlIJx…
I confess that I'm doing this partly out of laziness (it feels easier
than reading 1000 phab tickets) -- I also have a nagging suspicion that
we are doing (and/or distracted by) some things that are largely
invisible on phabricator.
Thank you!
-A