This meeting coalesced around a few major topics:
* Why not just bring-your-own-container?
** We turn out to have fairly different ideas about what user experiences we want to support.
** General agreement that we could use more research and/or documentation about current and future workflows (although some of that already exists at https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Our_audiences)
** This question hinges on how sophisticated our users are or aren't.
** Komla suggests that many public cloud platforms provide simple deployment flows that don't require users to understand container or k8s details; there's general agreement that we would like to offer the same.
** Andrew thinks that push-to-deploy should be our 'main priority' and bring-your-own-container doesn't really address that.
* How/When will we kill off the grid engine?
** We tend to think of this as blocked by push-to-deploy, but perhaps we should be open to other non-blocking options (e.g. the 'one big container' migration path)
* What to do about the Stretch->Buster migration?
** Nicholas isn't convinced that we should migrate to Buster if we're just going to kill the grid eventually anyway. Andrew and Arturo mostly disagree.
** Probably the migration to Buster isn't a lot of hard mental work, just building things and throwing pre-existing switches.
** Main blocker for this is allocating time and tasking someone with doing the work
------ RAW ETHERPAD NOTES ------
== Toolforge next steps meeting 2021-12-14 ==
The approaching deadline (from an announcement email on 2021-09-27) is:
* January 1st, 2022: Stretch alternatives will be available for tool migration in Toolforge
The proposed agenda follows.
=== Goals (Grid) ===
* Grid engine deprecation is blocked until users can customize container builds. Buildpacks are intended to address this need.
* The blocker is providing runtime support for tools on the Kubernetes cluster equivalent to what the current grid engine cluster has. Kubernetes containers are "thin" and tools will need the ability to add libraries and binaries that are custom to them.
=== Goals (Buildpacks) ===
** Why are we doing this?
*** Allow users to customize k8s images
*** Make it easier for users to migrate off of grid engine
* Arturo asks: why not just bring-your-own-container?
** Bryan answers: because bring-your-own-container means containers without any Toolforge integration (e.g. no LDAP)
* What about putting that in a base layer?
** That's a build-your-own-container approach, which is what buildpacks are bringing
** But allowing build-your-own today is simpler for SREs than buildpacks
** It adds complexity for end users, though
* Building your own docker image adds more complexity for end users
** Buildpacks also limit what you can put in a docker container, so potentially better security
* How do buildpacks improve security? (see the sketch below)
** If you let someone else build a docker image, it could run as a different user and open security holes
** How could it be limited in k8s? The container runs as root on the host
** By building the container with buildpacks, we limit it
** k8s has full control over the runc runtime, so k8s could prevent user spoofing
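To make the "prevent user spoofing" point concrete, here is a minimal sketch of the kind of guardrail k8s offers regardless of how the image was built; the pod name, image, and uid are hypothetical placeholders, not our actual configuration:
```
# Sketch: ask k8s to refuse to run this container as root, whatever the
# image's Dockerfile declared. Pod name, image, and uid are hypothetical.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-tool
spec:
  containers:
  - name: tool
    image: docker-registry.example.org/example-tool:latest
    securityContext:
      runAsNonRoot: true            # kubelet refuses to start the container as uid 0
      runAsUser: 52503              # pin to the tool's own uid
      allowPrivilegeEscalation: false
EOF
```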
* Public clouds have bring-your-own-container, so it must be possible, right?
* Would like to see a list of prioritized user workflows
** What workflows are we enabling with buildpacks?
** Consider looking at https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Our_audiences
* What's the long-term vision for Toolforge?
** Push to a git repo, and it "just works"
** The Heroku workflow is built on buildpacks
* What's the concern / fear about buildpacks?
** Complexity; can we find a way to simplify things?
** Lack of adoption (what if we end up the only ones using it?)
* Bringing your own container isn't a regression, but it's not a replacement for existing workflows (i.e. we can't kill the grid by simply adding bring-your-own-container)
** Why?
** Workflow and brain-state. Users would no longer understand how to run their workflows. Running a job is much simpler than building and maintaining a container to run a job. Real risk of losing tools.
* Can we assume complex tool authors are technically capable of building a container?
** Complexity can be easily introduced on the grid; that doesn't mean those authors could build and maintain a container
** `webservice --backend=gridengine start` supports PHP, Python, and Perl fcgi out of the box (see the comparison sketch below)
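For context, a sketch of what the two invocations look like; the exact runtime argument for the k8s backend is from memory and may differ:
```
# Grid: the backend picks a PHP/Python/Perl fcgi runner for the tool automatically.
webservice --backend=gridengine start

# k8s: the user must additionally name a runtime type, e.g.:
webservice --backend=kubernetes python3.7 start
```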
* Google cloud comparison -- the industry seems to be moving towards containers, but users don't need to build a container or even know that they are running in one. Build a flask app, run a command, GCE builds a docker container and runs it
** All you need is for the buildpack to 'detect' what runtimes you need (it can be a file, or checking for a packages.lock or whatever); a sketch of such a detect script follows below
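A minimal sketch of what 'detect' means in the Cloud Native Buildpacks model (the bin/detect contract is from the CNB spec; the Python buildpack itself is hypothetical):
```
#!/usr/bin/env bash
# bin/detect -- detection script of a hypothetical Python buildpack.
# CNB spec: exit 0 to participate in the build, non-zero to pass.
if [[ -f requirements.txt || -f setup.py ]]; then
    exit 0    # looks like a Python project; this buildpack opts in
fi
exit 100      # not ours; let other buildpacks try
```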
* So David et al. -- do buildpacks seem complex?
** Buildpacks are easy-ish. Complexity is introduced by putting buildpacks into Toolforge
** The Tekton / buildpacks PoC is easy
* Build service
** Admission controller, some custom resource definitions, hardware. Any docker registry could be used? Harbor presented some issues in running outside of k8s
* How much engineering effort to bring that to prod? Is it possible to bring it online this year?
** Depends on resources, and how the org handles things
** Yes, if we work on it? :-)
* Priority-wise, a push-to-deploy solution is the most important / seamless thing we could work on
** Push-to-deploy Toolforge, before grid deprecation even?
** We can't get rid of the grid until there's a reasonable replacement
** Why?
** Because we need to support people today
** We actually need a mixed runtime environment deployable on k8s
* For example, 1 giant container that contains everything from a grid exec node
** So the grid migration isn't dependent on push-to-deploy exactly (see the sketch below)
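A sketch of what that migration path could look like for a tool author, assuming the jobs framework mentioned later in these notes; the job name, script, and image name are illustrative:
```
# Before, on the grid:
jsub -N daily-report ./report.sh

# After, on k8s via the jobs framework, pointing at the hypothetical
# giant image that mirrors a grid exec node:
toolforge-jobs run daily-report --command ./report.sh --image one-big-container
```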
=== Goals (Buster) ===
* Have a plan for Buster on grid engine
* Decide whether or not to have a plan for Buster on k8s (and, optionally, have a plan)
* Decide what timeline adjustment is realistic; pick someone to communicate this delay to the users
==== Grid Engine ====
* Why do we hate the grid?
** When jobs are being run, there's little isolation. Uses watcher-spawned processes + runtime hacks.
** No longer developed or supported by any upstream
** In ~2018? we looked at modern "grids" that spawned things similarly, but could be better managed (e.g. Slurm)
** At the time, we decided k8s was the future, and that Slurm or similar wasn't a good idea
* The grid is important -- what could we do this year?
** Find someone to build a Buster grid
** Make a mad dash at killing the grid ASAP
* The giant container approach is unlocked now; images were previously limited to 2G
** A 3.1G container, needed on each k8s exec node
** Only need 1 copy on each node; shared between jobs
** When building new containers, be wary of variants. Could be N x 3.1G per node (e.g. 5 variants would be ~15.5G)
** Large containers aren't performant on k8s
* Buster migration
** Most of the pieces are in place
** Build out nodes, switchover
* Why the timeline?
** Organizational timeline
** The grid is sensitive to DNS changes
==== k8s ====
==== timeline ====
* dcaro: I propose delaying any non-urgent decision until we finish with the decision-making / tech discussion email thread
** I would argue that the Stretch deprecation is starting to become urgent (although I don't have context about that email thread)
=== Current status, open projects ===
The following may be the list of open projects:
* stretch-to-buster migration final push
** including but not limited to the grid
* grid engine deprecation timeline & plans
** draft here: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhancemen...
* pending kubernetes upgrades (if any)
** yes, current clusters are at 1.20 and 1.23 is the latest release. 1.21 is simple (and I think it would be a good opportunity for someone to learn the process), but 1.22 is complicated because it removes tons of deprecated things (see the upgrade sketch below)
** 1.20 is still supported; the support policy covers the last 4 releases
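For reference, the upgrade has to walk one minor version at a time; a sketch assuming a kubeadm-managed cluster (versions and node name illustrative):
```
# On the first control-plane node:
kubeadm upgrade plan
kubeadm upgrade apply v1.21.14

# Then for each node: drain, upgrade the kubelet, uncordon.
kubectl drain <node> --ignore-daemonsets
apt-get install -y kubelet=1.21.14-00
systemctl restart kubelet
kubectl uncordon <node>
```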
* toolforge build service (buildpacks)
** currently we don't really have any visibility into package versions and available (security) upgrades in our images, or into deployed image versions - can we improve this with buildpack images or otherwise? https://phabricator.wikimedia.org/T291908
* toolforge jobs framework
=== Next steps, prioritization ===
* what to do next, and who
=== Long term future ===
* Share your ideas of how Toolforge should look 5 years from now
Hi there!
(explicit CC to Komla in case he is not included in the other aliases)
I would like to propose that we restart this conversation. I'm writing a blog post about Toolforge's future and just remembered that we left some unfinished conversations here.
See comments inline for the 3 main topics we discussed last time.
On 12/14/21 18:18, Andrew Bogott wrote:
This meeting coalesced around a few major topics:
- Why not just bring-your-own-container?
** We turn out to have fairly different ideas about what user experiences we want to support.
** General agreement that we could use more research and/or documentation about current and future workflows (although some of that already exists at https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Our_audiences)
** This question hinges on how sophisticated our users are or aren't.
** Komla suggests that many public cloud platforms provide simple deployment flows that don't require users to understand container or k8s details; there's general agreement that we would like to offer the same.
** Andrew thinks that push-to-deploy should be our 'main priority' and bring-your-own-container doesn't really address that.
I still have some doubts about this point. I have the feeling we're missing an opportunity here.
- How/When will we kill off the grid engine?
** We tend to think of this as blocked by push-to-deploy, but perhaps we should be open to other non-blocking options (e.g. the 'one big container' migration path)
This could greatly benefit from the previous point.
- What to do about the Stretch->Buster migration?
** Nicholas isn't convinced that we should migrate to Buster if we're just going to kill the grid eventually anyway. Andrew and Arturo mostly disagree.
** Probably the migration to Buster isn't a lot of hard mental work, just building things and throwing pre-existing switches.
** Main blocker for this is allocating time and tasking someone with doing the work
Congratulations, this point is now solved!
I think this could greatly benefit from the decision-making process + tech discussion (still in the works): https://docs.google.com/document/d/1x8OQdYud1ruhRLw8tiKuE9gozycfHO7lkMp81WYC...
Even if it's just for its more asynchronous nature.
On 3/1/22 12:27, David Caro wrote:
I think this could greatly benefit from the decision-making process + tech discussion (still in the works): https://docs.google.com/document/d/1x8OQdYud1ruhRLw8tiKuE9gozycfHO7lkMp81WYC...
Even if it's just for its more asynchronous nature.
I've read the document and I think the next step was to create a phab task like this:
https://phabricator.wikimedia.org/T302863