This meeting coalesced around a few major topics:
* Why not just bring-your-own-container?
** We turn out to have fairly different ideas about what user
experiences we want to support.
** General agreement that we could use more research and/or
documentation about current and future workflows (although some of that
already exists at
https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Our_audiences)
** This question hinges on how sophisticated our users are or aren't.
** Komla suggests that many public cloud platforms provide simple
deployment options that don't require users to understand container or
k8s details; there's general agreement that we would like to offer
something similar
** Andrew thinks that push-to-deploy should be our 'main priority' and
byo doesn't really address that.
* How/When will we kill off the grid engine?
** We tend to think of this as blocked by push-to-deploy, but perhaps we
should be open to other non-blocking options (e.g. the 'one big
container' migration path)
* What to do about the Stretch->Buster migration?
** Nicholas isn't convinced that we should migrate to Buster if we're
just going to kill the grid eventually anyway. Andrew and Arturo mostly
disagree.
** Probably the migration to Buster isn't a lot of hard mental work,
just building things and throwing pre-existing switches.
** Main blocker for this is allocating time and tasking someone with
doing the work
------ RAW ETHERPAD NOTES ------
== Toolforge next steps meeting 2021-12-14 ==
The approaching deadline (from an announcement email on 2021-09-27) is:
January 1st, 2022:
* Stretch alternatives will be available for tool migration in Toolforge
The proposed agenda follows.
=== Goals (Grid) ===
* Grid engine deprecation is blocked until users can customize container
builds. Buildpacks are intended to address this need.
* The blocker is providing equivalent runtime support for tools on the
Kubernetes cluster as the current grid engine cluster has. Kubernetes
containers are "thin" and tools will need the ability to add libraries
and binaries that are custom to them.
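For illustration of the kind of customization meant here: with buildpacks, a
tool would declare its extra dependencies in its own source tree and the
per-tool image build installs them (requirements.txt is the usual Python
convention; the library and version are just an example):
# The tool declares an extra library it needs; the image build (e.g. via
# buildpacks) installs it, rather than the library having to be
# preinstalled in a shared "fat" image.
echo "requests==2.26.0" >> requirements.txt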
=== Goals (Buildpacks) ===
* Why are we doing this?
** Allow users to customize k8s images; makes it easier for users to
migrate off of grid engine
* Arturo asks: why not just bring-your-own-container?
** Bryan answers: because bring-your-own-container means containers
without any Toolforge integration (e.g. no LDAP)
* What about putting that in a base layer?
** That's a build-your-own-container approach, which is what buildpacks
bring
** But allowing build-your-own today would be simpler for SREs than
buildpacks
** It would, however, add complexity for end users
* Building your own docker image adds more complexity for end users
** Buildpacks also limit what you can put in a docker container, so
potentially better security
* How do buildpacks improve security?
** If you let someone else build a docker image, it could run as a
different user and open security holes
** How could that be limited in k8s? The container runs as root on the host
** By building the container with buildpacks, we limit what goes into it
** k8s has full control over the runc runtime, so k8s could prevent user
spoofing (sketch below)
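For illustration, one way k8s can refuse to run a container as root no
matter how the image was built (a minimal sketch: the pod and image names
are made up, but runAsNonRoot and allowPrivilegeEscalation are standard
Kubernetes securityContext fields):
# Sketch only: reject a container that tries to start as UID 0, and block
# privilege escalation inside it.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-tool
spec:
  containers:
  - name: tool
    image: registry.example.org/example-tool:latest
    securityContext:
      runAsNonRoot: true
      allowPrivilegeEscalation: false
EOF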
* Public clouds have bring your own container, so it must be possible, right?
* Would like to see a list of prioritized user workflows
** What workflows are we enabling with buildpacks?
** Consider looking at
https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Our_audiences
* What's the long-term vision for TF?
** Push to a git repo, and it "just works" (sketch below)
** The Heroku workflow is built on buildpacks
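For illustration, the kind of user-facing flow being described; this is
entirely hypothetical -- the 'toolforge' git remote does not exist today,
and the comments describe what a build service would do behind the scenes:
# Hypothetical push-to-deploy flow (nothing here exists in Toolforge yet)
git add app.py requirements.txt
git commit -m "Update my tool"
git push toolforge main
# On push, the service would detect the runtime with buildpacks, build an
# image, and deploy it to Kubernetes with no further steps from the user.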
* What's the concern / fear about buildpacks?
** Complexity, can we find a way to simplify things?
** Lack of adoption (what if we end up the only ones using it)
* Bringing your own container isn't a regression, but it's not a
replacement for existing workflows (i.e., we can't kill the grid by
simply adding bring-your-own-container)
** Why?
** Workflow and brain state: users would no longer understand how to run
their workflows. Running a job is much simpler than building and
maintaining a container to run a job. Real risk of losing tools
* Can we assume complex tool authors are technically capable of building
a container?
** Complexity can be easily introduced in the grid; that doesn't mean
they could build and maintain a container
** `webservice --backend=gridengine start` supports PHP, Python, and
Perl fcgi out of the box.
* Google cloud comparison -- the industry seems to be moving towards
containers, but users don't need to build a container or even know that
they are running in a container. Build a flask app, run a command, GCE
builds a docker container and runs it
** all you need is for the buildpack to 'detect' what runtimes you need
(it can be a file, or checking for a packages.lock or whatever; sketch
below)
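For illustration, a minimal sketch of that 'detect' step in the style of a
classic Heroku-ish buildpack detect script; the file names checked and the
runtime labels are just examples, and the real Cloud Native Buildpacks API
differs in detail:
#!/usr/bin/env bash
# bin/detect <build-dir> -- decide whether this buildpack applies by
# looking for a well-known file in the tool's source tree.
BUILD_DIR="$1"
if [ -f "$BUILD_DIR/requirements.txt" ]; then
    echo "python"   # a Python buildpack would claim this tool
    exit 0
elif [ -f "$BUILD_DIR/package-lock.json" ]; then
    echo "nodejs"   # a Node.js buildpack would claim this tool
    exit 0
fi
exit 1              # not ours; let another buildpack try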
* So, David et al. -- do buildpacks seem complex?
** Buildpacks are easy-ish. Complexity is introduced by putting
buildpacks into Toolforge
** Tekton / buildpacks PoC is easy
* Build service
** admission controller, some custom resource definitions, hardware. Any
docker registry could be used? Harbor presented some issues when running
outside of k8s (example of the underlying build step below)
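For illustration, the same kind of buildpack build the service would run
in-cluster (via Tekton) can be tried locally with the standalone pack CLI;
the image, path, and builder names below are placeholders, not Toolforge
infrastructure:
# Example only: build a tool's source into an OCI image with buildpacks,
# no Dockerfile required; the image could then be pushed to any registry.
pack build registry.example.org/tool-mytool:latest \
    --path ./mytool \
    --builder paketobuildpacks/builder:base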
* How much engineering effort to bring that into production? Is it
possible to bring it online this year?
** Depends on resources, and on how the org handles things
** Yes, if we work on it? :-)
* Priority-wise, a push-to-deploy solution is the most important /
seamless thing we could work on
** Push to deploy in Toolforge, before grid deprecation even?
** Can't get rid of the grid until there's a reasonable replacement
** Why?
** Because we need to support people today
** What we actually need is a mixed runtime environment deployable on k8s
** For example, one giant container that contains everything a grid exec
node has
** So the grid migration isn't dependent on push to deploy, exactly
=== Goals (Buster) ===
* Have a plan for Buster on grid engine
* Decide whether or not to have a plan for Buster on k8s (and,
optionally, have a plan)
* Decide what timeline adjustment is realistic; pick someone to
communicate this delay to the users
==== Grid Engine ====
* Why do we hate the grid?
** When jobs are being run, there's little isolation. Uses
watcher-spawned processes + runtime hacks.
** No longer developed or supported by any upstreams
** In ~2018? we looked at modern "grids" that spawn things in a similar
way but can be better managed (e.g., Slurm)
** At the time, we decided k8s was the future and that Slurm or similar
wasn't a good idea
* Grid is important -- what could we do this year?
** Find someone to build a buster grid
** Make a mad dash at killing the grid asap.
* Giant container is unlocked now; image size was previously limited to 2G
** 3.1G container, needed on each k8s exec node
** Only need 1 copy on each node; shared between jobs
** When building new containers, be wary of variants when deploying new
containers. Could be N x 3.1G
** Large containers aren't performant on k8s
* Buster migration
** Most of the pieces are in place
** Build out nodes, switchover
* Why the timeline?
** Organizational timeline
** Grid is sensitive to DNS changes
<discuss>
<decide>
<who>
==== k8s ====
<discuss>
<decide>
<who>
==== timeline ====
<discuss>
<decide>
<who>
* dcaro: I propose delaying any non-urgent decision until we finish with
the decision-making / tech-discussion email thread
** I would argue that the Stretch deprecation is starting to become
urgent (although I don't have context about that email thread)
=== Current status, open projects ===
This may be the list of open projects:
* stretch-to-buster migration final push
** including but not limited to the grid
* grid engine deprecation timeline & plans
** draft here
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
* pending kubernetes upgrades (if any)
** yes, current clusters are at 1.20 and 1.23 is the latest release.
1.21 is simple (and I think it would be a good opportunity for someone
to learn the process) but 1.22 is complicated because it removes tons of
deprecated things
** 1.20 is still supported; the support policy covers the last 4 releases
* toolforge build service (buildpacks)
** currently we don't really have any visibility into package versions
and available (security) upgrades in our images, or into deployed image
versions - can we improve this with buildpack images or otherwise?
https://phabricator.wikimedia.org/T291908
* toolforge jobs framework
=== Next steps, prioritization ===
* what to do next, and who
=== Long term future ===
* Share your ideas of how Toolforge should look in 5 years from now
Howdy!
In response to the general consensus that we are doing too many things,
I'm trying to compile a list of things that we are doing (for various
definitions of 'doing').
I'd appreciate people reviewing this and adding things that I've
forgotten. Please treat this as more of a brainstorming exercise than an
official document -- add anything you can think of, and feel free to
rename my categories or re-categorize things according to your whim.
https://docs.google.com/document/d/1IKtYLYRvNOQraATWTNsAD2mF14XhxrOuzAxlIJx…
I confess that I'm doing this partly out of laziness (it feels easier
than reading 1000 phab tickets) -- I also have a nagging suspicion that
we are doing (and/or distracted by) some things that are largely
invisible on phabricator.
Thank you!
-A