Thank you for starting the discussion! I'll start by saying that while I'm not convinced by the arguments for moving everything from OpenStack to containers, I tend to agree that OpenStack has some features we should reconsider offering at all. For example, my current feeling is that the half-baked PostgreSQL support in Trove is a net negative, considering incidents like [0] that take a significant amount of admin time to troubleshoot and resolve and, as a result, cause quite a bit of downtime for our users.
[0]: https://phabricator.wikimedia.org/T355138
First, I am highly sceptical of the Kubernetes cluster-in-cluster solutions you mention. As far as we as infrastructure operators should be concerned, the ability to run arbitrary workloads in a Kubernetes cluster is equivalent to full root on all of the worker nodes. (The most obvious example is running a pod as root that mounts /etc/passwd, or a similarly privileged file, read-write as a hostPath, but that's far from the only way to break out of the pod sandbox.) We solve this in Toolforge with PSPs that severely restrict the configurations of pods that are allowed to run. I am having a hard time imagining how a shared cluster without those Toolforge-level restrictions could preserve the strong tenant isolation that I, at least, consider a hard requirement for any of our offerings.
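To illustrate the kind of breakout I mean: on a cluster without PSP-style admission control, a minimal pod spec along these lines (names and image are hypothetical, just for the example) would give the pod read-write access to the host's /etc:

```yaml
# Hypothetical example only -- this is exactly the shape of pod that
# admission control must reject on a multi-tenant cluster.
apiVersion: v1
kind: Pod
metadata:
  name: breakout-demo        # made-up name for illustration
spec:
  containers:
  - name: shell
    image: debian:bookworm   # any image with a shell will do
    command: ["sleep", "infinity"]
    securityContext:
      runAsUser: 0           # run as root inside the container
    volumeMounts:
    - name: host-etc
      mountPath: /host-etc   # host's /etc, writable from the pod
  volumes:
  - name: host-etc
    hostPath:
      path: /etc
      type: Directory
```

From there, editing /host-etc/passwd (or dropping a key into a root user's authorized_keys via a similar mount) is root on the worker node, which is why PSPs in Toolforge forbid hostPath volumes and root pods outright.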
From my experience maintaining the K8s cluster in Toolforge, I can say that upgrading that cluster is consistently one of the most stressful things I do around here, and that's with the majority of the workload being managed by us. Yes, the process is rather simple now, but compared to OpenStack, Kubernetes plays a very active role in the continued running of already-existing workloads. The blast radius of a Kubernetes upgrade going wrong is much larger than that of, say, a Nova upgrade going wrong, which would affect starting new VMs and stopping existing ones but generally would not touch VMs already running in libvirt. So far there's been only one (if I remember correctly) upgrade-related major service degradation that made it to live Toolforge[1], but I would credit that more to the slow pace of our upgrades and the countless hours I've spent reading changelogs and docs and testing the upgrades locally and in toolsbeta. And as a reminder, we're currently about two years behind Kubernetes releases and don't seem to be catching up, even after upstream reduced from four 1.x releases a year to three.
[1]: https://phabricator.wikimedia.org/T308189
The fact that Toolforge K8s runs in VMs is very helpful due to the flexibility it gives: if I want to test a particular worker configuration, I can currently just spin up a VM instead of having to figure out where to find hardware for it (as I'm currently having to do for the OVS tests). Also, many projects are just too small to need dedicated hardware. For example, LibUp[2], a project I recently became involved with and that doesn't neatly fit into Toolforge at the moment, currently uses about 10 vCPUs and about 20 GiB of RAM; we can stuff maybe 10-20 projects of that size onto 1U of rack space on a modern high-spec virtualization node, so giving it dedicated hardware to run a K8s cluster on top of just does not make sense. And yes, you could replace OpenStack with something like Ganeti with much less management overhead, but my gut feeling is that Nova and Neutron and the other 'core' services involved in running traditional VMs are relatively well-behaved compared to the newer stuff, and they also give us useful features (like multi-tenancy, and instance isolation from the management/wikiland networks) that we'd have to invent ourselves if we ditched OpenStack.
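The packing argument is easy to sketch with round numbers. The node specs below are my own assumed figures for a modern high-spec 1U virtualization node, not actual WMCS hardware specs:

```python
# Back-of-the-envelope estimate: how many LibUp-sized projects fit on
# one virtualization node? Node specs are assumed round numbers.
NODE_VCPUS = 128      # assumed: dual 32-core CPUs with SMT
NODE_RAM_GIB = 512    # assumed installed memory

PROJECT_VCPUS = 10    # LibUp's approximate current usage
PROJECT_RAM_GIB = 20

# The binding resource is whichever runs out first.
fit = min(NODE_VCPUS // PROJECT_VCPUS, NODE_RAM_GIB // PROJECT_RAM_GIB)
print(f"~{fit} projects per 1U node, before any CPU overcommit")
```

With any realistic CPU overcommit ratio the number lands in the 10-20 range mentioned above, which is the point: dedicating a whole node (let alone several, for a K8s cluster) to one such project wastes most of it.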
[2]: https://www.mediawiki.org/wiki/LibUp
And finally, I disagree with the statement that maintaining a Linux server is more difficult than running something in Kubernetes (even if someone else maintains the cluster itself). At least in my mind, a modern Kubernetes deployment has a million more moving parts than a simple Linux server that our users can SSH into and apt-get install a web server on to run their app.
Taavi
On Thu, Feb 29, 2024 at 7:12 PM Arturo Borrero Gonzalez aborrero@wikimedia.org wrote:
Hi there,
Last year, we started evaluating how we could refresh the way we relate to (deploy, maintain, upgrade) our Openstack deployment for Cloud VPS [0].
One of the most compelling options we found was to run Openstack inside Kubernetes, using an upstream project called openstack-helm.
But... What if we stopped doing Openstack at all?
To clarify, the base idea I had is:
- deploy Kubernetes to a bunch of hosts in one of our Wikimedia datacenters
** we know how to do it!
** this would be the base, undercloud, or bedrock, whatever.
- deploy ceph next to k8s (maybe, inside even?)
** ceph would remain the preferred network storage solution
- deploy some kind of k8s multiplexing tech
** example: https://www.vcluster.com/ but there could be others
** using this, create a dedicated k8s cluster for each project, for example: toolforge/toolsbeta/etc
- Inside this new VM-less toolforge, we can retain pretty much the same functionalities as today:
** a container listening on 22/tcp with kubectl & toolforge cli installed can be the login bastion
** NFS server can be run in a container, using ceph
** toolsDB can be run in a container. Can't it? Or maybe replace it with some other k8s-native solution
- If we need any of the native openstack components, for example Keystone or Swift, we may run them in a standalone fashion inside k8s.
- We already have some base infrastructure (and knowledge) that would support this model. We have cloudlbs, cloudgw, we know how to do ceph, etc.
- And finally, and most important: the community. The main question could be:
** Is there any software running on Cloud VPS virtual machines that cannot run on a container in kubernetes?
I wanted to start this email hoping that I would collect a list of use cases, blockers, and strong opinions about why running Openstack is important (or not). I'm pretty sure I'm overlooking some important thing.
I plan to document all this on wikitech, and/or maybe phabricator.
You may ask: and why stop doing openstack? I will answer that in a different email to keep this one as short as possible.
Looking forward to your counter-arguments. Thanks!
regards.
[0] https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhancemen...