I wanted to share an interesting failure I just saw on the Toolforge
cluster. The order of events went roughly like this:
1. builds-api had a change merged that was never deployed to the live
clusters. The change only affected local development environments, which
I assume is why it was never deployed, although it's also possible that
the person merging it simply forgot. Either way, the merge published
builds-api 0.0.131.
2. The Harbor expiration policy noticed that builds-api 0.0.131
existed, and pruned the images for 0.0.130.
3. The certificates used for communications between the API gateway
and builds-api got renewed by cert-manager, and this triggered an
automatic restart for the builds-api deployment.
4. The new builds-api pods failed to start, as the image they were
based on no longer existed.
Now, in this case Kubernetes worked as expected: it noticed that the
new deployment did not come up, stopped restarting any further pods,
and did not send any traffic to the single restarted pod.
However, the ticking time bomb of the expiring certificates remained,
as the API would have gone down once the old certs expired, and any
node restart would have risked taking the whole service down.
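For the record, a stuck rollout like this is easy to spot by hand with
something like the following (the deployment and namespace names here
are assumptions on my part):

$ kubectl sudo rollout status deployment/builds-api -n builds-api

The problem is that nothing was watching for it automatically, hence
the tasks below.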
I filed a few tasks, mostly about noticing these kinds of issues automatically:
* https://phabricator.wikimedia.org/T358908 Alert when
toolforge-deploy changes are not deployed
* https://phabricator.wikimedia.org/T358909 Alert when admin managed
pods are having issues
In addition, we should consider setting up explicit
PodDisruptionBudgets for the admin services we manage.
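For example, a minimal PodDisruptionBudget could be created with
something like this (the namespace and label selector here are
assumptions, not necessarily what builds-api actually uses):

$ kubectl create poddisruptionbudget builds-api \
      --namespace=builds-api \
      --selector=app.kubernetes.io/name=builds-api \
      --min-available=1

That would keep at least one builds-api pod up during voluntary
disruptions such as node drains.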
However, what I'm less certain about is how to prevent the image from
going missing in the first place:
* Can we store all release-tagged images indefinitely? How much
storage space would that take?
* If not, how can we prevent images that are still in use from just
disappearing like that? How do we ensure that rollbacks will always
work as expected?
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
Hi there,
Last year, we started evaluating how we could refresh the way we relate to
(deploy, maintain, upgrade) our Openstack deployment for Cloud VPS [0].
One of the most compelling options we found was to run Openstack inside
Kubernetes, using an upstream project called openstack-helm.
But... what if we stopped doing Openstack altogether?
To clarify, the base idea I had is:
* deploy Kubernetes to a bunch of hosts in one of our Wikimedia datacenters
** we know how to do it!
** this would be the base, undercloud, or bedrock, whatever.
* deploy ceph next to k8s (or maybe even inside it?)
** ceph would remain the preferred network storage solution
* deploy some kind of k8s multiplexing tech (see the sketch after this list)
** example: https://www.vcluster.com/ but there could be others
** using this, create a dedicated k8s cluster for each project, for example:
toolforge/toolsbeta/etc
* Inside this new VM-less toolforge, we can retain pretty much the same
functionalities as today:
** a container listening on 22/tcp with kubectl & toolforge cli installed can be
the login bastion
** NFS server can be run on a container, using ceph
** toolsDB can be run on a container, can't it? Or maybe replace it with some
other k8s-native solution
* If we need any of the native Openstack components, for example Keystone or
Swift, we could run them in a standalone fashion inside k8s.
* We already have some base infrastructure (and knowledge) that would support
this model. We have cloudlbs, cloudgw, we know how to do ceph, etc.
* And finally, and most importantly: the community. The main question could be:
** Is there any software running on Cloud VPS virtual machines that cannot run
on a container in kubernetes?
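To make the multiplexing idea a bit more concrete, here is a rough sketch
of what creating a per-project cluster could look like, assuming vcluster
were the tool we picked (the names are just examples and the exact flags
may vary between versions):

$ vcluster create toolsbeta --namespace toolsbeta
$ vcluster connect toolsbeta --namespace toolsbeta

Each project would get its own virtual control plane while sharing the
node pool of the underlying 'bedrock' cluster.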
I'm starting this email thread hoping to collect a list of use cases,
blockers, and strong opinions on why running Openstack is important (or not).
I'm pretty sure I'm overlooking some important thing.
I plan to document all this on wikitech, and/or maybe phabricator.
You may ask: why stop doing Openstack? I will answer that in a different
email to keep this one as short as possible.
Looking forward to your counter-arguments.
Thanks!
regards.
[0]
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
There are now two new Toolforge Kubernetes workers in service,
tools-k8s-worker-nfs-1 and tools-k8s-worker-nfs-2. In addition to the new
naming scheme that will allow non-NFS workers in the future, these hosts
are also running Debian 12 (as opposed to Debian 10 on the existing nodes)
and are using Containerd as the container runtime (the current nodes are
using Docker).
If you see or hear about any strange issues with pods running on these new
nodes, please depool the affected node (`kubectl sudo drain $WORKER` on a
Toolforge bastion) and ping me on IRC or on the task (
https://phabricator.wikimedia.org/T284656).
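For reference, a full depool/repool cycle might look something like the
following; the extra drain flags are standard kubectl options and are an
assumption on my part, they may or may not be needed depending on what is
running on the node:

$ kubectl sudo drain tools-k8s-worker-nfs-1 --ignore-daemonsets --delete-emptydir-data
$ kubectl sudo uncordon tools-k8s-worker-nfs-1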
If there are no major issues I will start replacing more of the older nodes
with these new nodes next week.
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
Hi all!
We recently released the public beta of the build service, which allows users
to build images from their language of choice's package manager settings (for
supported languages), with minimal configuration.
While we are still fixing bugs and corner cases, we should start thinking
about the next steps for Toolforge.
I have started a document with some ideas on how Toolforge could look in
the future for users, and I want to gather other ideas and feedback so
we can write down a high-level view to help define the next round of
improvements to focus on.
The document is here:
https://docs.google.com/document/d/1sqo6YGRn9u-S7V0y9m07cYKA84vQlKa-7_F8p7e…
And it already takes some of the ideas from previous documents/meetings/etc.
The timeline is to brainstorm there until Tuesday next week, and then in
the Toolforge Workgroup check-in we can start trying to coalesce the main
ideas to a workable plan.
Note that this is not meant to be set in stone; instead, we will review it a
couple of times a year to make sure that we are doing what we want to do,
and change course if needed.
Thanks!
Komla has started to disable the grid for tools that seem abandoned.
The workboard for this is at
https://phabricator.wikimedia.org/project/view/6135/ I believe that
tools are moving from 'Unreached Tool' to 'Disabled' as they are disabled.
== How to disable (or re-enable) a tool? ==
There are two scripts, each run in a different place. BOTH scripts
should be run for any tool. It should be safe to run any of these
commands multiple times without additional effect.
To disable the grid for a tool:
On tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/disable_grid_for_tool.py <toolname>
On tools-sgecron-2.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/stop_grid_for_tool.py <toolname>
To re-enable the grid for a tool:
On tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/disable_grid_for_tool.py --enable <toolname>
On tools-sgecron-2.tools.eqiad1.wikimedia.cloud
$ sudo /srv/disable-tool/stop_grid_for_tool.py --enable <toolname>
== Who can re-enable a tool, and when? ==
This shut-down phase has two goals:
1) Stop grid jobs that no one cares about
2) Provide a 'warning shot' to get attention from users or admins of a
tool who are relying on the tool but not responding to Komla's
correspondence.
Anyone with the necessary logins is encouraged to re-enable tools as
needed. Specifically:
- If you are contacted by a tool admin requesting restoration, feel free
to restore the tool according to the steps above. First, though, please
make sure the admin in question is aware that the grid is going away, and
make sure you (or, better yet, the admin) update the workboard task
associated with the tool, explaining how they plan to deal with the
coming shut-down and how they can be contacted in the future.
- If you are contacted by users of a tool requesting restoration, please
encourage them to reach out to the admin and have the admin request
restoration directly. If it's clear that a tool is needed but has no
reachable admin, add notes to the phab task accordingly, then move the
task into the 'Help wanted' column and add 'Abandoned:' to the task title.
== What is disabling/enabling? ==
The disable scripts do the following:
- set a grid quota that prevents future jobs from being scheduled
- move grid-specific service.manifest files to 'service.disabledmanifest'
- add a 'TOOL_DISABLED' file to the tool's home directory
- archive the crontab
- qdel all existing grid jobs
Enable scripts do this:
- remove restrictive grid quota, permitting jobs to be scheduled
- move 'service.disabledmanifest' back to service.manifest if no
service.manifest is currently present
- remove 'TOOL_DISABLED' file
- restore crontab
Note that the enable scripts do not actively start anything, so
non-webservice tools will likely require a manual start after enabling.
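For reference, bringing a tool back up after re-enabling might look
roughly like this; the exact invocations depend on the tool, these are
only the typical cases:

$ become <toolname>
$ webservice --backend=gridengine start   # typical for grid web tools
$ jstart -N <jobname> <command>           # typical for continuous jobs

When in doubt, ask the maintainer to use whatever start procedure they
normally use.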
We ran out of capacity on the Toolforge Kubernetes cluster yesterday,
seemingly due to a large number of tools migrating from the grid engine to
Kubernetes and a temporary decrease in capacity during a cluster-wide
reboot to recover from an NFS blip. I've provisioned some extra nodes to fix
the immediate issue, but the total CPU requests are still around 90% of the
total cluster capacity. (Note that this does not mean that we're using 90%
of CPU power available there, I'll come back to this in a bit.)
*In case the cluster starts acting up again*: follow
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolfor…
to provision more capacity. That runbook also has a link to the Grafana
dashboard for cluster capacity and instructions on what specific metrics to
worry about there, given that there are no alerts for it yet
<https://phabricator.wikimedia.org/T352581>.
As I said, we seem to be overprovisioning CPUs by a lot compared to actual
usage: `kubectl sudo top node` shows a majority of nodes being below 10% of
actual CPU utilization. So in the near term we should look at tweaking the
resource allocation logic, especially for web services.
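For a quick look at how far apart the requests and the actual usage are,
something like this from a bastion works (the 'Allocated resources'
section is part of the standard kubectl describe output):

$ kubectl sudo top node
$ kubectl sudo describe nodes | grep -A 5 'Allocated resources'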
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
I just restricted the Debian 10 Buster Cloud VPS image to the Toolforge
project. Puppet 7 packages aren't available for Buster, and it won't be
possible to enroll new Puppet 5 clients into the Puppet 7 based infrastructure.
I did, however, allow the Toolforge project to keep access to the image, as I
suspect we might need to provision new K8s nodes (and decommission some grid
nodes) using Buster, since running them on newer OS versions still needs some
more setup and testing.
Further details are on the task <https://phabricator.wikimedia.org/T351499>.
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
During our team meeting today we settled on the following timeline for
shutting down the grid. I'm emailing here for posterity; Komla will take
care of notifying our users.
* This week: prepare pywikibot migration solution
https://phabricator.wikimedia.org/T249787 (Taavi + Raymond)
* Next week: media blitz (Komla)
** emails
** wikitech site notice
** wikitech list email
** cloud-announce email
** wikitech talk pages
* Late Nov/Early Dec: command-line notification
** jsub and company are wrapped in some way to notify users about the
incoming shutoff (see the sketch below the timeline)
* Dec 14th: warning shot
** Tools owned by unresponsive admins or unreachable admins are stopped
** Tools run by unresponsive admins get their crontab entries commented
out, and warnings inserted
* Feb 14th: grid is stopped entirely
** All tools stopped, new submissions prevented
* Mar 14th: grid infra deleted
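As a very rough sketch of what the command-line notification could look
like (all paths, names and wording here are hypothetical, just to
illustrate the wrapping idea):

#!/bin/bash
# hypothetical wrapper installed in place of jsub, with the real binary moved aside
echo "WARNING: the Toolforge grid engine is being shut down; please migrate this tool to Kubernetes." >&2
exec /usr/bin/jsub.real "$@"

The same approach would apply to the rest of the grid submission commands.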
Hello Admins,
We are getting ready to announce to the community that the Build Service [0]
is now in open beta.
This follows the previous rounds of testing we carried out with selected
users.
Kindly take a look at the draft of the announcement email here[1].
Please review and provide any feedback you may have.
The announcement will be made early next week.
Thank you!
[0]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service
[1]: https://etherpad.wikimedia.org/p/build-service-open-beta
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
Hi,
we plan on moving the Cloud VPS restricted bastion [1] to a new VM
based on Bookworm. The hostname will remain the same
(restricted.bastion.wmcloud.org) but it will point to a new VM running
Bookworm [2].
This will happen later today. If you SSH to a Cloud VPS instance through
this bastion after the change, you will get a host key error and will have
to update the bastion's fingerprint in your "known_hosts" file.
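The quickest way to clear the old key is usually something like this
(assuming the default known_hosts location):

$ ssh-keygen -R restricted.bastion.wmcloud.org

The next connection will then prompt you to accept the new fingerprint.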
When the new server is live, I will update the fingerprints listed in
wikitech [3], so please verify they match what you see in your
terminal before accepting them. (Ideally this would be handled by
wmf-sre-laptop, see T329322.)
Thanks,
Francesco
[1] https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances#Setup
[2] https://phabricator.wikimedia.org/T340241#9202859
[3] https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/restricted.bastio…
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation