I've merged (what I think are) the final patches required for using
custom/vanity domains with the Cloud VPS web proxy. Here is an example:
https://wmcs-proxy-test.taavivaananen.fi/
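For reference, a quick way to sanity-check a custom domain once it is
pointed at the proxy (generic DNS/HTTP tooling, not from the admin docs;
the hostname is just the example above):

  $ dig +short wmcs-proxy-test.taavivaananen.fi
  $ curl -sI https://wmcs-proxy-test.taavivaananen.fi/ | head -n 5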
Administrator documentation is available at [0], and [1] is the task
tracking the implementation of this feature.
[0]:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Web_proxy#Enable…
[1]: https://phabricator.wikimedia.org/T342398
I have not yet documented this in the user-facing docs, because we first
need to decide which projects can use this feature. Historically the use
of custom domains for Cloud VPS projects has been restricted by the fact
that they required a floating IPv4 address, and we don't have many of
those. My feeling (but I haven't checked) is that the vast majority of
granted requests from the time I've been here have been for affiliates
and for projects migrating to Cloud VPS from external hosting with an
existing domain.
Now that IPv4 scarcity is no longer a factor here, we could in theory
set up custom domains for everyone who wants one. Are we willing to do
that, or do we want to keep some requirements for having one? In my head
the biggest argument for encouraging/requiring use of *.wmcloud.org is
that it removes a major SPOF: an individual maintainer controlling a
vanity domain and then disappearing, leaving the project stuck.
Taavi
Hi there,
we are now tracking some parts of our Cloud VPS infra using OpenTofu.
We have a repository [0] and some docs on Wikitech [1].
As of this writing, we have support for a bunch of resources in tofu-infra, and
we consider it to be the source of truth for at least the following elements:
* Nova flavors
* Neutron networks, subnets, routers, router ports, and security groups
* OpenStack projects
* DNS zones and some DNS records
Extending coverage to more resource types is on the roadmap [2].
We are in a transition period: a number of resources have already been
migrated to tofu-infra, but others will be imported "as we go", because
importing everything in one go would be too much work.
That being said, if you see yourself wanting to create or modify any of the
resources mentioned above, you should do so via tofu-infra. Ask for help if
in doubt.
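As a rough sketch, the review loop with the OpenTofu CLI looks like this
(in tofu-infra the plan/apply steps may well be driven by CI rather than
run by hand):

  $ cd tofu-infra
  $ tofu init    # fetch providers and modules
  $ tofu plan    # review the proposed changes before merging
  $ tofu apply   # apply the reviewed plan
  # not-yet-tracked resources can be pulled under management with
  # 'tofu import <address> <id>' as they are migrated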
Be warned that some cookbooks, docs, or other bits of code may need updates.
Small regressions to some of our admin workflows are somewhat expected, as you
may be the first one to, e.g., create a new project or a new flavor using
tofu-infra.
Additionally, I have been conducting a few cleanups in codfw1dev [4], for stuff
like projects and security groups, with the goal of making this tofu-infra
transition a bit less confusing.
Also note that a refactor of the tofu-infra repo is incoming [3], although
that should not affect which resources we track, only how the code is
organized.
[0] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/
[1] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/OpenTofu
[2] https://phabricator.wikimedia.org/T370037
[3] https://phabricator.wikimedia.org/T375283
[4] https://phabricator.wikimedia.org/T375604
Hi there admins,
as part of the work to replace the VLAN network with a VXLAN-based one [0],
I have changed some Horizon settings [1] so that new VMs created via Horizon
will have their networking (addressing) configured from the VXLAN network.
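For example, to double-check which network a newly created VM ended up on
(standard OpenStack CLI; the VM name below is illustrative):

  $ openstack network list
  $ openstack server show my-test-vm -c addresses -c status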
As part of the VXLAN migration we will also try to introduce IPv6. A
dedicated Wikitech page has been created to track that work [2].
This only affects codfw1dev, for now. The eqiad1 deployment will follow when we
gain some additional confidence.
If you detect anything weird, please let me know.
regards.
[0] https://phabricator.wikimedia.org/T364725
[1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073163
[2] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/IPv6/initial_dep…
I will be on sabbatical from the end of next week (Saturday the
7th) until November 11th.
As always, the best way to find support for Cloud VPS questions is in
#wikimedia-cloud or #wikimedia-cloud-admin, but for more specifics you
can consult my continuity page at
https://office.wikimedia.org/wiki/User:Andrew_Bogott/continuity. In a
pinch you can also reach out to my manager, Joanna Borun.
In addition to some intensive leisure, I plan to spend some of my time
away volunteering with voter protection projects in the lead-up to the
US elections, which accounts for the very specific timing of my absence.
-Andrew
Hey folks,
In case you did not see the update from Tyler already [0], both Gerrit
and GitLab will stay around. The TL;DR is that there are a few
repositories that must stay in Gerrit (from our perspective, most
notably puppet.git), but for the rest of our repositories we're free
to choose which code host we want to use. Here's a quick proposal for
what to do:
Our Toolforge-related repositories are mostly on GitLab, and they make
heavy use of GitLab's CI features. I think keeping those there is the
best option for now, and we should move Striker and labs/toollabs.git
there for consistency.
The wmcs-cookbooks repo should stay in Gerrit. That repository is
primarily used by SREs in conjunction with the Puppet repository, which
is staying in Gerrit. Similarly, I think we should move the new Cloud
VPS tofu-infra repository to Gerrit, as it is also used for SRE
workflows, and the ability to merge individual patches in a stack is
useful there, just as it is for the Puppet repository.
For metricsinfra, we should either migrate the tofu-provisioning
repository from GitLab to Gerrit (which is my preference), or migrate
the prometheus-* repos from Gerrit to GitLab to keep everything
related to that project in one place.
Finally, I think we should move the few repositories we have
canonically on GitHub to GitLab.
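For the GitHub repositories the move could be as simple as a mirror push
(a sketch; the target path under gitlab.wikimedia.org is hypothetical):

  $ git clone --mirror https://github.com/wikimedia/some-repo.git
  $ cd some-repo.git
  $ git push --mirror https://gitlab.wikimedia.org/repos/cloud/some-repo.git
  # then archive the GitHub copy or turn it into a read-only mirror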
Thoughts? I'm happy to draft a formal decision request for my
proposals, although I'm hoping this is simple and uncontroversial
enough to not require one.
[0]: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/…
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
Hi there,
tomorrow, 2024-06-26 at 08:30Z, we will start enforcing new Kubernetes
security rules in Toolforge [0].
We have taken measures to eliminate any user impact, but since this is a
potentially sensitive change, I wanted to send a heads-up email.
In a nutshell, pod-related Kubernetes resources, like Deployment or CronJob,
need to have a new set of security-related attributes correctly specified.
This is because we are introducing Kyverno policies as a replacement for the
deprecated PodSecurityPolicies (PSP) [1].
The new Kyverno policies have already been deployed, but are in 'Audit'
mode. What we will be doing tomorrow is switching them to 'Enforce', which
is the final step in this migration before we can finally drop PSP [2].
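If you want to check ahead of time whether a given workload would be
affected, the Audit-mode policies already generate policy reports that can
be inspected with kubectl (assuming the standard Kyverno reporting setup;
the namespace below is just an example):

  $ kubectl get policyreport -n tool-some-tool
  $ kubectl describe policyreport -n tool-some-tool   # look for 'fail' results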
Please report [3] any disruption that you may see.
regards.
[0] https://phabricator.wikimedia.org/T368141
[1] https://phabricator.wikimedia.org/T279110
[2] https://phabricator.wikimedia.org/T364297
[3] https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_communication
dr0ptp4kt (Adam Baso, WMF Staff) requested [0] to be added to the
maintainers of the "admin" tool in Toolforge, in order to gather some
statistics.
This is a subset of the "Tool root" permissions [1] that are usually
assigned to users who need to do administrative work in Toolforge.
Given that Adam's needs are more limited, I don't think we need to grant
any permission beyond membership of the "admin" tool.
I'm following the Cloud Services Application Process [2] and sending
this email to communicate the change to other Cloud admins. The policy
recommends a one-week comment period, after which "anyone other than
the applicant may implement the rights change".
[0] https://phabricator.wikimedia.org/T364761
[1] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#What_makes_a_roo…
[2] https://wikitech.wikimedia.org/wiki/Help:Access_policies#Application_Process
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation
I wanted to share an interesting failure I just saw on the Toolforge
cluster. The order of events went roughly like this:
1. builds-api had a change merged that was never deployed to the live
clusters. That change only affected local development environments. I
assume that's the reason it was never deployed, although an
alternative is that the person merging the change forgot. This
published builds-api 0.0.131.
2. The Harbor expiration policy noticed that builds-api 0.0.131
existed, and pruned the images for 0.0.130.
3. The certificates used for communication between the API gateway
and builds-api got renewed by cert-manager, and this triggered an
automatic restart of the builds-api deployment.
4. The new builds-api pods failed to start, as the image they
referenced no longer existed.
Now, in this case Kubernetes worked as expected: it noticed that the
new deployment did not come up, stopped restarting any further pods,
and did not send any traffic to the single restarted pod.
However, the ticking time bomb of the expiring certificates remained,
as the API would go down once the old certs expired, and any node
restart would have risked taking the entire thing down.
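As an illustration, a spot check along these lines would have caught the
problem early (deployment/namespace names are illustrative, and the
registry check assumes a tool such as crane is available):

  # which image does the live deployment reference?
  $ IMAGE=$(kubectl -n builds-api get deployment builds-api \
        -o jsonpath='{.spec.template.spec.containers[0].image}')
  # does that tag still exist in the registry?
  $ crane manifest "$IMAGE" >/dev/null && echo present || echo missing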
I filed a few tasks, mostly about noticing these kinds of issues automatically:
* https://phabricator.wikimedia.org/T358908 Alert when
toolforge-deploy changes are not deployed
* https://phabricator.wikimedia.org/T358909 Alert when admin managed
pods are having issues
In addition, we should consider setting up explicit
PodDisruptionBudgets for the admin services we manage.
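A minimal sketch of what that could look like with plain kubectl (service
name and label selector are illustrative):

  $ kubectl -n builds-api create poddisruptionbudget builds-api-pdb \
        --selector=app.kubernetes.io/name=builds-api --min-available=1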
However, what I'm less certain about is how to prevent the missing image
in the first place:
* Can we store all release-tagged images indefinitely? How much
storage space would that take?
* If not, how can we prevent images that are still in use from
disappearing like that? How do we ensure that rollbacks will always
work as expected?
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
Hi there,
Last year, we started evaluating how we could refresh the way we relate to
(deploy, maintain, upgrade) our OpenStack deployment for Cloud VPS [0].
One of the most compelling options we found was to run OpenStack inside
Kubernetes, using an upstream project called openstack-helm.
But... what if we stopped doing OpenStack altogether?
To clarify, the base idea I had is:
* deploy Kubernetes to a bunch of hosts in one of our Wikimedia datacenters
** we know how to do it!
** this would be the base layer (undercloud, bedrock, whatever you want to
call it)
* deploy Ceph next to k8s (or maybe even inside it?)
** Ceph would remain the preferred network storage solution
* deploy some kind of k8s multiplexing tech (see the sketch after this list)
** example: https://www.vcluster.com/ but there could be others
** using this, create a dedicated k8s cluster for each project, for example
toolforge/toolsbeta/etc
* Inside this new VM-less Toolforge, we can retain pretty much the same
functionality as today:
** a container listening on 22/tcp with kubectl & the toolforge CLI installed
can be the login bastion
** the NFS server can run in a container, using Ceph
** ToolsDB can run in a container. Can't it? Or maybe we replace it with some
other k8s-native solution
* If we need any of the native OpenStack components, for example Keystone or
Swift, we can run them in a standalone fashion inside k8s.
* We already have some base infrastructure (and knowledge) that would support
this model: we have cloudlbs and cloudgw, we know how to do Ceph, etc.
* And finally, most important: the community. The main question could be:
** Is there any software running on Cloud VPS virtual machines that cannot run
in a container on Kubernetes?
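To make the multiplexing idea a bit more concrete, with vcluster the
per-project clusters could be created along these lines (a sketch only;
names and flags are illustrative, and other multiplexers would look
different):

  # create a virtual cluster for the 'toolsbeta' project in its own namespace
  $ vcluster create toolsbeta --namespace toolsbeta
  # run a command against the virtual cluster
  $ vcluster connect toolsbeta --namespace toolsbeta -- kubectl get namespaces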
I wanted to start this thread hoping to collect a list of use cases,
blockers, and strong opinions about why running OpenStack is important
(or not). I'm pretty sure I'm overlooking something important.
I plan to document all this on Wikitech and/or maybe Phabricator.
You may ask: why stop doing OpenStack at all? I will answer that in a
separate email, to keep this one as short as possible.
Looking forward to your counter-arguments.
Thanks!
regards.
[0]
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
There are now two new Toolforge Kubernetes workers in service,
tools-k8s-worker-nfs-1 and tools-k8s-worker-nfs-2. In addition to the new
naming scheme that will allow non-NFS workers in the future, these hosts
are also running Debian 12 (as opposed to Debian 10 on the existing nodes)
and are using containerd as the container runtime (the current nodes are
using Docker).
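You can see which runtime and OS image each node is on with a standard
kubectl query, for example:

  $ kubectl get nodes -o wide
  # the OS-IMAGE and CONTAINER-RUNTIME columns show Debian 12 + containerd
  # on the new nodes vs Debian 10 + Docker on the old ones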
If you see or hear about any strange issues with pods running on these new
nodes, please depool the affected node (`kubectl sudo drain $WORKER` on a Toolforge
bastion) and ping me on IRC or in the task (
https://phabricator.wikimedia.org/T284656).
If there are no major issues I will start replacing more of the older nodes
with these new nodes next week.
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation