Hi,
the WMCS team just had a meeting dedicated to tofu-infra [0], workflows, scope,
roadmaps and use cases.
This is a summary.
* The current state and short/mid-term roadmaps were shared; these include
refactoring the tofu-infra repo and adding support for more resource types,
for example quotas (see the sketch after this list).
* It was mentioned that user assignment is perhaps not well tracked in
tofu-infra, because that list can be modified at any time by end users.
* We more or less agreed that the scope of the tofu-infra repository [1] is to
track state for admin-controlled openstack resources. User-managed resources
are explicitly out of the scope of tofu-infra. Tracking Toolforge resources
(k8s VMs and such) is also out of scope for the tofu-infra repository.
* There is some controversy regarding tracking project resources, and the
project definition itself. There is a real concern with project deletion,
because it would leave orphaned resources behind.
* Because of the above, we have been discussing _not_ tracking project
definitions in tofu-infra at all, and instead automating project deletion via
a cookbook that takes care of removing potentially orphaned resources.
* Obviously, not tracking projects comes with the downside of... well, not
having gitops for project definitions, which is something we kind of want to
have.
* We have been discussing potential solutions, including:
** extending wmfkeystonehook to prevent deleting projects if they have resources
** expand our suite of 'leak detector' scripts to report orphaned resources
** having some kind of background daemon automatically cleaning up orphaned
resources
* A firm decision on what to do with project definitions has not been made,
which means we keep the current status quo for now (they are tracked in
tofu-infra).
* We have been discussing ideas and options for integrating tofu-infra with
the cookbooks, with a variety of opinions on whether that is desirable, and
on whether tofu-infra is ready to have automation built on top of it.
* The use case (or desire) of enabling our clinic-duty workflows for
non-technical, non-SRE folks in the organization was mentioned. While the use
case feels right, some details need clarification, because these folks may
not have the required access level after all, for example to run cookbooks.
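As a rough illustration of the quota support mentioned in the first bullet
(a purely hypothetical sketch, not the actual repo layout; the project name
and values are made up), using the standard terraform-provider-openstack
quota resources:

  # Hypothetical sketch: a project and its compute quota in tofu-infra.
  resource "openstack_identity_project_v3" "toolsbeta" {
    name = "toolsbeta"
  }

  resource "openstack_compute_quotaset_v2" "toolsbeta" {
    project_id = openstack_identity_project_v3.toolsbeta.id
    instances  = 16
    cores      = 64
    ram        = 131072 # MB
  }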
Finally, it was agreed that further discussions should happen on this topic soon.
regards.
[0] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/OpenTofu
[1] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/
Hi,
here is what I believe to be the current status of the 'wikimedia
multi-datacenter cloud services' topic.
In 2023 there was an attempt to create a new openstack deployment in codfw.
This new deployment would have used a kubernetes undercloud instead of running
openstack directly on hardware [0]. We went as far as procuring hardware for
it [1]; this hardware would have been for the undercloud Kubernetes.
Later, that project was cancelled, and it has been proposed to repurpose the
hardware [2]. As of this writing, WMCS engineers are no longer actively
working on (or thinking about) expanding into another datacenter.
In the past, some of the major questions related to expanding into more DCs have
been related to:
* hardware budget
* engineering time and team roadmap
* services and product roadmap
* some implementation details that affect all of the above
Regarding hardware budget:
* A buildout of a cloud needs to be associated with a significant budget
allocation. Racks, switches, servers, storage, etc.
* Additionally, there are concerns about rack space availability for an
increased WMCS footprint in any datacenter, beyond a few servers.
Regarding engineering time and team roadmap:
* the WMCS team roadmap (how we use our engineering time) does not currently
contain any multi-datacenter work. This is not on the radar for the short/mid
term.
* needless to say, working on any multi-datacenter implementation requires
significant engineering time, and most likely a multi-year commitment in
terms of team goals.
* additionally, this may require increased involvement and coordination with
other teams: DCops, NetOps, Data Platform Engineering, Data Persistence, etc.
Services and product roadmap:
* The primary goal for any multi-datacenter setting is to offer increased
availability for cloud services
* It is not clear that the current availability levels (single DC) are
inadequate. It is not clear that this is the improvement our services need
right now, in a way that should be prioritized over other efforts.
* Therefore, the services and product roadmap does not currently reflect any
multi-datacenter work.
Some implementation details that affect all of the above:
* Cloud VPS and Toolforge don't have a definition of how they should be offered
to clients in a multi-datacenter fashion. The same applies to other services,
like wiki-replicas.
* A significant number of decisions would have to be made regarding how we
would shape all the services to work in a multi-datacenter setting: things
like storage replication (databases, ceph), or multi-region support in both
Cloud VPS and Toolforge.
* Different implementations can have a significant impact on things like budget,
roadmaps or implementation times. For example, a multi-DC wiki-replicas setup
has been identified as potentially requiring a significant budget allocation.
* Depending on the setup, cross-DC network bandwidth may require additional
considerations as well.
Maybe this is obvious, but overall, any multi-DC cloud initiative would
require a clear service need, multi-year goal planning and roadmaps, time
allocation from multiple SRE teams, and non-trivial budget allocations. At
the moment, we have none of these.
regards.
[0] https://phabricator.wikimedia.org/T342750 cloud: introduce a kubernetes
undercloud to run openstack (via openstack-helm)
[1] https://phabricator.wikimedia.org/T341239 Q1:codfw:(5) cloudnet/cloudcontrol
buildout - Config A-10G
[2] https://phabricator.wikimedia.org/T377568 wmcs codfw hardware changes proposal
Hi there,
here are some updates on the state of the Cloud VPS VXLAN/IPv6 project.
First of all, please check the 'initial deploy' [0] page on Wikitech, which
contains valuable information regarding this project. It will likely answer
questions not covered in this email.
== current status ==
The VXLAN/IPv6 setup in codfw1dev is fully operational and working as
expected. This is the desired end state, similar to what we will get in eqiad1.
I have been polishing the last few bits before proceeding with eqiad1, and they
are now ready. To name a few:
* nova-fullstack: support for IPv6
* horizon security group panels: support for IPv6
== next steps ==
I will _very soon_ proceed with enabling VXLAN/IPv6 on eqiad1.
It will remain 'invisible' to users until we enable the options to create VMs
with VXLAN/IPv6 in horizon.
Before making it 'visible', I will work on a number of user-facing documents [1].
The migration will proceed as agreed [2].
regards.
[0] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/IPv6/initial_dep…
[1] https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_VXLAN_IPv6_migration
[2]
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
I've merged (what I think are) the final patches required for using
custom/vanity domains with the Cloud VPS web proxy. Here is an example:
https://wmcs-proxy-test.taavivaananen.fi/
Administrator documentation is available at [0]. [1] is the task
tracking the implementation of this.
[0]:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Web_proxy#Enable…
[1]: https://phabricator.wikimedia.org/T342398
I have not yet documented this in the user-facing docs, because we first
need to decide which projects can use this feature. Historically the use
of custom domains for Cloud VPS projects has been restricted by the fact
that those required a floating IPv4 and we don't have many of those. My
feeling (but I haven't checked) is that the vast majority of granted
requests from the time I've been here have been for affiliates and for
projects that are migrating from some external hosting with an existing
domain to Cloud VPS.
Now that IPv4 scarcity is no longer a factor in this, we could in theory
set up custom domains for everyone who wants one. Are we willing to do
this, or do we want to keep some requirements for having one? In my head
the biggest argument for encouraging/requiring use of *.wmcloud.org is
that it removes a major SPOF: an individual maintainer keeping control of
a vanity domain and then disappearing, leaving the project stuck.
Taavi
Hi there,
we are now tracking some parts of our Cloud VPS infra using opentofu.
We have a repository [0] and some docs on wikitech [1].
As of this writing, we have support for a bunch of resources in tofu-infra, and
we consider it to be the source of truth for at least the following elements:
* nova flavors
* neutron networks, subnets, routers, router ports and security groups
* OpenStack projects
* DNS zones, and some DNS records
Extending coverage to more resource types is on the roadmap [2].
We are in a transition period. A bunch of resources have already been
migrated to tofu-infra, but others will be imported "as we go", because
importing everything in one go would be too heavy.
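For illustration, importing an existing resource "as we go" could look
something like this (a minimal sketch assuming OpenTofu's declarative import
blocks; the resource name and UUID are made up):

  # Hypothetical sketch: adopt an already-existing flavor into tofu-infra
  # state, then run 'tofu plan' to review before applying.
  import {
    to = openstack_compute_flavor_v2.g4_cores2_ram4_disk20
    id = "00000000-0000-0000-0000-000000000000" # the existing flavor's UUID
  }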
That being said, if you see yourself wanting to create or modify any of the
resources mentioned above, you should do so via tofu-infra. Ask for help if
in doubt.
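As a sketch of what such a change might look like (assuming the repo uses the
standard terraform-provider-openstack resources; names and values here are
purely illustrative):

  # Hypothetical example: defining a new nova flavor in tofu-infra.
  resource "openstack_compute_flavor_v2" "g4_cores2_ram4_disk20" {
    name  = "g4.cores2.ram4.disk20"
    vcpus = 2
    ram   = 4096 # MB
    disk  = 20   # GB
  }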
Be warned that some cookbooks, docs or other code bits may need updates.
Small regressions in some of our admin workflows are somewhat expected, as
you may be the first one to, e.g., create a new project or a new flavor
using tofu-infra.
Additionally, I have been conducting a few cleanups in codfw1dev [4], for stuff
like projects and security groups, with the goal of making this tofu-infra
transition a bit less confusing.
Also note that a refactor of the tofu-infra repo is incoming [3], although
that should not affect which resources we track, only how the code is
organized.
[0] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/
[1] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/OpenTofu
[2] https://phabricator.wikimedia.org/T370037
[3] https://phabricator.wikimedia.org/T375283
[4] https://phabricator.wikimedia.org/T375604
Hi there admins,
as part of the work to replace the VLAN network with a VXLAN-based one [0],
I have changed some horizon settings [1] so that new VMs created via horizon
will have their networking (addressing) configured from the VXLAN network.
Also, as part of the VXLAN migration we will try to introduce IPv6 as well. A
dedicated wikitech page has been created to track this [2].
This only affects codfw1dev, for now. The eqiad1 deployment will follow when we
gain some additional confidence.
If you detect anything weird, please let me know.
regards.
[0] https://phabricator.wikimedia.org/T364725
[1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073163
[2] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/IPv6/initial_dep…
I will be on sabbatical starting at the end of next week (Saturday the
7th) until November 11th.
As always, the best way to find support about cloud-vps questions is in
#wikimedia-cloud or #wikimedia-cloud-admin, but for more specifics you
can consult my continuity page at
https://office.wikimedia.org/wiki/User:Andrew_Bogott/continuity. In a
pinch you can also reach out to my manager, Joanna Borun.
In addition to some intensive leisure, I plan to spend some of my time
away volunteering with voter protection projects in the lead-up to the
US elections, which accounts for the very specific timing of my absence.
-Andrew
Hey folks,
In case you did not see the update from Tyler already [0], both Gerrit
and GitLab will stay around. The TL;DR is that there are a few
repositories that must stay in Gerrit (from our perspective, most
notably puppet.git), but for the rest of our repositories we're free
to choose which code host we want to use. Here's a quick proposal for
what to do:
Our Toolforge related repositories are mostly on GitLab, and they're
making heavy use of GitLab's CI features. I think keeping those there
is the best option for now, and we should move Striker and
labs/toollabs.git there for consistency.
The wmcs-cookbooks repo should stay in Gerrit. That repository is
primarily used by SREs in conjunction with the Puppet repository, which
is staying in Gerrit. Similarly, I think we should move the new Cloud
VPS tofu-infra repository to Gerrit, as it is also used for SRE
workflows, and the ability to merge individual patches in a stack is
useful there, similar to how it is on the Puppet repository.
For metricsinfra, we should either migrate the tofu-provisioning
repository from GitLab to Gerrit (which is my preference), or migrate
the prometheus-* repos from Gerrit to GitLab to keep everything
related to that project in one place.
Finally, I think we should move the few repositories we have
canonically on GitHub to GitLab.
Thoughts? I'm happy to draft a formal decision request for my
proposals, although I'm hoping this is simple and uncontroversial
enough to not require one.
[0]: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/…
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
Hi there,
tomorrow 2024-06-26 @ 08:30Z we will start enforcing new Kubernetes security
rules in Toolforge [0].
We have taken measures to eliminate any user impact, but this being a
potentially sensitive change, I wanted to send a heads up email.
In a nutshell, pod-related kubernetes resources, like Deployment or CronJob,
need to have a new set of security-related attributes correctly specified.
This is because we are introducing Kyverno policies as a replacement for the
deprecated PodSecurityPolicies (PSP) [1].
The new Kyverno policies have been deployed already, but are in 'Audit' mode.
What we will be doing tomorrow is setting them to 'Enforce', which is the final
step in this migration, before we can finally drop PSP [2].
Please, report [3] any disruption that you may see.
regards.
[0] https://phabricator.wikimedia.org/T368141
[1] https://phabricator.wikimedia.org/T279110
[2] https://phabricator.wikimedia.org/T364297
[3] https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_communication
dr0ptp4kt (Adam Baso, WMF Staff) requested [0] to be added to the
maintainers of the "admin" tool in Toolforge, in order to gather some
statistics.
This is a subset of the "Tool root" permissions [1] that are usually
assigned to users who need to do administrative work in Toolforge.
Given Adam's needs are more limited, I don't think we need to add any
permissions other than membership of the "admin" tool.
I'm following the Cloud Services Application Process [2] and sending
this email to communicate the change to other Cloud admins. The policy
recommends a one-week comment period, after which "anyone other than
the applicant may implement the rights change".
[0] https://phabricator.wikimedia.org/T364761
[1] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#What_makes_a_roo…
[2] https://wikitech.wikimedia.org/wiki/Help:Access_policies#Application_Process
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation