Hi,
the WMCS team just had a meeting dedicated to tofu-infra [0], workflows, scope,
roadmaps and use cases.
This is a summary.
* The current state and short/mid-term roadmap were shared, which includes
refactoring the tofu-infra repo and adding support for more resources, for
example quotas.
* It was mentioned that user assignment is perhaps not well tracked in
tofu-infra, because end users can modify that list at any time.
* We roughly agreed that the scope of the tofu-infra repository [1] is to track
state for admin-controlled openstack resources. User-managed resources are
explicitly out of the scope of tofu-infra. Tracking Toolforge resources (k8s VMs
and such) is also out of the scope of this repository.
* There is some controversy regarding tracking project resources, and the
project definitions themselves. There is a real concern with project deletion,
because it will leave orphaned resources behind.
* Because of the above point, we have been discussing _not_ tracking project
definitions in tofu-infra at all, and instead automating project deletion via a
cookbook that takes care of removing potentially orphaned resources.
* Obviously, not tracking projects comes with the downside of... well, not having
gitops for project definitions, which is something we would like to have.
* We have been discussing potential solutions, including:
** extending wmfkeystonehook to prevent deleting projects if they have resources
** expand our suite of 'leak detector' scripts to report orphaned resources
** having some kind of background daemon automatically cleaning up orphaned
resources
* A firm decision on what to do with project definitions has not been made,
which means we keep the current status quo (they are tracked in tofu-infra).
* We have been discussing ideas and options for integrating tofu-infra with the
cookbooks, with a variety of opinions on whether that is convenient, and whether
tofu-infra is ready to have automation built on top of it.
* The use case, or desire, of enabling our clinic-duty workflows for
non-technical, non-SRE folks in the organization was mentioned. While the use
case feels right, some details should be clarified, because these folks may not
have the required access level after all, for example to run cookbooks.
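To illustrate the 'leak detector' idea mentioned above, here is a minimal sketch
of the core check: given the set of projects that still exist and an inventory
of resources, flag any resource whose owning project is gone. All names and the
resource model are hypothetical; a real script would fetch this data from the
openstack APIs (e.g. via openstacksdk) rather than from in-memory lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical minimal model of an openstack resource."""
    id: str
    kind: str        # e.g. "server", "volume", "floating-ip"
    project_id: str  # project that owns this resource


def find_orphaned(resources, existing_project_ids):
    """Return resources whose owning project no longer exists.

    A resource is considered orphaned ("leaked") when its project_id
    does not match any currently existing project.
    """
    existing = set(existing_project_ids)
    return [r for r in resources if r.project_id not in existing]


# Example: project "p2" was deleted, but a volume still references it.
resources = [
    Resource("vm-1", "server", "p1"),
    Resource("vol-9", "volume", "p2"),
]
orphans = find_orphaned(resources, ["p1"])
for r in orphans:
    print(f"orphaned {r.kind}: {r.id} (project {r.project_id})")
```

The same check could run either as a periodic report (the 'leak detector'
script variant) or as the core of a background cleanup daemon.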
Finally, it was agreed that further discussions should happen on this topic soon.
regards.
[0] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/OpenTofu
[1] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/
Hi,
find here what I believe is the 'current status' regarding the topic 'wikimedia
multi-datacenter cloud services'.
In 2023 there was an attempt to build a new openstack deployment in codfw. This
new deployment would have used a kubernetes undercloud instead of running
openstack directly on hardware [0]. We went as far as procuring hardware for
it [1]; this hardware would have been for the undercloud Kubernetes.
Later, that project was cancelled. It has been proposed to repurpose the
hardware [2]. As of this writing, the WMCS engineers are no longer actively
working on (or thinking about) expanding into another datacenter.
In the past, some of the major questions related to expanding into more DCs have
been related to:
* hardware budget
* engineering time and team roadmap
* services and product roadmap
* some implementation details that affect all of the above
Regarding hardware budget:
* A buildout of a cloud needs to be associated with a significant budget
allocation. Racks, switches, servers, storage, etc.
* Additionally, there are concerns about rack space availability for an
increased WMCS footprint in any given datacenter, beyond a few servers.
Regarding engineering time and team roadmap:
* the WMCS team roadmap (how we use our engineering time) does not currently
contain any multi-datacenter work. This is not on the radar for the short/mid
term.
* needless to say, working on any multi-datacenter implementation requires
significant engineering time, and most likely a multi-year commitment in terms
of team goals.
* additionally, this may require increased involvement and coordination with
other teams: DCops, NetOps, Data Platform Engineering, Data Persistence, etc.
Services and product roadmap:
* The primary goal of any multi-datacenter setting is to offer increased
availability for cloud services.
* It is not clear that the current availability levels (single DC) are
inadequate. It is not clear that this is the improvement our services need
right now, in a way that should be prioritized over other efforts.
* Therefore, the services and product roadmap does not currently reflect any
multi-datacenter work.
Some implementation details that affect all of the above:
* Cloud VPS and Toolforge don't have a definition of how they should be offered
to clients in a multi-datacenter fashion. The same applies to other services,
like wiki-replicas.
* A significant number of decisions would need to be made regarding how we
would shape all the services to work in a multi-datacenter setting: things like
storage replication (databases, ceph), or multi-region support in both Cloud
VPS and Toolforge.
* Different implementations can have a significant impact on things like budget,
roadmaps or implementation times. For example, a multi-DC wiki-replicas setup
has been identified as potentially requiring a significant budget allocation.
* Depending on the setup, cross-DC network bandwidth may require additional
considerations as well.
Maybe this is obvious, but overall, any multi-DC cloud initiative would require
a clear service need, multi-year goal planning and a roadmap, time allocation
from multiple SRE teams, and non-trivial budget allocations. At the moment, we
have none of these.
regards.
[0] https://phabricator.wikimedia.org/T342750 cloud: introduce a kubernetes
undercloud to run openstack (via openstack-helm)
[1] https://phabricator.wikimedia.org/T341239 Q1:codfw:(5) cloudnet/cloudcontrol
buildout - Config A-10G
[2] https://phabricator.wikimedia.org/T377568 wmcs codfw hardware changes proposal
Hi there,
here are some updates on the state of the Cloud VPS VXLAN/IPv6 project.
First of all, please check the 'initial deploy' [0] page on Wikitech, which
contains valuable information regarding this project. It will certainly contain
answers to questions not covered in this email.
== current status ==
The VXLAN/IPv6 setup in codfw1dev is fully operational and working as expected;
it is at the desired end state, similar to what we will get in eqiad1.
I have been polishing the last few bits before proceeding with eqiad1, and they
are now ready. To name a few:
* nova-fullstack: support for IPv6
* horizon security groups panels: support for IPv6
== next steps ==
I will _very soon_ proceed with enabling VXLAN/IPv6 on eqiad1.
It will remain 'invisible' to users until we enable the option to create VMs
with VXLAN/IPv6 in horizon.
Before making it 'visible', I will work on a number of user-facing documents [1].
The migration will proceed as agreed [2].
regards.
[0] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/IPv6/initial_dep…
[1] https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_VXLAN_IPv6_migration
[2]
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
I've merged (what I think are) the final patches required for using
custom/vanity domains with the Cloud VPS web proxy. Here is an example:
https://wmcs-proxy-test.taavivaananen.fi/
Administrator documentation is available at [0]. [1] is the task
tracking the implementation of this.
[0]:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Web_proxy#Enable…
[1]: https://phabricator.wikimedia.org/T342398
I have not yet documented this in the user-facing docs, because we first
need to decide which projects can use this feature. Historically, the use
of custom domains for Cloud VPS projects has been restricted by the fact
that they required a floating IPv4 address, and we don't have many of those.
My feeling (but I haven't checked) is that the vast majority of granted
requests during my time here have been for affiliates and for projects
migrating from external hosting with an existing domain to Cloud VPS.
Now that IPv4 scarcity is no longer a factor, we could in theory set up
custom domains for everyone who wants one. Are we willing to do this, or
do we want to keep some requirements for having one? In my head, the
biggest argument for encouraging/requiring use of *.wmcloud.org is that
it removes a major SPOF: an individual maintainer controlling a vanity
domain and then disappearing, leaving the project stuck.
Taavi