Hi,
Here is what I believe to be the current status of the topic 'Wikimedia multi-datacenter cloud services'.
In 2023 there was an attempt to set up a new OpenStack deployment in codfw. This new deployment would have used a Kubernetes undercloud instead of running OpenStack directly on hardware [0]. We got to the point of procuring hardware for it [1]. This hardware would have been for the undercloud Kubernetes.
Later, that project was cancelled. It has been proposed that the hardware be repurposed [2]. As of this writing, the WMCS engineers are no longer actively working on (or thinking about) expanding into another datacenter.
In the past, some of the major questions about expanding into more DCs have been related to:
* hardware budget
* engineering time and team roadmap
* services and product roadmap
* some implementation details that affect all of the above
Regarding hardware budget:
* A buildout of a cloud needs to be associated with a significant budget allocation: racks, switches, servers, storage, etc.
* Additionally, there are concerns about rack space availability for an increased WMCS footprint in whatever datacenter, beyond a few servers.
Regarding engineering time and team roadmap:
* The WMCS team roadmap (how we use our engineering time) does not contain any multi-datacenter work at the moment. This is not on the radar for the short or mid term.
* Needless to say, working on any multi-datacenter implementation requires significant engineering time, and most likely a multi-year commitment in terms of team goals and the like.
* Additionally, this may require increased involvement from and coordination with other teams: DCops, NetOps, Data Platform Engineering, Data Persistence, etc.
Services and product roadmap:
* The primary goal of any multi-datacenter setup is to offer increased availability for cloud services.
* It is not clear that the current availability levels (single DC) are inadequate. It is not clear that this is the improvement our services need right now, or that it should be prioritized over other efforts.
* Therefore, the services and product roadmap does not currently reflect any multi-datacenter work.
Some implementation details that affect all of the above:
* Cloud VPS and Toolforge don't have a definition of how they should be offered to clients in a multi-datacenter fashion. The same applies to other services, like wiki-replicas.
* A significant number of decisions would need to be made regarding how we would shape all the services to work in a multi-datacenter setting. Things like storage replication (databases, Ceph), or multi-region support in both Cloud VPS and Toolforge.
* Different implementations can have a significant impact on things like budget, roadmaps or implementation times. For example, a multi-DC wiki-replicas setup has been identified as potentially requiring a significant budget allocation.
* Depending on the setup, cross-DC network bandwidth may require additional considerations as well.
Maybe this is obvious, but overall, any multi-DC cloud initiative could be seen as requiring a clear service need, multi-year goal planning and a roadmap, time allocation from multiple SRE teams, and non-trivial budget allocations. At the moment, we have none of these.
Regards.
[0] https://phabricator.wikimedia.org/T342750 cloud: introduce a kubernetes undercloud to run openstack (via openstack-helm)
[1] https://phabricator.wikimedia.org/T341239 Q1:codfw:(5) cloudnet/cloudcontrol buildout - Config A-10G
[2] https://phabricator.wikimedia.org/T377568 wmcs codfw hardware changes proposal