Hi,
the WMCS team just had a meeting dedicated to tofu-infra [0], workflows, scope,
roadmaps and use cases.
This is a summary.
* The current state and short/mid-term roadmaps were shared; these include
refactoring the tofu-infra repo and adding support for more resource types,
for example quotas (see the sketch after this list).
* It was mentioned that user assignment is perhaps not well tracked in
tofu-infra, because that list can be modified at any time by end users.
* We more or less agreed that the scope of the tofu-infra repository [1] is to
track state for admin-controlled openstack resources. User-managed resources
are explicitly out of the scope of tofu-infra. Tracking Toolforge resources
(k8s VMs and such) is also out of scope for the tofu-infra repository.
* There is some controversy regarding tracking project resources, and the
project definition itself. There is a real concern with project deletion,
because it would leave orphaned resources behind.
* Because of the above, we have been discussing _not_ tracking project
definitions in tofu-infra at all, and instead automating project deletion via
a cookbook that takes care of removing potentially orphaned resources.
* Obviously, not tracking projects comes with the downside of... well, not
having gitops for project definitions, which is something we kind of want to
have.
* We have been discussing potential solutions, including:
** extending wmfkeystonehook to prevent deleting projects if they have resources
** expand our suite of 'leak detector' scripts to report orphaned resources
** having some kind of background daemon automatically cleaning up orphaned
resources
* A firm decision on what to do with project definitions has not been made,
which means we keep the current status quo for now (they are tracked in
tofu-infra).
* We have been discussing ideas and options for integrating tofu-infra with
the cookbooks, with a variety of opinions on whether that is desirable, and
on whether tofu-infra is ready to have automation built on top of it.
* The use case (or desire) of enabling our clinic-duty workflows for
non-technical, non-SRE folks in the organization was mentioned. While the use
case feels right, some details need clarification, because these folks may
not have the required access level after all, for example to run cookbooks.
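As a rough illustration of the quota support mentioned in the first bullet
(a purely hypothetical sketch, not the actual repo layout; the project name
and values are made up), using the standard terraform-provider-openstack
quota resources:

  # Hypothetical sketch: a project and its compute quota in tofu-infra.
  resource "openstack_identity_project_v3" "toolsbeta" {
    name = "toolsbeta"
  }

  resource "openstack_compute_quotaset_v2" "toolsbeta" {
    project_id = openstack_identity_project_v3.toolsbeta.id
    instances  = 16
    cores      = 64
    ram        = 131072 # MB
  }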
Finally, it was agreed that further discussions should happen on this topic soon.
regards.
[0] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/OpenTofu
[1] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/
Hi,
here is what I believe to be the current status of the 'wikimedia
multi-datacenter cloud services' topic.
In 2023 there was an attempt to create a new openstack deployment in codfw.
This new deployment would have used a kubernetes undercloud instead of running
openstack directly on hardware [0]. We went as far as procuring hardware for
it [1]; this hardware would have been for the undercloud Kubernetes.
Later, that project was cancelled, and it has been proposed to repurpose the
hardware [2]. As of this writing, WMCS engineers are no longer actively
working on (or thinking about) expanding into another datacenter.
In the past, some of the major questions related to expanding into more DCs have
been related to:
* hardware budget
* engineering time and team roadmap
* services and product roadmap
* some implementation details that affect all of the above
Regarding hardware budget:
* A buildout of a cloud needs to be associated with a significant budget
allocation. Racks, switches, servers, storage, etc.
* Additionally, there are concerns about rack space availability for an
increased WMCS footprint in any datacenter, beyond a few servers.
Regarding engineering time and team roadmap:
* the WMCS team roadmap (how we use our engineering time) does not currently
contain any multi-datacenter work. This is not on the radar for the short/mid
term.
* needless to say, working on any multi-datacenter implementation requires
significant engineering time, and most likely a multi-year commitment in
terms of team goals.
* additionally, this may require increased involvement and coordination with
other teams: DCops, NetOps, Data Platform Engineering, Data Persistence, etc.
Services and product roadmap:
* The primary goal for any multi-datacenter setting is to offer increased
availability for cloud services
* It is not clear that the current availability levels (single DC) are
inadequate. It is not clear that this is the improvement our services need
right now, in a way that should be prioritized over other efforts.
* Therefore, the services and product roadmap does not currently reflect any
multi-datacenter work.
Some implementation details that affect all of the above:
* Cloud VPS and Toolforge don't have a definition of how they should be offered
to clients in a multi-datacenter fashion. The same applies to other services,
like wiki-replicas.
* A significant number of decisions would have to be made regarding how we
would shape all the services to work in a multi-datacenter setting: things
like storage replication (databases, ceph), or multi-region support in both
Cloud VPS and Toolforge.
* Different implementations can have a significant impact on things like budget,
roadmaps or implementation times. For example, a multi-DC wiki-replicas setup
has been identified as potentially requiring a significant budget allocation.
* Depending on the setup, cross-DC network bandwidth may require additional
considerations as well.
Maybe this is obvious, but overall, any multi-DC cloud initiative would
require a clear service need, multi-year goal planning and roadmaps, time
allocation from multiple SRE teams, and non-trivial budget allocations. At
the moment, we have none of these.
regards.
[0] https://phabricator.wikimedia.org/T342750 cloud: introduce a kubernetes
undercloud to run openstack (via openstack-helm)
[1] https://phabricator.wikimedia.org/T341239 Q1:codfw:(5) cloudnet/cloudcontrol
buildout - Config A-10G
[2] https://phabricator.wikimedia.org/T377568 wmcs codfw hardware changes proposal
Hi there,
here are some updates on the state of the Cloud VPS VXLAN/IPv6 project.
First of all, please check the 'initial deploy' [0] page on Wikitech, which
contains valuable information regarding this project. It will likely answer
questions not covered in this email.
== current status ==
The VXLAN/IPv6 setup in codfw1dev is fully operational and working as
expected. This is the desired end state, similar to what we will get in eqiad1.
I have been polishing the last few bits before proceeding with eqiad1, and they
are now ready. To name a few:
* nova-fullstack: support for IPv6
* horizon security group panels: support for IPv6
== next steps ==
I will _very soon_ proceed with enabling VXLAN/IPv6 on eqiad1.
It will remain 'invisible' to users until we enable the options to create VMs
with VXLAN/IPv6 in horizon.
Before making it 'visible', I will work on a number of user-facing documents [1].
The migration will proceed as agreed [2].
regards.
[0] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/IPv6/initial_dep…
[1] https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_VXLAN_IPv6_migration
[2]
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
I've merged (what I think are) the final patches required for using
custom/vanity domains with the Cloud VPS web proxy. Here is an example:
https://wmcs-proxy-test.taavivaananen.fi/
Administrator documentation is available at [0]. [1] is the task
tracking the implementation of this.
[0]:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Web_proxy#Enable…
[1]: https://phabricator.wikimedia.org/T342398
I have not yet documented this in the user-facing docs, because we first
need to decide which projects can use this feature. Historically the use
of custom domains for Cloud VPS projects has been restricted by the fact
that those required a floating IPv4 and we don't have many of those. My
feeling (but I haven't checked) is that the vast majority of granted
requests from the time I've been here have been for affiliates and for
projects that are migrating from some external hosting with an existing
domain to Cloud VPS.
Now that IPv4 scarcity is no longer a factor in this, we could in theory
set up custom domains for everyone who wants one. Are we willing to do
this, or do we want to keep some requirements for having one? In my head
the biggest argument for encouraging/requiring use of *.wmcloud.org is
that it removes a major SPOF: an individual maintainer keeping control of
a vanity domain and then disappearing, leaving the project stuck.
Taavi
Hi there,
we are now tracking some parts of our Cloud VPS infra using opentofu.
We have a repository [0] and some docs on wikitech [1].
As of this writing, we have support for a bunch of resources in tofu-infra, and
we consider it to be the source of truth for at least the following elements:
* nova flavors
* neutron networks, subnets, routers, router ports and security groups
* OpenStack projects
* DNS zones, and some DNS records
Extending coverage to more resource types is on the roadmap [2].
We are in a transition period. A bunch of resources have already been
migrated to tofu-infra, but others will be imported "as we go", because
importing everything in one go would be too heavy.
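For illustration, importing an existing resource "as we go" could look
something like this (a minimal sketch assuming OpenTofu's declarative import
blocks; the resource name and UUID are made up):

  # Hypothetical sketch: adopt an already-existing flavor into tofu-infra
  # state, then run 'tofu plan' to review before applying.
  import {
    to = openstack_compute_flavor_v2.g4_cores2_ram4_disk20
    id = "00000000-0000-0000-0000-000000000000" # the existing flavor's UUID
  }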
That being said, if you see yourself wanting to create or modify any of the
resources mentioned above, you should do so via tofu-infra. Ask for help if
in doubt.
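As a sketch of what such a change might look like (assuming the repo uses the
standard terraform-provider-openstack resources; names and values here are
purely illustrative):

  # Hypothetical example: defining a new nova flavor in tofu-infra.
  resource "openstack_compute_flavor_v2" "g4_cores2_ram4_disk20" {
    name  = "g4.cores2.ram4.disk20"
    vcpus = 2
    ram   = 4096 # MB
    disk  = 20   # GB
  }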
Be warned that some cookbooks, docs or other code bits may need updates.
Small regressions in some of our admin workflows are somewhat expected, as
you may be the first one to, e.g., create a new project or a new flavor
using tofu-infra.
Additionally, I have been conducting a few cleanups in codfw1dev [4], for stuff
like projects and security groups, with the goal of making this tofu-infra
transition a bit less confusing.
Also note that a refactor of the tofu-infra repo is incoming [3], although
that should not affect which resources we track, only how the code is
organized.
[0] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/
[1] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/OpenTofu
[2] https://phabricator.wikimedia.org/T370037
[3] https://phabricator.wikimedia.org/T375283
[4] https://phabricator.wikimedia.org/T375604
Hi there admins,
as part of the work to replace the VLAN network with a VXLAN-based one [0],
I have changed some horizon settings [1] so that new VMs created via horizon
will have their networking (addressing) configured from the VXLAN network.
Also, as part of the VXLAN migration we will try to introduce IPv6 as well. A
dedicated wikitech page has been created to track this [2].
This only affects codfw1dev, for now. The eqiad1 deployment will follow when we
gain some additional confidence.
If you detect anything weird, please let me know.
regards.
[0] https://phabricator.wikimedia.org/T364725
[1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073163
[2] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/IPv6/initial_dep…
I will be on sabbatical starting at the end of next week (Saturday the
7th) until November 11th.
As always, the best way to find support about cloud-vps questions is in
#wikimedia-cloud or #wikimedia-cloud-admin, but for more specifics you
can consult my continuity page at
https://office.wikimedia.org/wiki/User:Andrew_Bogott/continuity. In a
pinch you can also reach out to my manager, Joanna Borun.
In addition to some intensive leisure, I plan to spend some of my time
away volunteering with voter protection projects in the lead-up to the
US elections, which accounts for the very specific timing of my absence.
-Andrew
Hey folks,
In case you did not see the update from Tyler already [0], both Gerrit
and GitLab will stay around. The TL;DR is that there are a few
repositories that must stay in Gerrit (from our perspective, most
notably puppet.git), but for the rest of our repositories we're free
to choose which code host we want to use. Here's a quick proposal for
what to do:
Our Toolforge related repositories are mostly on GitLab, and they're
making heavy use of GitLab's CI features. I think keeping those there
is the best option for now, and we should move Striker and
labs/toollabs.git there for consistency.
The wmcs-cookbooks repo should stay in Gerrit. That repository is
primarily used by SREs in conjunction with the Puppet repository, which
is staying in Gerrit. Similarly, I think we should move the new Cloud
VPS tofu-infra repository to Gerrit, as it is also used for SRE
workflows, and the ability to merge individual patches in a stack is
useful there, similar to how it is on the Puppet repository.
For metricsinfra, we should either migrate the tofu-provisioning
repository from GitLab to Gerrit (which is my preference), or migrate
the prometheus-* repos from Gerrit to GitLab to keep everything
related to that project in one place.
Finally, I think we should move the few repositories we have
canonically on GitHub to GitLab.
Thoughts? I'm happy to draft a formal decision request for my
proposals, although I'm hoping this is simple and uncontroversial
enough to not require one.
[0]: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/…
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
Hi there,
tomorrow 2024-06-26 @ 08:30Z we will start enforcing new Kubernetes security
rules in Toolforge [0].
We have taken measures to eliminate any user impact, but this being a
potentially sensitive change, I wanted to send a heads up email.
In a nutshell, pod-related kubernetes resources, like Deployment or CronJob,
need to have a new set of security-related attributes correctly specified.
This is because we are introducing Kyverno policies as a replacement for the
deprecated PodSecurityPolicies (PSP) [1].
The new Kyverno policies have been deployed already, but are in 'Audit' mode.
What we will be doing tomorrow is setting them to 'Enforce', which is the final
step in this migration, before we can finally drop PSP [2].
Please, report [3] any disruption that you may see.
regards.
[0] https://phabricator.wikimedia.org/T368141
[1] https://phabricator.wikimedia.org/T279110
[2] https://phabricator.wikimedia.org/T364297
[3] https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_communication
dr0ptp4kt (Adam Baso, WMF Staff) requested [0] to be added to the
maintainers of the "admin" tool in Toolforge, in order to gather some
statistics.
This is a subset of the "Tool root" permissions [1] that are usually
assigned to users who need to do administrative work in Toolforge.
Given Adam's needs are more limited, I don't think we need to add any
permissions other than membership of the "admin" tool.
I'm following the Cloud Services Application Process [2] and sending
this email to communicate the change to other Cloud admins. The policy
recommends a one-week comment period, after which "anyone other than
the applicant may implement the rights change".
[0] https://phabricator.wikimedia.org/T364761
[1] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#What_makes_a_roo…
[2] https://wikitech.wikimedia.org/wiki/Help:Access_policies#Application_Process
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation