Hi cloud-admin@,
The recent cloud@ thread made me realize that I should probably keep everyone else more up to date on the infrastructure-level projects I'm working on by myself. So I've tried to summarize the major recent and upcoming changes I'm working on below in semi-random order.
Please let me know if you find this useful or interesting (or if you don't, it helps to know that too). Questions and comments are also welcome.
Terraform
I sent Puppet patches[0] to enable application credential authentication in Keystone to let arbitrary clients speak to the OpenStack APIs. I believe Andrew is working on the firewall rules and related HAProxy config to open up the APIs to the public as a part of the Cumin/Spicerack work going on at the moment.
I tagged the initial version of the custom terraform-cloudvps Terraform provider.[1] The provider is designed to supplement the 'official' OpenStack provider and currently lets you interact with the web proxy API using the new go-cloudvps library[2], with Puppet ENC support next up on my Terraform TODO list.
There's also a Puppet patch[3] pending to configure a self-hosted Terraform registry on terraform.wmcloud.org. It's cherry-picked to the project puppet master, but having it actually merged would be nice.
[0]: https://gerrit.wikimedia.org/r/c/operations/puppet/+/840121
[1]: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/terraform-cloudvps
[2]: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/go-cloudvps
[3]: https://gerrit.wikimedia.org/r/c/operations/puppet/+/834344
CloudVPS web proxy
Planning on doing some work to make the proxy service more reliable in case of node failure. Also planned is moving the current SQLite database to the cloudinfra MariaDB cluster for reliability / easier failover purposes. There are a few Puppet patches prepping for this pending review, starting from [4].
[4]: https://gerrit.wikimedia.org/r/c/operations/puppet/+/831041
Toolforge
Sent a few patches to the jobs-framework-* repositories. Planning to do a bit more cleanup here, to hopefully make the grid migration easier.
I'd like to introduce a new k8s utility, kube-container-updater[5], to automatically restart long-running containers that are running outdated images.
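As a rough illustration of the idea (a minimal sketch, not the actual kube-container-updater code; the data model, function name and digests below are all hypothetical), the core check is to compare the image digest a container was started from against the digest its image reference currently resolves to:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningContainer:
    """Hypothetical, simplified view of one container in a pod."""
    pod: str
    image: str           # image reference requested by the pod spec
    running_digest: str  # digest of the image the container was started from


def find_outdated(containers, current_digests):
    """Return the sorted names of pods running at least one outdated image.

    current_digests maps an image reference to the digest that reference
    resolves to in the registry right now. Images missing from the mapping
    are skipped, since there is no way to tell whether they are outdated.
    """
    outdated = set()
    for c in containers:
        latest = current_digests.get(c.image)
        if latest is not None and latest != c.running_digest:
            outdated.add(c.pod)
    return sorted(outdated)


if __name__ == "__main__":
    # Made-up example data: tool-a was started before its image was rebuilt.
    containers = [
        RunningContainer("tool-a", "registry.example/base:latest", "sha256:old"),
        RunningContainer("tool-b", "registry.example/base:latest", "sha256:new"),
    ]
    current = {"registry.example/base:latest": "sha256:new"}
    print(find_outdated(containers, current))
```

A real tool would have to collect both sides of the comparison from the Kubernetes API and the image registry and then restart the affected pods; that part is left out here.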
Upgrading to Kubernetes 1.22 is only blocked on dealing with certificate generation for the custom webhooks[6]. For this, I'd like to get feedback on the approach (continue to manually sign certificates, or introduce cert-manager to automate that). Looking further ahead, 1.23 should be fairly simple, I think, and 1.24 will require migrating the cluster from Docker to containerd, which I'd like to pair with a bullseye upgrade.
Once we have an object storage service I'd like to look a bit more into providing a logging solution that doesn't use NFS.
[5]: https://gerrit.wikimedia.org/r/c/cloud/toolforge/kube-container-updater/+/82...
[6]: https://phabricator.wikimedia.org/T286856
metricsinfra
No recent development here. I think we could roll out Prometheus scraping to all projects and instances with the current infra, but for that someone would need to sort out how security groups should be handled given the pull model Prometheus uses. Some discussion about this is in Phabricator[7].
The next item on the metricsinfra roadmap is building an API to let projects manage their scraping rules and alerts. I'd like to integrate that with Terraform at some point.
[7]: https://phabricator.wikimedia.org/T288108
Puppet ENC service
Planning to do some work[8] on the ENC API service, mostly to make it work with Terraform. Most notably the Git integration will be moved from the Horizon dashboard to the API service itself.
[8]: https://phabricator.wikimedia.org/T317478
ToolsDB
No recent developments here either. :(
Thanks for the update, Taavi! It's great to have a summary, and these are all great-sounding projects. I will try to catch up on the terraform patches today or tomorrow.
Two tangential points:
1) I can't emphasize enough how much we all appreciate the work you're doing on our infra, and how much you're benefiting our projects. We're always chatting about how much good work you're doing and bragging to other teams about your contributions.
2) We're fairly careful to not actually direct your work, as I don't want to overstep the boundary between staff and volunteer -- your #1 priority should always be to work on whatever you find most fun and interesting at the moment. That said, if you /do/ want more coordination (e.g. invites to in-person meetings, access to quarterly planning docs, etc) you (or anyone else following along) should just say the word and we'll figure out more ways to loop you in.
Thanks again!
-Andrew
On 10/9/22 6:19 AM, Taavi Väänänen wrote:
Cloud-admin mailing list -- cloud-admin@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/cloud-admin.lists.wikimedia.org/
On 11/10/2022 16:47, Andrew Bogott wrote:
> Thanks for the update, Taavi! It's great to have a summary, and these are all great-sounding projects. I will try to catch up on the terraform patches today or tomorrow.
> Two tangential points:
> 1) I can't emphasize enough how much we all appreciate the work you're doing on our infra, and how much you're benefiting our projects. We're always chatting about how much good work you're doing and bragging to other teams about your contributions.
:-) Not sure if I've mentioned this before anywhere, but one of the primary reasons I'm around is that you let me play around with cool stuff that would just be impractical to do in a personal lab environment (mostly due to the cost).
> 2) We're fairly careful to not actually direct your work, as I don't want to overstep the boundary between staff and volunteer -- your #1 priority should always be to work on whatever you find most fun and interesting at the moment. That said, if you /do/ want more coordination (e.g. invites to in-person meetings, access to quarterly planning docs, etc) you (or anyone else following along) should just say the word and we'll figure out more ways to loop you in.
I do indeed like being able to work on the things I like the most at the moment, and certainly 'not enough things to work on' is not a problem I've had around here recently.
That being said, I am still interested in at least taking a look at the quarterly plans, partly because I'm curious about how your plans relate to what I'm interested in and partly because I too have tools of my own and am very interested in using some new features (Swift, for example).
Taavi