Hi there,
We are currently working on replacing older hardware servers with newer
ones, in particular those dedicated to cloud networking [0].
We have discovered a few shortcomings, mostly related to network
interface naming on the newer servers, the latest OpenStack version
behaving differently than it used to, and some base operating
system (Debian) bugs [1]. Some of these are hardware-dependent and
difficult to reproduce or anticipate in our staging environment.
The result is that the migration is more challenging and noisier than
we would like. We have already had a few (brief) network outages while
introducing the new servers into service.
We'll try to keep things as stable as possible over the next few days
until the migration is completed, but we can't rule out some more
(brief) network outages until we are safely on the other side of the
transition.
I'll send another note when this network maintenance is over.
regards.
[0] https://phabricator.wikimedia.org/T316284
[1] https://bugs.debian.org/989162
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
In my daily reboots, I'm coming across a few weird VMs that should be on
ceph but instead are using local storage on hypervisors. I've tracked
this down to particular flavors which don't specify the Ceph backend.
For example, here's a 'good' flavor:
root@cloudcontrol1003:~# openstack flavor show g3.cores4.ram8.disk20
+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| access_project_ids | None |
| description | None |
| disk | 20 |
| id | c14b5856-5e6a-4f7a-8125-ace4f616c299 |
| name | g3.cores4.ram8.disk20 |
| os-flavor-access:is_public | True |
| properties | aggregate_instance_extra_specs:ceph='true', quota:disk_read_iops_sec='5000', quota:disk_total_bytes_sec='200000000', quota:disk_write_iops_sec='500' |
| ram | 8192 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 4 |
+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
...and here's a bad one...
root@cloudcontrol1003:~# openstack flavor show g3.cores24.ram122.disk20
+----------------------------+--------------------------------------+
| Field | Value |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| access_project_ids | ['wikiwho'] |
| description | None |
| disk | 20 |
| id | 6207fade-5517-4ac4-a147-6ae0c8fa1384 |
| name | g3.cores24.ram122.disk20 |
| os-flavor-access:is_public | False |
| properties | |
| ram | 124928 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 24 |
+----------------------------+--------------------------------------+
Note the lack of properties set on the latter.
We should definitely investigate giving flavors sensible defaults when
no properties are set -- at this point we're setting those same
properties on almost every flavor, which seems silly. In the meantime,
though, it's important that those properties be specified on new
flavors; otherwise VMs created with those flavors can be consigned to
purgatory and exhibit weird behavior.
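For reference, creating or fixing a flavor with the expected properties
looks roughly like this (the g3.cores8.ram16.disk20 name is just a
made-up example; the property values are copied from the good flavor
above, so double-check before running):

# hypothetical new flavor; adjust name and sizes as needed
openstack flavor create g3.cores8.ram16.disk20 \
  --vcpus 8 --ram 16384 --disk 20 \
  --property aggregate_instance_extra_specs:ceph='true' \
  --property quota:disk_read_iops_sec='5000' \
  --property quota:disk_total_bytes_sec='200000000' \
  --property quota:disk_write_iops_sec='500'

# or, to fix an existing flavor that is missing the properties:
openstack flavor set g3.cores24.ram122.disk20 \
  --property aggregate_instance_extra_specs:ceph='true' \
  --property quota:disk_read_iops_sec='5000' \
  --property quota:disk_total_bytes_sec='200000000' \
  --property quota:disk_write_iops_sec='500'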
The docs about flavor creation are here:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/VM_flavors#Gener…
Most likely that should be duplicated or linked from other places, though.
Thanks!
-A
Hi there,
Although I don't think there is a strong mandate (or policy) to migrate
everything to GitLab just yet, we already have a few repos here and
there. And there is agreement that GitLab is generally beneficial for
us and that we should slowly be paying more attention to it as we go.
Naming can be hard, and organizing stuff based on naming can also be
hard. But we already have some kind of "tree" on gitlab. Let me try to
brain dump the mental image I have, and feel free to share / discuss as
required.
Main entry point:
https://gitlab.wikimedia.org/repos/cloud
The repos/cloud group is our main "directory". Everything that WMCS
(plus collaborators) maintains as part of our base services and
infrastructure should live there. Explicitly excluded is
user-controlled content, like Toolforge tool source code, etc.
Remember, being under /repos/ gives us some additional features (like
trusted CI runners), and moreover /repos/cloud/ already has some
membership defined that will be inherited by all child repos.
This group doesn't contain any repositories directly, but instead
several subgroups. The only exception is a "wikistats" [1] repo that
I'll be asking Daniel Zahn to relocate elsewhere.
Child subgroup:
https://gitlab.wikimedia.org/repos/cloud/toolforge
Everything Toolforge, components for k8s, etc.
The only "official" repo here is
https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx and the
rest are placeholder/tests
Child subgroup:
https://gitlab.wikimedia.org/repos/cloud/cicd
See
https://lists.wikimedia.org/hyperkitty/list/cloud-admin@lists.wikimedia.org…
Child subgroup:
https://gitlab.wikimedia.org/repos/cloud/deb
I just created this the other day to store .deb packaging repos. The
first one is here:
https://gitlab.wikimedia.org/repos/cloud/deb/pkg-prometheus-openstack-expor…
I guess we could introduce other subgroups as we go, imagine for example:
* /repos/cloud/vps/
* /repos/cloud/paws/
* /repos/cloud/quarry/
* /repos/cloud/you_name_it/
Comments welcome.
regards.
[1] https://gitlab.wikimedia.org/repos/cloud/wikistats
--
Arturo Borrero Gonzalez
Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Hi there,
Today 2022-04-06 we're performing some network maintenance operations on
Cloud VPS that could affect all cloud egress/ingress traffic, including
Toolforge. The cuts, if noticeable, should last a few minutes at most.
Some operations were also conducted yesterday (without an email notice
like this one), and some unexpected hiccups occurred; hence today's email.
regards.
--
Arturo Borrero Gonzalez
Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Hi there,
I've been playing this morning with gitlab.w.o and CI, and here is my
proposal on how to work with it:
* have all gitlab-ci related stuff consolidated into a single repository
[0], which includes:
** generic gitlab-ci.yaml config files as required [1]
** Dockerfiles for the images used in the above gitlab-ci.yaml files [2]
* in each repo that requires CI (likely all of them), instead of having
to re-write the gitlab-ci.yaml file every time, "include" it from the
repo above [3]; there is a rough sketch of this below the list.
* due to upstream docker registry rate limits, we need to do some heavy
caching in our docker registry (docker-registry.tools.wmflabs.org),
which even involves scp'ing the base docker image from one's laptop
(something like [4]), because you can't even pull the base images from
docker.io on tools-docker-imagebuilder-01.
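To make the "include" part concrete, the consuming repo's CI file would
look roughly like this (the template file name below is a guess, since I
only have the truncated link above; adjust to whatever we actually name it):

cat > .gitlab-ci.yml <<'EOF'
# pull the shared pipeline definition from the central cicd repo
include:
  - project: 'repos/cloud/cicd/gitlab-ci'
    ref: main
    file: '/py3.9-tox.yml'   # hypothetical template name
EOF

And the manual image-caching dance is more or less the following
(hostnames as mentioned above; the python:3.9 image is just an example):

# on a laptop, where docker.io is reachable
docker pull docker.io/library/python:3.9
docker save docker.io/library/python:3.9 -o python39.tar
scp python39.tar tools-docker-imagebuilder-01:

# on tools-docker-imagebuilder-01
docker load -i python39.tar
docker tag docker.io/library/python:3.9 docker-registry.tools.wmflabs.org/python:3.9
docker push docker-registry.tools.wmflabs.org/python:3.9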
To see a live demo/example of this, here is a successful tox job for a
python 3.9 repository:
https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-framework-api/-/job…
I'd prefer to keep our own CI/CD stuff in a separate repo/docker
registry for now; I don't think there is a unified effort for this,
unlike in Gerrit.
I think this kind of CI stuff is one of the main missing pieces that was
previously preventing us from adopting gitlab for good.
PS: there are a bunch of things to automate here, like base image
maintenance and such. I'll wait to see if this proposed workflow is
something we're interested in.
regards.
[0] https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci
[1]
https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/blob/main/py3.9-b…
[2]
https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/blob/main/py3.9-b…
[3]
https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-framework-api/-/blo…
[4]
https://stackoverflow.com/questions/23935141/how-to-copy-docker-images-from…
--
Arturo Borrero Gonzalez
Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
This meeting coalesced around a few major topics:
* Why not just bring-your-own-container?
** We turn out to have fairly different ideas about what user
experiences we want to support.
** General agreement that we could use more research and/or
documentation about current and future workflows (although some of that
already exists at
https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Our_audiences)
** This question hinges on how sophisticated our users are or aren't.
** Komla suggests that many public cloud platforms provide simple
deployment flows that don't require users to understand container or k8s
details; there's general agreement that we would like to offer something
similar
** Andrew thinks that push-to-deploy should be our 'main priority' and
byo doesn't really address that.
* How/When will we kill off the grid engine?
** We tend to think of this as blocked by push-to-deploy, but perhaps we
should be open to other non-blocking options (e.g. the 'one big
container' migration path)
* What to do about the Stretch->Buster migration?
** Nicholas isn't convinced that we should migrate to Buster if we're
just going to kill the grid eventually anyway. Andrew and Arturo mostly
disagree.
** Probably the migration to Buster isn't a lot of hard mental work,
just building things and throwing pre-existing switches.
** Main blocker for this is allocating time and tasking someone with
doing the work
------ RAW ETHERPAD NOTES ------
== Toolforge next steps meeting 2021-12-14 ==
The approaching deadline (from an announcement email on 2021-09-27) is:
January 1st, 2022:
* Stretch alternatives will be available for tool migration in Toolforge
The proposed agenda follows.
=== Goals (Grid) ===
* Grid engine deprecation is blocked until users can customize container
builds. Buildpacks are intended to address this need.
* The blocker is providing equivalent runtime support for tools on the
Kubernetes cluster as the current grid engine cluster has. Kubernetes
containers are "thin" and tools will need the ability to add libraries
and binaries that are custom to them.
=== Goals (Buildpacks) ===
** Why are we doing this?
*** Allow users to customize k8s images; this makes it easier for users
to migrate off of gridengine
* Arturo asks: why not just bring-your-own-container?
** Bryan answers: because bring-your-own container means containers
without any toolforge integration (e.g. no ldap)
* What about putting that in a base layer?
** That's a build your own container approach, which is what buildpacks
is bringing
** But allowing build-your-own today is simpler than buildpacks for SREs
** Adding complexity for end users
* Building your own docker image adds more complexity for end users
** Buildpacks also limit what you can put in a docker container, so
potentially better security
* How does buildpacks improve security?
** If you let someone else build a docker image, it could run as a
different user and open security holes
** How could it be limited in k8s? The container runs as root on the host
** By building the container in buildpacks, we limit it
** k8s has full control over the runc runtime, so k8s could prevent user
spoofing
* Public clouds have bring-your-own-container, so it must be possible, right?
* Would like to see a list of prioritized user workflows
** What workflows are we enabling with buildpacks?
** Consider looking at
https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Our_audiences
* What's the long-term vision for TF?
** Push to git repo, and it "just works"
** The heroku workflow is built on buildpacks
* What's the concern / fear about buildpacks?
** Complexity, can we find a way to simplify things?
** Lack of adoption (what if we end up the only ones using it)
* Bringing your own container isn't a regression, but it's not a
replacement for existing workflows (aka, we can't kill the grid by
simply adding bring your own container)
** Why?
** Workflow and brainstate. Users will no longer understand how to run
workflows. Running a job is much simpler than building and maintaining a
container to run a job. Real risk of losing tools
* Can we assume complex tool authors are technically capable of building
a container?
** Complexity can be easily introduced in the grid; that doesn't mean
they could build and maintain a container
** `webservice --backend=gridengine start` supports PHP, Python, and
Perl fcgi out of the box.
* Google Cloud comparison -- the industry seems to be moving towards
containers, but users don't need to build a container or even know that
they are running in a container. Build a flask app, run a command, GCE
builds a docker container and runs it (see the sketch at the end of
this section)
** all you need is for the buildpack to 'detect' what runtimes you need
(it can be a file, or checking for a packages.lock or whatever)
* So David et al. -- do buildpacks seem complex?
** Buildpacks are easy-ish. Complexity is introduced by putting
buildpacks into Toolforge
** Tekton / buildpacks PoC is easy
* Build service
** admission controller, some custom resource definitions, hardware. Any
docker registry could be used? Harbor presented some issues in running
outside of k8s
* How much engineering effort to bring that in prod? Is it possible to
bring online this year?
** Depends on resources, and how org handles things
** Yes, if we work on it? :-)
* Priority wise, a push to deploy solution is the most important /
seamless thing we could work on
** Push to deploy toolforge, before grid deprecation even?
** Can't get rid of the grid until reasonable replacement
** Why?
** Because we need to support people today
** Actually need mixed runtime environment deployable on k8s
* For example, 1 giant container that contains everything from a grid
exec node
** So grid isn't dependent on push to deploy exactly
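(Sketch referenced above, purely for illustration: this is roughly what
the detect-and-build experience looks like with the upstream "pack" CLI;
the builder and tool names are just examples, not a proposal for what
Toolforge would actually run.)

# in a tool's source checkout; pack inspects the files present
# (requirements.txt, package.json, ...) to decide which buildpacks apply
pack build example-tool --builder paketobuildpacks/builder:base

# the result is a normal OCI image that can be run anywhere
docker run --rm example-tool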
=== Goals (Buster) ===
* Have a plan for Buster on grid engine
* Decide whether or not to have a plan for Buster on k8s (and,
optionally, have a plan)
* Decide what timeline adjustment is realistic; pick someone to
communicate this delay to the users
==== Grid Engine ====
* Why do we hate the grid?
** When jobs are being run, little isolation. Uses watcher spawned
processes + runtime hacks.
** No longer developed or supported by any upstreams
** In ~2018? we looked at modern "grids" that spawned things similarly,
but could be better managed (i.e., Slurm)
** At the time, decided k8s was the future, and decided that slurm or
similar wasn't a good idea
* Grid is important -- what could we do this year?
** Find someone to build a buster grid
** Make a mad dash at killing the grid asap.
* Giant container is unlocked now, previously limited to 2G
** 3.1G container, needed on each k8s exec node
** Only need 1 copy on each node; shared between jobs
** When building new containers, be wary of variants when deploying new
containers. Could be N x 3.1G
** Large containers aren't performant on k8s
* Buster migration
** Most of the pieces are in place
** Build out nodes, switchover
* Why the timeline?
** Organizational timeline
** Grid is sensitive to DNS changes
<discuss>
<decide>
<who>
==== k8s ====
<discuss>
<decide>
<who>
==== timeline ====
<discuss>
<decide>
<who>
* dcaro: I propose delaying any non-urgent decision until we finish with
the decision making/tech discussion email thread
** I would argue that the stretch deprecation is starting to become
urgent (although I don't have context about that email thread)
=== Current status, open projects ===
This may be the list of open projects:
* stretch-to-buster migration final push
** including but not limited to the grid
* grid engine deprecation timeline & plans
** draft here
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
* pending kubernetes upgrades (if any)
** yes, current clusters are at 1.20 and 1.23 is the latest release.
1.21 is simple (and I think it would be a good opportunity for someone
to learn the process) but 1.22 is complicated because it removes tons of
deprecated things
** 1.20 is still supported; the support policy is the last 4 releases
* toolforge build service (buildpacks)
** currently we don't really have any visibility into package versions
and available (security) upgrades in our images, or into deployed image
versions - can we improve this with buildpack images or otherwise?
https://phabricator.wikimedia.org/T291908
* toolforge jobs framework
=== Next steps, prioritization ===
* what to do next, and who
=== Long term future ===
* Share your ideas of how Toolforge should look in 5 years from now
Context:
"Galera on cloudcontrol1004 going out of sync"
https://phabricator.wikimedia.org/T302146
Galera (the database backend for OpenStack) has been very unstable ever
since I upgraded the cluster to Bullseye. This is probably an issue with
a buggy version of mariadb/galera.
I'm trying an experiment: mariadb is currently stopped on
cloudcontrol1004, and puppet disabled so it won't get restarted. I want
to see if that change (a two-node cluster and/or removing the
suspected-cursed cloudcontrol1004 from the cluster) causes things to
stop breaking.
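For the record, the change is roughly the following (standard
puppet/systemd commands; the galera status queries are just how I'm
keeping an eye on the remaining two nodes):

# on cloudcontrol1004
puppet agent --disable 'T302146: galera experiment, keep mariadb stopped'
systemctl stop mariadb

# on the other cloudcontrols, check the 2-node cluster stays healthy
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'"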
I've done my best to downtime the alerts that will result from this,
but if one leaks through please don't respond by enabling puppet on
1004 -- we want to leave that db node switched off for now.
Other services on cloudcontrol1004 should continue to run normally.
Thanks!
-Andrew
Hi there,
Apparently linkwatcher is a key tool in fighting vandalism on the wikis,
but we cannot afford to host it on ToolsDB any longer.
I request your comments / suggestions.
Timeline:
==== 8< ====
== 2019-05-22 ==
Brooke detects a high storage usage problem caused by the linkwatcher
tool in ToolsDB [0].
[0] https://phabricator.wikimedia.org/T224154
== 2019-06-26 ==
Bryan warns [1] about the linkwatcher tool potentially becoming a
disaster condition for ToolsDB.
[1] https://phabricator.wikimedia.org/T224154#5284580
== 2019-07-19 ==
After some actions are taken to move linkwatcher to its own Cloud VPS
project, activity stops and the database data remains in ToolsDB [2].
[2] https://phabricator.wikimedia.org/T227377
== 2022-02-17 ==
ToolsDB stops replicating to its secondary due to depleted storage [3].
It is discovered [4] that linkwatcher uses 1/3 of total storage of
ToolsDB, this is ~1TB out of ~3TB. Data reported in other metrics [5]
may not be accurate.
While it may not be the primary cause of the problem, the "disaster
condition" mentioned by Bryan 3 years earlier is becoming more apparent.
[3] https://phabricator.wikimedia.org/T301951
[4] https://phabricator.wikimedia.org/T301967
[5] https://tool-db-usage.toolforge.org/
==== 8< ====
regards
--
Arturo Borrero Gonzalez
Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation