Cloud-admin July 2023

cloud-admin@lists.wikimedia.org

2 participants
4 discussions

Second Round of Build Service Testing
by Seyram Komla Sapaty 04 Sep '23

Hello Admins,

As communicated earlier, we have put together a list of about 100 tools whose maintainers we propose to invite to the next round of testing. This expanded list now includes tools written in languages other than Python. You can see the list here [0].

The features requested in earlier feedback and suggestions, custom and secret environment variable support [1] and package installation for the build service [2], have now all been rolled out to Toolforge.

If no changes are requested, the new invites will be sent on the 23rd of Jun. Kindly reach out if you have any questions or feedback.

Thank you!

[0] https://etherpad.wikimedia.org/p/second-round
[1] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Envvars_Service
[2] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Install_ap…

Help with a Wikimania presentation on how Cloud Services governance works?
by Bryan Davis 11 Aug '23

After talking with both Arturo and Birgit about things we might present at Wikimania, I came up with this abstract for a talk:

Co-creating platforms and products: how the Wikimedia Cloud Services team works with the larger Wikimedia technical community to build and maintain Cloud VPS, Toolforge, Quarry, PAWS, and more

Did you know that volunteers are involved in planning, building, and maintaining the Cloud VPS and Toolforge projects as co-equals with paid staff from the Wikimedia Foundation? Since the start of the "Labs" project in 2011, one of the guiding principles for WMCS projects has been improving collaboration between Foundation staff and technical volunteers. Learn more about some of the policies and practices that are used to make this collaboration possible.

The submission would be under either the "governance" or "technology" track. I think it would work best as a panel discussion that is either "hybrid" (some folks in Singapore, some on-line) or a pre-recorded video.

I think this is something that folks in the community might be interested in learning a bit about. I also think it would be interesting for those of us who have participated in this process to take some time to reflect on how we have worked together in the past and how we might like to see those processes and practices evolve in the future.

To make this talk work well, there should be active voices from both the paid and volunteer staff involved. To that end, I'm mailing the cloud-admin@ list plus 4 of you who I know have been active in the past in helping with Toolforge and/or Cloud VPS admin and feature work, to gauge your interest in participating.

Thoughts?

Bryan
--
Bryan Davis
Technical Engagement, Wikimedia Foundation
Principal Software Engineer
Boise, ID USA
[[m:User:BDavis_(WMF)]]
irc: bd808

Arturo's projects (vacations continuity plan)
by Arturo Borrero Gonzalez 28 Jul '23

Hi there,

This email describes the projects I have been working on lately and how to continue the work if that becomes necessary for continuity reasons. I am sending it because I'm going to be on PTO (vacations) for the next 5 weeks. I will be back at work on 2023-09-04.

** cloudlb project @ eqiad1 [1]

The current status is that cloudcontrol1005 is already in the new setup. We are waiting for cloudservices1006 to be racked [2] before doing anything else. Cloudservices1006 should go directly into the new .eqiad.wmnet setup with cloud-private, BGP-based VIPs, et al. I bootstrapped the config (even though the server is not there yet) in a gerrit patch [3], to be merged when the server is in place.

The idea is that cloudcontrol1005 + cloudservices1006 will make for a tiny openstack control plane, enough for us to move the 'openstack.eqiad1.wikimediacloud.org' endpoint to cloudlb1001/cloudlb1002 (BGP VIP), and also to move all the VMs to the new DNS servers [4]. Once this endpoint is migrated, we should decom / rerack / rename the rest of the cloudcontrols and cloudservices1005 according to the plans [5], thereby scaling up the newer cloud-private-based control plane. This is exactly the codfw1dev setup, which works really well as a setup verification / comparison.

This is a quarter KR/goal. You (Andrew?) should feel free to take over this one or wait until I'm back. Cathal @ NetOps knows this project in depth, including the network implementation details, and should be available to help and assist as required. The work here is exciting and will also open the door to work on the kubernetes undercloud [10] (see below).

** Toolforge build service [6]

I was trying to get this patch developed, but went down the rabbit hole of making the buildservice development environment setup reproducible using lima-kilo [7]. This went surprisingly well, and the only missing bit is what to do with harbor [8], which may not even be a blocker for developing the builds API code itself. This also touches on the helm vs secrets problem. As a bonus, lima-kilo also helped me migrate the jobs-framework-emailer to the new toolforge CI/CD setup "easily" [9]. I'm looking forward to continuing work in this space, but David/Raymond should feel free to take over as required / desired.

** kubernetes undercloud @ codfw [10]

Not much here at the moment, but we are approaching the point at which hardware will be available in codfw for us to start playing [11]. Cathal, Nicholas and I already shared a few comments on IRC about switches, rack footprint, etc. I believe that if we keep pushing in the right direction, we may hit the ground running and get a POC bootstrapped in September?

To be clear: DON'T use the hardware to refresh codfw1dev. Think first about whether we can use it to build codfw2dev (or whatever the name). Or maybe refresh codfw1dev but don't decomm the replaced hardware just yet. We'll need a buffer of hardware to play with kubernetes and openstack-helm. We don't have an explicit KR/goal for this in this quarter, but it is definitely worth keeping on the radar given that the hardware is arriving soon.

** Toolforge kubernetes upgrade to 1.23 [12]

This is a quarter KR/goal, but nothing has been done in this space. Taavi should feel free to continue without me if desired / required.

** Finally, you can check my personal phabricator workboard [0]. I track all my tasks there.

regards.

[0] https://phabricator.wikimedia.org/tag/user-aborrero/
[1] https://phabricator.wikimedia.org/T341060
[2] https://phabricator.wikimedia.org/T342161
[3] https://gerrit.wikimedia.org/r/c/941383
[4] https://phabricator.wikimedia.org/T342621
[5] https://phabricator.wikimedia.org/T341494
[6] https://phabricator.wikimedia.org/T340031
[7] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/lima-…
[8] https://phabricator.wikimedia.org/T342853
[9] https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-framework-emailer
[10] https://phabricator.wikimedia.org/T342750
[11] https://phabricator.wikimedia.org/T342456
[12] https://phabricator.wikimedia.org/T298005

--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation

Buildservice distroless image incident
by Arturo Borrero Gonzalez 20 Jul '23

Hi,

this is a summary of the incident that happened over the past few days involving the buildservice and the distroless docker image. The incident is considered solved; the information is left here for the record and for future reference.

* As mentioned in phabricator [0], the buildservice's tekton pipeline requires a distroless docker image that we copy from upstream into our internal docker registry.
* Because of a comment in the upstream tekton code mentioning CRI-O support problems, the image was being referenced by digest rather than by :tag. We don't use CRI-O in our k8s clusters anyway.
* During a maintenance operation on our internal docker registry [1], this tag-less docker image was removed.
* The k8s clusters and the Toolforge buildservice kept working normally because the image was cached locally in the docker image cache of each k8s worker node.
* I detected the missing image when trying to replicate the buildservice setup on my local machine using lima-kilo. The buildservice couldn't complete an image build because of the missing distroless image.
* I tried different strategies to bring back the missing docker image, including:
** uploading a new image copied from the upstream distroless one. The cookbook worked, but since the image lacked the :debug tag, the buildservice could not work.
** rescuing the image from a hot cache on a worker node. This worked in that I could rescue the image, but the docker registry remained in an inconsistent state after I injected it, meaning that the image could not be pulled.
* After the above operations, I couldn't find any combination of docker registry state and buildservice configuration that worked, so I decided to inject a new image called `toolforge-distroless-base-debug:latest`. Note the `-debug` suffix and the `:latest` tag.
* This solved the problem: the docker registry started to happily serve the new image (and the old one, WTH??) and it was a valid image for the buildservice.
* I think we can leave the buildservice configured like this, pulling the explicit :debug tag of the upstream image and pushing it as -debug:latest to our internal registry. This means two things:
** we don't have to manually bump the digest in the buildservice code every time the upstream distroless image changes
** changes in the upstream distroless image could go unnoticed.
* We can re-evaluate this situation later, but I consider the current state to be stable enough as of this writing.

I have noticed that more modern tekton versions change this distroless image yet again, so we should do a careful evaluation when doing a future migration.

Another takeaway is to be very careful when operating on the docker registry content, because it is really easy to have a mishap and extremely difficult to recover. Maybe it is better to treat the docker registry as an append-only DB.

regards.

[0] https://phabricator.wikimedia.org/T321188
[1] https://sal.toolforge.org/log/1guHQYkBhuQtenzv9YOl

--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation
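For readers following the incident summary above, the difference between a tag reference and a digest reference, and the naming scheme that ended up working, can be sketched roughly as follows. This is only an illustration, not the actual cookbook or buildservice code: the helper names `parse_image_ref` and `internal_name`, the upstream reference `gcr.io/distroless/base:debug`, and the registry host `registry.example.org` are all assumptions made for the example. Only the resulting name `toolforge-distroless-base-debug:latest` comes from the email.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    repository: str         # e.g. "gcr.io/distroless/base" (assumed upstream name)
    tag: Optional[str]      # e.g. "debug"; None when the reference has no tag
    digest: Optional[str]   # e.g. "sha256:..."; None when the reference has no digest


def parse_image_ref(ref: str) -> ImageRef:
    """Split a 'repo[:tag][@digest]' image reference into its parts."""
    digest = None
    if "@" in ref:
        ref, digest = ref.split("@", 1)
    tag = None
    # Only a ':' after the last '/' separates a tag; an earlier one is a registry port.
    last_segment = ref.rsplit("/", 1)[-1]
    if ":" in last_segment:
        ref, tag = ref.rsplit(":", 1)
    return ImageRef(ref, tag, digest)


def internal_name(upstream: str, registry: str,
                  prefix: str = "toolforge-distroless-") -> str:
    """Hypothetical helper: fold the upstream tag into the image name, push as ':latest'."""
    parsed = parse_image_ref(upstream)
    if parsed.tag is None:
        # A digest-only (tag-less) reference is what got lost in the incident.
        raise ValueError(f"refusing to mirror tag-less reference: {upstream}")
    base = parsed.repository.rsplit("/", 1)[-1]
    return f"{registry}/{prefix}{base}-{parsed.tag}:latest"


if __name__ == "__main__":
    # registry.example.org is a placeholder, not the real internal registry.
    print(internal_name("gcr.io/distroless/base:debug", "registry.example.org"))
```

Refusing tag-less references in this sketch mirrors the email's observation that a digest-only image is easy to lose during registry maintenance. The trade-off noted in the email still applies: with `:latest`, a changed upstream image will not be noticed unless digests are compared separately.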
