After talking with both Arturo and Birgit about things we might
present at Wikimania, I came up with this abstract for a talk:
Co-creating platforms and products: how the Wikimedia Cloud Services
team works with the larger Wikimedia technical community to build and
maintain Cloud VPS, Toolforge, Quarry, PAWS, and more
Did you know that volunteers are involved in planning, building, and
maintaining the Cloud VPS and Toolforge projects as co-equals with
paid staff from the Wikimedia Foundation? Since the start of the
"Labs" project in 2011, one of the guiding principles for WMCS
projects has been improving collaboration between Foundation staff and
technical volunteers. Learn more about some of the policies and
practices that are used to make this collaboration possible.
The submission would be under either the "governance" or "technology"
tracks. I think it would work best as a panel discussion that is
either "hybrid" (some folks in Singapore, some on-line) or
I think this is something that folks in the community might be
interested in learning a bit about. I also think it would be
interesting for those of us who have participated in this process to
take some time to reflect on how we have worked together in the past
and how we might like to see those processes and practices
evolve in the future. To make this talk work well there should be
active voices from both the paid and volunteer staff involved. Towards
that end, I'm mailing the cloud-admin@ list + 4 of you that I know
have been active in the past in helping with Toolforge and/or Cloud
VPS admin and features work to gauge your interest in participating.
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
This contains information on the projects that I have been working on lately and
how to continue work if necessary for any continuity reasons. I figured I would
send an email because I'm going to be on PTO (vacations) for the next 5 weeks. I
will be back to work on 2023-09-04.
** cloudlb project @ eqiad1. Current status is that cloudcontrol1005 is
already in the new setup. We are waiting for cloudservices1006 to be racked
before doing anything else. Cloudservices1006 should go directly into the new
.eqiad.wmnet setup with cloud-private and BGP-based VIPs et al. I bootstrapped
the config (even if the server is not there yet) in a gerrit patch, to be
merged when the server is in place.
The idea is that cloudcontrol1005+cloudservices1006 will make for a tiny
openstack control plane, enough for us to move the
'openstack.eqiad1.wikimediacloud.org' endpoint to cloudlb1001/cloudlb1002 (BGP
VIP). Also, to move all the VMs to the new DNS servers.
Once this endpoint is migrated, we should decom / rerack / rename / rerack the
rest of cloudcontrols and cloudservices1005 according to the plans,
therefore scaling up the newer cloud-private-based control plane.
This is exactly the codfw1dev setup, and it works really well as a setup
verification / comparison.
This is a quarter KR/goal. You (Andrew?) should feel free to take over this one
or wait until I'm back. Cathal @ NetOps knows this project in depth, including
the network implementation details, and should be available to help and assist.
The work here is exciting and will also open the door to work on the kubernetes
undercloud (see below).
** Toolforge build service. I was trying to get this patch developed, but
got into the rabbit hole of making the buildservice development environment
setup reproducible using lima-kilo. This went surprisingly well, and the
only missing bit is what to do with harbor, which may not even be a blocker
to develop the builds API code itself. This also touches on the helm vs secrets
Bonus point is that lima-kilo also helped me migrate the jobs-framework-emailer
to the new toolforge CI/CD setup "easily".
Looking forward to continuing work in this space, but David/Raymond should feel
free to take over as required / desired.
** kubernetes undercloud @ codfw. Not much here at the moment, but we are
approaching the point at which hardware will be available in codfw for us to
start playing. Cathal, Nicholas and I already shared a few comments on IRC
about switches, rack footprint, etc. I believe if we keep pushing in the right
direction, we may hit the ground running and get a POC bootstrapped in September?
To be clear: DON'T use the hardware to refresh codfw1dev. Think first if we can
use it to build codfw2dev (or whatever the name). Or maybe refresh codfw1dev but
don't decomm the replaced hardware just yet. We'll need a buffer of hardware to
play with kubernetes and openstack-helm.
We don't even have an explicit KR/Goal for this in this quarter, but it is
definitely worth keeping on the radar given the hardware is arriving soon.
** Toolforge kubernetes upgrade to 1.23. This is a quarter KR / goal, but
nothing has been done in this space. Taavi should feel free to continue without me if
desired / required.
** Finally, you can check my personal phabricator workboard. I track all my
tasks in there.
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
This is a summary of the incident that happened over the last few days related to
the buildservice and the distroless docker image. This incident is considered solved
and the information is left here for the record and future reference.
* As mentioned in phabricator, the buildservice's tekton pipeline requires a
distroless docker image that we copy from upstream into our internal docker
registry.
* Because of a comment in the upstream tekton code mentioning CRI-O support
problems, the image was being used without a :tag but with a digest reference.
We don't use CRI-O in our k8s clusters anyway.
* During a maintenance operation on our internal docker registry, this
tag-less docker image was removed.
* The k8s clusters and the Toolforge buildservice kept working normally because
the image was cached locally in each k8s worker node's docker image cache.
* I detected the missing image when trying to replicate the buildservice setup
on my local machine using lima-kilo. The buildservice couldn't complete an image
build because of the missing distroless image.
* I tried different strategies to bring back the missing docker image, including:
** upload a new image copying from the upstream distroless one. The cookbook
worked, but since the image lacked the :debug tag, the buildservice could not work.
** rescue the image from a hot cache in a worker node. This worked, I could
rescue the image, but the docker registry would remain in an inconsistent state
after injecting it, meaning that the image could not be pulled.
* Since none of the above attempts yielded a combination of docker
registry state and buildservice configuration that could work, I decided to
inject a new image called `toolforge-distroless-base-debug:latest`. Note the
`-debug` suffix and the `:latest` tag.
* This solved the problem: the docker registry started to happily serve the new
image (and the old one, WTH??) and it was a valid image for the buildservice.
* I think we can leave the buildservice configured like this, pulling the
explicit :debug tag in the upstream image and pushing with -debug:latest to our
internal registry. This means two things:
** we don't have to manually bump the digest in the buildservice code every time
the upstream distroless image changes
** changes in the upstream distroless image could go unnoticed.
* We can re-evaluate this situation later, but I consider the current state to
be stable enough as of this writing.
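
To make the tag vs. digest distinction above concrete, here is a minimal
Python sketch that splits an image reference into repository, tag and digest,
and flags the "tag-less, digest-only" case that bit us. The helper names
(`parse_image_ref`, `is_tagless`) and the `registry.example` hostname are
hypothetical and only for illustration; this is not code from the buildservice
or from the docker registry tooling.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    repository: str
    tag: Optional[str]
    digest: Optional[str]


def parse_image_ref(ref: str) -> ImageRef:
    """Split 'repository[:tag][@digest]' into its parts (simplified grammar)."""
    digest = None
    if "@" in ref:
        # Everything after '@' is the digest, e.g. 'sha256:...'.
        ref, digest = ref.split("@", 1)
    tag = None
    # A ':' after the last '/' separates the tag; an earlier ':' is a registry port.
    last_slash = ref.rfind("/")
    last_colon = ref.rfind(":")
    if last_colon > last_slash:
        ref, tag = ref[:last_colon], ref[last_colon + 1:]
    return ImageRef(ref, tag, digest)


def is_tagless(ref: str) -> bool:
    """True when the image is referenced by digest only, without any tag."""
    parsed = parse_image_ref(ref)
    return parsed.tag is None and parsed.digest is not None


# Hypothetical registry hostname; the image name is the one injected above.
print(parse_image_ref("registry.example/toolforge-distroless-base-debug:latest"))
print(is_tagless("registry.example/distroless/base@sha256:0123abcd"))
```

A check like this could, in principle, be run over the image references in a
configuration to list the ones that only exist as a digest before anyone
cleans up the registry.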
I have detected that more modern tekton versions change this distroless image
yet again, so we should do a careful evaluation when doing a future migration.
Another takeaway is to be very careful when operating on the docker registry
content, because it is really easy to have a mishap and extremely difficult to
recover. It may be better to treat the docker registry as an append-only DB.
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services