Hi,
this is a summary of the incident that happened over the last few days involving the buildservice and the distroless docker image. The incident is considered solved; the information is left here for the record and for future reference.
* As mentioned in phabricator [0], the buildservice's tekton pipeline requires a distroless docker image that we copy from upstream into our internal docker registry.
* Because of a comment in the upstream tekton code mentioning CRI-O support problems, the image was being referenced by digest instead of by :tag. We don't use CRI-O in our k8s clusters anyway.
* During a maintenance operation on our internal docker registry [1], this tag-less docker image was removed.
* The k8s clusters and the Toolforge buildservice kept working normally because the image was cached locally in the docker image cache of each k8s worker node.
* I detected the missing image when trying to replicate the buildservice setup on my local machine using lima-kilo. The buildservice couldn't complete an image build because of the missing distroless image.
* I tried different strategies to bring back the missing docker image, including:
** uploading a new image copied from the upstream distroless one. The cookbook worked, but since the image lacked the :debug tag, the buildservice could not work.
** rescuing the image from a hot cache in a worker node. This worked in the sense that I could rescue the image, but the docker registry remained in an inconsistent state after injecting it, meaning that the image could not be pulled.
* After the above operations, I couldn't find any combination of docker registry state and buildservice configuration that would work, so I decided to inject a new image called `toolforge-distroless-base-debug:latest`. Note the `-debug` suffix and the `:latest` tag.
* This solved the problem: the docker registry started to happily serve the new image (and the old one too, WTH??) and it was a valid image for the buildservice.
* I think we can leave the buildservice configured like this, pulling the explicit :debug tag of the upstream image and pushing it as -debug:latest to our internal registry. This means two things:
** we don't have to manually bump the digest in the buildservice code every time the upstream distroless image changes.
** changes in the upstream distroless image could go unnoticed.
* We can re-evaluate this situation later, but I consider the current state to be stable enough as of this writing.
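For the record, the difference between a tag reference and a digest-only ("tag-less") reference, which is at the core of this incident, can be sketched in Python. This is only an illustration and not code from the buildservice or the cookbooks: the helper names, the `registry.example.org` host, the `distroless-base` image name and the shortened digest value are all made-up placeholders.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    repository: str
    tag: Optional[str]
    digest: Optional[str]


def parse_image_ref(ref: str) -> ImageRef:
    """Split 'repository[:tag][@digest]' into its parts (simplified)."""
    digest = None
    if "@" in ref:
        ref, digest = ref.split("@", 1)
    tag = None
    # A colon only marks a tag when it comes after the last slash;
    # an earlier colon belongs to a registry host:port.
    last_slash = ref.rfind("/")
    last_colon = ref.rfind(":")
    if last_colon > last_slash:
        ref, tag = ref[:last_colon], ref[last_colon + 1:]
    return ImageRef(ref, tag, digest)


def is_tagless(ref: str) -> bool:
    """True when the image is pinned by digest only, with no tag."""
    parsed = parse_image_ref(ref)
    return parsed.tag is None and parsed.digest is not None


if __name__ == "__main__":
    # Hypothetical references, for illustration only.
    examples = [
        "registry.example.org/distroless-base@sha256:0123abcd",
        "registry.example.org/toolforge-distroless-base-debug:latest",
    ]
    for example in examples:
        print(example, "tag-less" if is_tagless(example) else "tagged")
```

A digest-only reference pins the exact image content, so every upstream change requires a manual bump. A tag such as `:latest` follows whatever was pushed most recently under that name, which is the trade-off described in the list above.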
I have noticed that newer tekton versions change this distroless image yet again, so we should do a careful evaluation when doing a future migration.
Another takeaway is to be very careful when operating on the docker registry content, because it is really easy to have a mishap and extremely difficult to recover from one. It may be better to treat the docker registry as an append-only DB.
regards.
[0] https://phabricator.wikimedia.org/T321188
[1] https://sal.toolforge.org/log/1guHQYkBhuQtenzv9YOl
cloud-admin@lists.wikimedia.org