Hi,
this is a summary of the incident that happened over the past few days, related
to the buildservice and the distroless docker image. The incident is considered
solved, and the information is left here for the record and for future reference.
* As mentioned in phabricator [0], the buildservice's tekton pipeline requires a
distroless docker image that we copy from upstream into our internal docker
registry.
* Because of a comment in the upstream tekton code mentioning CRI-O support
problems, the image was being referenced by digest, without a :tag. We don't
use CRI-O in our k8s clusters anyway.
* During a maintenance operation on our internal docker registry [1], this
tag-less docker image was removed.
* The k8s clusters and the Toolforge buildservice kept working normally because
the image was cached locally in the docker image cache of each k8s worker node.
* I detected the missing image when trying to replicate the buildservice setup
on my local machine using lima-kilo. The buildservice couldn't complete an image
build because of the missing distroless image.
* I tried different strategies to bring back the missing docker image, including:
** upload a new image copied from the upstream distroless one. The cookbook
worked, but since the image lacked the :debug tag, the buildservice could not
use it.
** rescue the image from a hot cache in a worker node. This worked in the sense
that I could rescue the image, but the docker registry remained in an
inconsistent state after injecting it, meaning that the image could not be pulled.
* After the above attempts, I couldn't find any combination of docker registry
state and buildservice configuration that would work, so I decided to inject a
new image called `toolforge-distroless-base-debug:latest`. Note the `-debug`
suffix and the `:latest` tag.
* This solved the problem: the docker registry started to happily serve the new
image (and the old one, WTH??), and it was a valid image for the buildservice.
* I think we can leave the buildservice configured like this, pulling the
explicit :debug tag from the upstream image and pushing it as -debug:latest to
our internal registry. This means two things:
** we don't have to manually bump the digest in the buildservice code every time
the upstream distroless image changes
** changes in the upstream distroless image could go unnoticed.
* We can re-evaluate this situation later, but I consider the current state to
be stable enough as of this writing.
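To illustrate the difference between the tag-less digest reference and the new tagged one, here is a minimal Python sketch that splits an image reference into repository, tag and digest. The function names, the `upstream.example` host and the all-zero digest are made up for the example; they are not taken from the buildservice or tekton code.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    repository: str
    tag: Optional[str]
    digest: Optional[str]


def parse_image_ref(ref: str) -> ImageRef:
    """Split 'repository[:tag][@digest]' into its parts."""
    digest = None
    if "@" in ref:
        ref, digest = ref.split("@", 1)
    tag = None
    # A ':' only marks a tag when it comes after the last '/'; otherwise
    # it belongs to a 'host:port' registry prefix.
    last_slash = ref.rfind("/")
    last_colon = ref.rfind(":")
    if last_colon > last_slash:
        ref, tag = ref[:last_colon], ref[last_colon + 1:]
    return ImageRef(ref, tag, digest)


def is_tagless(ref: str) -> bool:
    """True for references pinned by digest only, with no tag."""
    parsed = parse_image_ref(ref)
    return parsed.tag is None and parsed.digest is not None


# Hypothetical references, for illustration only.
FAKE_DIGEST = "sha256:" + "0" * 64
old_style = f"upstream.example/distroless/base@{FAKE_DIGEST}"
new_style = "toolforge-distroless-base-debug:latest"
```

With these inputs, `is_tagless(old_style)` is true and `is_tagless(new_style)` is false, so a check like this could flag digest-only images before a registry cleanup touches them.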
I have noticed that more recent tekton versions change this distroless image
yet again, so we should do a careful evaluation when doing a future migration.
Another takeaway is to be very careful when operating on the docker registry
content, because it is really easy to have a mishap and extremely difficult to
recover. It may be better to treat the docker registry as an append-only DB.
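Since the worker node caches hid the missing image for a while, one possible safeguard is to ask the registry directly whether it still serves a manifest. The sketch below uses the standard registry HTTP API (`/v2/<name>/manifests/<reference>`) and assumes an unauthenticated registry reachable over HTTPS; the `registry.example.org` host and the function names are placeholders, not our actual setup or tooling.

```python
import urllib.error
import urllib.request

# Manifest media types a registry may answer with (Docker v2 and OCI).
MANIFEST_TYPES = ", ".join(
    [
        "application/vnd.docker.distribution.manifest.v2+json",
        "application/vnd.docker.distribution.manifest.list.v2+json",
        "application/vnd.oci.image.manifest.v1+json",
        "application/vnd.oci.image.index.v1+json",
    ]
)


def manifest_url(registry: str, repository: str, reference: str) -> str:
    """Build the manifest URL; 'reference' is a tag or a digest."""
    return f"https://{registry}/v2/{repository}/manifests/{reference}"


def is_pullable(
    registry: str, repository: str, reference: str, timeout: float = 10.0
) -> bool:
    """Return True if the registry serves the manifest, False on 404."""
    request = urllib.request.Request(
        manifest_url(registry, repository, reference),
        method="HEAD",
        headers={"Accept": MANIFEST_TYPES},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise  # auth or server errors need a human to look at them


# Hypothetical registry host, for illustration only.
example = manifest_url(
    "registry.example.org", "toolforge-distroless-base-debug", "latest"
)
```

Running something like `is_pullable(...)` for every image the buildservice depends on, before and after a registry maintenance operation, would catch this kind of breakage without relying on the node caches.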
regards.
[0]
https://phabricator.wikimedia.org/T321188
[1]
https://sal.toolforge.org/log/1guHQYkBhuQtenzv9YOl
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation