There have been some user-facing DNS issues today. DNS is confusing and
I can't claim that I fully understand everything here, but here's the
best explanation/summary I have at the moment.
BACKGROUND
First, unlike what we previously thought,
ns0/1.openstack.eqiad1.wikimediacloud.org. have glue records stored in
the .org registry. This is how Brandon explained it to me:
<taavi> bblack: but I still don't follow why that needs to be in
markmonitor. the affected domains use
ns0/1.openstack.eqiad1.wikimediacloud.org as the auth dns servers, and
wikimediacloud.org uses ns0/1/2.wikimedia.org
<bblack> taavi: topranks: delegation of NS authority flows down the
namespace tree, not the tree of which domains "depend" on which in the
logical sense, that's why the markmonitor part matters.
<bblack> if you start from zero knowledge (cold dns cache), you start at
the root servers to find the .org servers, you ask the org servers about
<whatever>.org, and if the NS record is /also/ anywhere within .org,
even a different <some-other-thing>.org, then the .org nameservers must
serve the glue address
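To see this in practice, something like the following should show the
delegation and glue straight from a .org TLD server (a0.org.afilias-nst.info
is one of the public .org nameservers; any of them should work):
```
# Ask a .org TLD server directly; the NS delegation comes back in the
# authority section and the glue A records in the additional section
dig +norecurse @a0.org.afilias-nst.info wikimediacloud.org. NS
```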
This means that changing operations/dns.git is not good enough for
updates to those specific addresses. This is what .org servers had until
today:
ns0.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.135
ns1.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.11
For comparison, this is what the zone files on ns0/1/2.wikimedia.org
had, again before all of the cloudlb maintenance started:
ns0.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.148 ;
cloudservices1005
ns1.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.11 ;
cloudservices1004
You may notice that the record for ns0 is different: 208.80.154.135 has
pointed to gerrit1003 since March (according to Netbox). So only one of
the two name servers in our glue records was working in the first place.
The tricky part here is that different resolvers seem to be using
different sources for the nsX.openstack records, presumably due to
caches at various levels.
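One way to observe this is to compare what different recursors currently
return for the same name, for example (two public resolvers, any would do):
```
# Compare cached answers from two different public resolvers
dig +short ns1.openstack.eqiad1.wikimediacloud.org @8.8.8.8
dig +short ns1.openstack.eqiad1.wikimediacloud.org @1.1.1.1
```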
BREAKAGE
As part of the cloudlb introduction, the AuthDNS addresses are being
moved to VIPs (185.15.56.162 and 185.15.56.163). cloudservices1006, the
first node in the new setup, is now serving the new ns1 address (.163).
ns1.openstack was changed in the wikimediacloud.org zone files, but the
glue records in .org remained unchanged.
However, the old ns1 address was the only working glue record, so taking
down cloudservices1004 (the old ns1) broke clients that were relying on
the glue information. While things seem to have kept working for the
majority of users, several people did come to ask about these issues, so
there was some impact.
FIXES SO FAR
Two things were done this evening to fix the immediate issues:
* First, Rob H from the dc-ops team (and one of the few people who can
update our domain registrar) sent a message asking for the data in the
.org registry to be updated to match the current state of the
wikimediacloud.org zone files.
https://phabricator.wikimedia.org/T346177#9161417
* Second, Cathal applied some network-level hacks to make the old ns1
address answer queries again.
https://phabricator.wikimedia.org/T346177#9161474
NEXT STEPS
I think the steps to complete this migration without any further user
impact are roughly the following:
1. Make cloudservices1006 also answer queries for 185.15.56.162 (new ns0).
2. Update both the wikimediacloud.org zone file and the .org glue
records to reference .162 as the ns0 record.
3. Wait for all of the DNS TTLs for ns0 to expire (the dig sketch below
can help check this).
4. Revert the routing hacks for 208.80.154.11. Also remove the Netbox
record for it.
5. Move cloudservices1005 to the cloudlb network setup.
6. Move .162 (new ns0) from cloudservices1006 to 1005.
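A rough way to check steps 2 and 3 (the public resolver below is just an
example; the TTL on a cached answer counts down towards expiry):
```
# Authoritative answer: the TTL column shows the configured record TTL
dig ns0.openstack.eqiad1.wikimediacloud.org @ns0.wikimedia.org
# Cached answer from a recursor: the TTL column counts down to expiry
dig ns0.openstack.eqiad1.wikimediacloud.org @8.8.8.8
```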
Taavi
Hi,
today, 2023-09-11, we will be conducting some internal Cloud VPS DNS service
operations:
* change the DNS recursor of every virtual machine running in Cloud VPS from
208.80.154.143 and 208.80.154.24 to 172.20.255.1 (this is traditionally set via
/etc/resolv.conf; a sketch of the change follows this list)
* change the real server behind the authoritative DNS server
ns1.openstack.eqiad1.wikimediacloud.org, including its IP address, from
208.80.154.11 to 185.15.56.163
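For reference, the change in /etc/resolv.conf on each VM looks essentially like
this (any other lines in the file stay untouched):
```
# /etc/resolv.conf, before:
#   nameserver 208.80.154.143
#   nameserver 208.80.154.24
# after:
nameserver 172.20.255.1
```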
This may briefly affect some virtual machines, but the new DNS servers have been
running for a while already and we are not anticipating any major impact (famous
last words?).
Please report any problems you may find.
Some phabricator tickets tracking this work are:
* https://phabricator.wikimedia.org/T345240 cloudservices1006: put into service
* https://phabricator.wikimedia.org/T346033 cloudservices1004: decommission
* https://phabricator.wikimedia.org/T342621 eqiad1: cloudlb: transition DNS
clients (VMs) to the new BGP-based recursor VIP
regards.
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation
Hello Admins,
As communicated earlier, we have put together a list of about 100 tools
whose maintainers we propose to invite in the next round of testing.
This expanded list now includes tools written in languages other than
Python.
You can see the list here[0].
Custom and secret environment variable support[1] and package installation
for the buildservice[2], both requested in earlier feedback, have now been
successfully rolled out to Toolforge.
If no changes are requested, the new invites will be sent on the 23rd
of June.
Kindly reach out if you have any questions or feedback.
Thank you!
[0] https://etherpad.wikimedia.org/p/second-round
[1]
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Install_ap…
[2] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Envvars_Service
After talking with both Arturo and Birgit about things we might
present at Wikimania, I came up with this abstract for a talk:
Co-creating platforms and products: how the Wikimedia Cloud Services
team works with the larger Wikimedia technical community to build and
maintain Cloud VPS, Toolforge, Quarry, PAWS, and more
Did you know that volunteers are involved in planning, building, and
maintaining the Cloud VPS and Toolforge projects as co-equals with
paid staff from the Wikimedia Foundation? Since the start of the
"Labs" project in 2011, one of the guiding principles for WMCS
projects has been improving collaboration between Foundation staff and
technical volunteers. Learn more about some of the policies and
practices that are used to make this collaboration possible.
The submission would be under either the "governance" or "technology"
tracks. I think it would work best as a panel discussion that is
either "hybrid" (some folks in Singapore, some on-line) or
pre-recorded video.
I think this is something that folks in the community might be
interested in learning a bit about. I also think it would be
interesting for those of us who have participated in this process to
take some time to reflect on how we have worked together in the past
and how we might like to see those processes and practices
evolve in the future. To make this talk work well there should be
active voices from both the paid and volunteer staff involved. Towards
that end, I'm mailing the cloud-admin@ list + 4 of you that I know
have been active in the past in helping with Toolforge and/or Cloud
VPS admin and features work to gauge your interest in participating.
Thoughts?
Bryan
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
Hi!
Given that in gitlab we can't add more than one reviewer, and to help the
flow of code reviews, I have started adding the label `Ready for review`
to indicate that a patch needs someone else's attention, and removing the
label to indicate that the author's attention is needed (changes required,
questions to answer, or MR accepted).
I'm going to keep using that flow as long as it works, and I think it would
also be useful for anyone else who wants my or others' attention on their
MRs.
I'm also going over all the MRs in the following search relatively often to
review/help with code reviews; feel free to do the same:
https://gitlab.wikimedia.org/groups/repos/cloud/-/merge_requests?scope=all&…
If you use it and think it's working well for most of us, we can
document it as a best practice for the repos in gitlab.
Cheers!
Hi there,
This contains information on the projects that I have been working on lately,
and on how to continue the work if necessary for continuity reasons. I figured
I would send an email because I'm going to be on PTO (vacation) for the next 5
weeks. I will be back to work on 2023-09-04.
** cloudlb project @ eqiad1 [1]
Current status is that cloudcontrol1005 is already in the new setup. We are
waiting for cloudservices1006 to be racked [2] before doing anything else.
Cloudservices1006 should go directly into the new .eqiad.wmnet setup, with
cloud-private and BGP-based VIPs and all. I bootstrapped the config (even
though the server is not there yet) in a gerrit patch [3], to be merged when
the server is in place.
The idea is that cloudcontrol1005+cloudservices1006 will make for a tiny
openstack control plane, enough for us to move the
'openstack.eqiad1.wikimediacloud.org' endpoint to cloudlb1001/cloudlb1002 (BGP
VIP), and to move all the VMs to the new DNS servers [4].
Once this endpoint is migrated, we should decom / rename / rerack the rest of
the cloudcontrols and cloudservices1005 according to the plans [5], thereby
scaling up the newer cloud-private-based control plane.
This is exactly the codfw1dev setup, and it works really well as a setup
verification / comparison.
This is a quarterly KR/goal. You (Andrew?) should feel free to take over this
one or wait until I'm back. Cathal @ NetOps knows this project in depth,
including the network implementation details, and should be available to help
and assist as required.
The work here is exciting and will also open the door to work on the kubernetes
undercloud [10] (see below).
** Toolforge build service [6]
I was trying to get this patch developed, but got into the rabbit hole of
making the buildservice development environment setup reproducible using
lima-kilo [7]. This went surprisingly well, and the only missing bit is what to
do with harbor [8], which may not even be a blocker for developing the builds
API code itself. This also touches on the helm vs secrets problem.
A bonus is that lima-kilo also helped me migrate the jobs-framework-emailer
to the new toolforge CI/CD setup "easily" [9].
I look forward to continuing work in this space, but David/Raymond should feel
free to take over as required / desired.
** kubernetes undercloud @ codfw [10]
Not much here at the moment, but we are approaching the point at which hardware
will be available in codfw for us to start playing [11]. Cathal, Nicholas and I
already shared a few comments on IRC about switches, rack footprint, etc. I
believe that if we keep pushing in the right direction, we may hit the ground
running and get a POC bootstrapped in September?
To be clear: DON'T use the hardware to refresh codfw1dev. Think first about
whether we can use it to build codfw2dev (or whatever the name ends up being).
Or maybe refresh codfw1dev but don't decommission the replaced hardware just
yet. We'll need a buffer of hardware to play with kubernetes and openstack-helm.
We don't even have an explicit KR/goal for this in this quarter, but it is
definitely worth keeping on the radar given the hardware is arriving soon.
** Toolforge kubernetes upgrade to 1.23 [12]
This is a quarterly KR/goal, but nothing has been done in this space yet.
Taavi should feel free to continue without me if desired / required.
** Finally, you can check my personal phabricator workboard [0]. I track all my
tasks in there.
regards.
[0] https://phabricator.wikimedia.org/tag/user-aborrero/
[1] https://phabricator.wikimedia.org/T341060
[2] https://phabricator.wikimedia.org/T342161
[3] https://gerrit.wikimedia.org/r/c/941383
[4] https://phabricator.wikimedia.org/T342621
[5] https://phabricator.wikimedia.org/T341494
[6] https://phabricator.wikimedia.org/T340031
[7] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/lima-…
[8] https://phabricator.wikimedia.org/T342853
[9] https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-framework-emailer
[10] https://phabricator.wikimedia.org/T342750
[11] https://phabricator.wikimedia.org/T342456
[12] https://phabricator.wikimedia.org/T298005
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation
Hi,
this is a summary of the incident related to the buildservice and the
distroless docker image that happened these past few days. The incident is
considered solved, and the information is left here for the record and for
future reference.
* As mentioned in phabricator [0], the buildservice's tekton pipeline requires a
distroless docker image that we copy from upstream into our internal docker
registry.
* Because of a comment in the upstream tekton code mentioning CRI-O support
problems, the image was being referenced by digest instead of by :tag. We
don't use CRI-O in our k8s clusters anyway.
* During a maintenance operation on our internal docker registry [1], this
tag-less docker image was removed.
* The k8s clusters and the Toolforge buildservice kept working normally because
the image was cached locally in each k8s worker node's docker image cache.
* I detected the missing image when trying to replicate the buildservice setup
on my local machine using lima-kilo. The buildservice couldn't complete an
image build because of the missing distroless image.
* I tried different strategies to bring back the missing docker image, including:
** uploading a new image copied from the upstream distroless one. The cookbook
worked, but since the image lacked the :debug tag, the buildservice could not
use it.
** rescuing the image from a hot cache on a worker node. This worked and I
could rescue the image, but the docker registry would remain in an inconsistent
state after injecting it, meaning that the image could not be pulled.
* After the above attempts, I couldn't find any combination of docker registry
state and buildservice configuration that would work, so I decided to inject a
new image called `toolforge-distroless-base-debug:latest`. Note the `-debug`
suffix and the `:latest` tag.
* This solved the problem: the docker registry started to happily serve the new
image (and the old one, WTH??), and it was a valid image for the buildservice.
* I think we can leave the buildservice configured like this, pulling the
explicit :debug tag of the upstream image and pushing it as -debug:latest to
our internal registry (a sketch of that flow follows this list). This means two
things:
** we don't have to manually bump the digest in the buildservice code every time
the upstream distroless image changes
** changes in the upstream distroless image could go unnoticed.
* We can re-evaluate this situation later, but I consider the current state to
be stable enough as of this writing.
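For illustration, a minimal sketch of that flow using plain docker commands
(the upstream path gcr.io/distroless/base and the internal registry hostname
below are placeholders; the actual copy is done with our cookbook):
```
# Pull the upstream distroless image by its explicit :debug tag
docker pull gcr.io/distroless/base:debug
# Re-tag it for our internal registry under the new -debug:latest name
docker tag gcr.io/distroless/base:debug \
    internal-registry.example.org/toolforge-distroless-base-debug:latest
docker push internal-registry.example.org/toolforge-distroless-base-debug:latest
```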
I have detected that more modern tekton versions change this distroless image
yet again, so we should do a careful evaluation when doing a future migration.
Another takeaway is to be very careful when operating on the docker registry
contents, because it is really easy to have a mishap and extremely difficult to
recover. It may be better to treat the docker registry as an append-only DB.
regards.
[0] https://phabricator.wikimedia.org/T321188
[1] https://sal.toolforge.org/log/1guHQYkBhuQtenzv9YOl
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation
Hello Admins,
For the past month, we have been testing Toolforge Build Service[0].
We invited about 30 tool maintainers to test out the service.
A number of maintainers responded with valuable feedback.
From the feedback received, there's an immediate need for support for the
following:
1. Custom environment variable support, to avoid having to create
configuration files in the NFS/tool home.
2. Package installation, for some specific cases in which OS/system
libraries are needed.
3. A secrets management solution that works similarly to environment
variables; this will remove the need to store secrets in the NFS/tool home.
Work is underway to add support for all of the above before the next round of
invites.
- For envvars (and secrets with them), see [1] for details; a brief sketch of
the workflow follows this list.
- For arbitrary package installation, see [2].
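As a rough sketch of the envvars workflow (the variable name is made up, and
the exact CLI syntax is documented in [1]):
```
# Create an environment variable for the current tool, then list them
toolforge envvars create MY_API_TOKEN
toolforge envvars list
```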
There's also work on adding documentation for other languages (php being
the next focus) so we can broaden the target languages.[3]
As a next step, we propose to expand the invitation to a larger number of
tool maintainers as soon as the above services/features are available. The
target date to gather the new list of invitees is Tuesday, the 20th of June,
but we will reconsider if the services are not ready by then.
On that day, we will send another email to this list with the new list of
proposed invitees, which will be sent out the day after if there are no
comments.
Please reach out if you have any questions or feedback.
For more details about the release timeline, see the dedicated task [4].
For more details about the beta release itself, please see the following:
- Current project board [5]
- The project page [6]
- Release discovered tasks [7]
Thanks for your support!
[0]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service
[1]: https://phabricator.wikimedia.org/T337538
[2]: https://phabricator.wikimedia.org/T336669
[3]: https://phabricator.wikimedia.org/T337397
[4]: https://phabricator.wikimedia.org/T335249
[5]: https://phabricator.wikimedia.org/project/board/5596/
[6]:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Ongoing_Efforts/Toolforge_…
[7]: https://phabricator.wikimedia.org/project/view/6529/
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
Hi!
I have been moving some of the toolforge-related repositories from gerrit
to gitlab; you will find all of them under:
https://gitlab.wikimedia.org/repos/cloud/toolforge
I have also renamed a few of them to make the naming a bit more consistent:
* Renamed toolforge-builds-api -> builds-api
* Renamed toolforge-envvars-cli -> envvars-cli
* Renamed toolforge-envvars-api -> envvars-api
So please make sure to update the remote urls of your local clones to the new
ones:
```
git remote set-url origin git@gitlab.wikimedia.org:repos/cloud/toolforge/<new_repo_name>.git
```
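You can run `git remote -v` afterwards to confirm the new URL is in place.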
This will allow us to better set up CI and CD pipelines to build and deploy
toolforge.
There's an ongoing decision about the exact process here:
https://phabricator.wikimedia.org/T339198
Make sure to write down your ideas there!
Have a nice weekend,
David
Hi cloud admins!
My name is Hal Triedman; I'm a Privacy Engineer at WMF, but in my spare
time I do a lot of work on machine learning. One of the things we've been
looking into is the creation of label-query datasets for Mediawiki database
queries, with the goal of fine-tuning an AI model to help users write queries
more easily, and of creating embeddings that allow for easier searching of
past queries.
Quarry is particularly interesting for this project because it has the
following qualities:
1) it operates entirely on Mediawiki databases
2) it has been used to make hundreds of thousands of queries
3) many of those queries have relatively descriptive titles that explain what
is happening in the SQL
Is there any easy way of assembling a database of existing public
title-query pairs (e.g. by running a database query that excludes things
like "Untitled query", or just pulling published queries)? Please let me
know, and thanks.
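To make the idea concrete, here's the shape of query I have in mind; the
table and column names are pure guesses at Quarry's schema, just for
illustration:
```
-- Illustrative sketch only: table/column names are hypothetical guesses
-- at Quarry's schema, not its real layout.
SELECT q.title, r.text
FROM query q
JOIN query_revision r ON r.id = q.latest_rev_id
WHERE q.published = 1
  AND q.title <> 'Untitled query';
```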
Hal