Hi,
we plan on moving the Cloud VPS restricted bastion [1] to a new VM
based on Bookworm. The hostname will remain the same
(restricted.bastion.wmcloud.org) but it will point to a new VM running
Bookworm [2].
This will happen later today. If you SSH to a Cloud VPS instance after
this change, you will get a host key verification error and will have
to update the fingerprint for the bastion in your "known_hosts" file.
When the new server is live, I will update the fingerprints listed in
wikitech [3], so please verify they match what you see in your
terminal before accepting them. (Ideally this would be handled by
wmf-sre-laptop, see T329322.)
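Until that is automated, something like the following should work to refresh the stored key (a sketch using standard OpenSSH tools; the known_hosts path may differ if you keep a custom SSH config):

```shell
# Drop the old bastion entry from ~/.ssh/known_hosts
ssh-keygen -R restricted.bastion.wmcloud.org

# Fetch the new host keys and print their fingerprints, so they can be
# compared against the list on wikitech before accepting them
ssh-keyscan restricted.bastion.wmcloud.org 2>/dev/null | ssh-keygen -lf -
```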
Thanks,
Francesco
[1] https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances#Setup
[2] https://phabricator.wikimedia.org/T340241#9202859
[3] https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/restricted.bastio…
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation
There have been some user-facing DNS issues today. DNS is confusing and
I can't claim that I fully understand everything here, but here's the
best explanation/summary I have at the moment.
BACKGROUND
First, contrary to what we previously thought,
ns0/1.openstack.eqiad1.wikimediacloud.org. have glue records stored in
the .org registry. This is how Brandon explained it to me:
<taavi> bblack: but I still don't follow why that needs to be in
markmonitor. the affected domains use
ns0/1.openstack.eqiad1.wikimediacloud.org as the auth dns servers, and
wikimediacloud.org uses ns0/1/2.wikimedia.org
<bblack> taavi: topranks: delegation of NS authority flows down the
namespace tree, not the tree of which domains "depend" on which in the
logical sense, that's why the markmonitor part matters.
<bblack> if you start from zero knowledge (cold dns cache), you start at
the root servers to find the .org servers, you ask the org servers about
<whatever>.org, and if the NS record is /also/ anywhere within .org,
even a different <some-other-thing>.org, then the .org nameservers must
serve the glue address
This means that changing operations/dns.git is not good enough for
updates to those specific addresses. This is what .org servers had until
today:
ns0.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.135
ns1.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.11
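(For reference, one way to observe what the TLD servers hand out is to query one of the .org nameservers directly with recursion disabled; the glue A records then show up in the ADDITIONAL section. A sketch:)

```shell
# Pick one of the .org TLD nameservers
ORG_NS=$(dig +short org. NS | head -n1)

# Ask it directly, without recursion; for a nameserver name that itself
# lives under .org, the glue A records appear in the ADDITIONAL section
dig +norecurse @"$ORG_NS" ns0.openstack.eqiad1.wikimediacloud.org A
```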
For comparison, this is what the zone files for ns0/1/2.wikimedia.org
had, again before all of the cloudlb maintenance started:
ns0.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.148 ;
cloudservices1005
ns1.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.11 ;
cloudservices1004
You may notice the record for ns0 is different. 208.80.154.135 has
pointed to gerrit1003 since March (according to Netbox). So only one of
the two name servers that we had in the glue records was working in the
first place.
The tricky part here is that different resolvers seem to be using
different sources for the nsX.openstack records, presumably due to
caches at various levels.
BREAKAGE
As a part of the cloudlb introduction, the AuthDNS addresses are being
moved to VIPs (185.15.56.162 and 185.15.56.163). cloudservices1006, the
first node in the new setup, is now serving the new ns1 address (.163).
ns1.openstack was changed in the wikimediacloud.org zone files, but the
glue records in .org remained unchanged.
However, the old ns1 address was the only working glue record. So taking
down cloudservices1004 (the old ns1) broke clients that were using the
glue information. While it seems like stuff continued working for the
majority of people, we did have several people come ask about these
issues, so there was some impact.
FIXES SO FAR
Two things were done this evening to fix the immediate issues:
* First, Rob H from the dc-ops team (and one of the few people who can
update our domain registrar) sent a message asking for the data in the
.org root to be updated to match the current status of the
wikimediacloud.org zone files.
https://phabricator.wikimedia.org/T346177#9161417
* Second, Cathal applied some network-level hacks to make the old ns1
address answer queries again.
https://phabricator.wikimedia.org/T346177#9161474
NEXT STEPS
I think the steps to complete this migration without any further user
impact are roughly the following:
1. Make cloudservices1006 also answer queries for 185.15.56.162 (new ns0).
2. Update both the wikimediacloud.org zone file and the .org glue
records to reference .162 as the ns0 record.
3. Wait for all of the DNS TTLs for ns0 to expire.
4. Revert the routing hacks for 208.80.154.11. Also remove the Netbox
record for it.
5. Move cloudservices1005 to the cloudlb network setup.
6. Move .162 (new ns0) from cloudservices1006 to 1005.
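For step 3, the remaining cache lifetime can be watched directly (a sketch; the second field of the answer is the TTL in seconds):

```shell
# The second field is the TTL in seconds; once resolvers have counted it
# down and re-queried, the old ns0 address should be gone from caches
dig +noall +answer ns0.openstack.eqiad1.wikimediacloud.org A
```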
Taavi
Hi,
today, 2023-09-11, we will be conducting some internal Cloud VPS DNS
service operations:
* change the DNS recursor of every virtual machine running in Cloud VPS from
208.80.154.143 and 208.80.154.24 to 172.20.255.1 (this is traditionally
configured via /etc/resolv.conf)
* change the real server behind the authoritative DNS server
ns1.openstack.eqiad1.wikimediacloud.org, including its IP address, from
208.80.154.11 to 185.15.56.163
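If you want to sanity-check a virtual machine after the switch, something along these lines should do (a sketch, assuming dig is installed on the VM):

```shell
# The resolver list should now contain the new recursor VIP
grep '^nameserver' /etc/resolv.conf

# And the VIP itself should answer queries directly
dig @172.20.255.1 +short wikitech.wikimedia.org A
```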
This may briefly affect some virtual machines, but the new DNS servers have
been running for a while already and we are not anticipating a major impact
(famous last words?).
Please report any problems you may find.
Some phabricator tickets tracking this work are:
* https://phabricator.wikimedia.org/T345240 cloudservices1006: put into service
* https://phabricator.wikimedia.org/T346033 cloudservices1004: decommission
* https://phabricator.wikimedia.org/T342621 eqiad1: cloudlb: transition DNS
clients (VMs) to the new BGP-based recursor VIP
regards.
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation
Hello Admins,
As communicated earlier, we have put together a list of about 100 tools
whose maintainers we propose to invite in the next round of testing.
This expanded list now includes tools written in languages other than
Python.
You can see the list here [0].
Based on your feedback and suggestions, support for custom and secret
environment variables [2] and package installation for the build
service [1] has now been rolled out to Toolforge.
If no changes are requested, the new invites will be sent on the 23rd
of June.
Kindly reach out if you have any questions or feedback.
Thank you!
[0] https://etherpad.wikimedia.org/p/second-round
[1]
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Install_ap…
[2] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Envvars_Service