There have been some user-facing DNS issues today. DNS is confusing and I can't claim that I fully understand everything here, but here's the best explanation/summary I have at the moment.
BACKGROUND
First, unlike what we thought before, ns0/1.openstack.eqiad1.wikimediacloud.org. have glue records stored in the .org registry. This is how Brandon explained that to me:
<taavi> bblack: but I still don't follow why that needs to be in markmonitor. the affected domains use ns0/1.openstack.eqiad1.wikimediacloud.org as the auth dns servers, and wikimediacloud.org uses ns0/1/2.wikimedia.org
<bblack> taavi: topranks: delegation of NS authority flows down the namespace tree, not the tree of which domains "depend" on which in the logical sense, that's why the markmonitor part matters.
<bblack> if you start from zero knowledge (cold dns cache), you start at the root servers to find the .org servers, you ask the org servers about <whatever>.org, and if the NS record is /also/ anywhere within .org, even a different <some-other-thing>.org, then the .org nameservers must serve the glue address
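To make Brandon's point concrete, here is a minimal sketch of the rule as I understand it: whether the parent's servers have to hand out an address for a name server depends only on where the name server's own name sits in the namespace tree. The function names are made up for illustration, and this is a simplified model, not how any real resolver or registry is implemented.

```python
def labels(name: str) -> list[str]:
    """Split a domain name into lowercase labels, ignoring the trailing dot."""
    return [label for label in name.lower().split(".") if label]


def is_within(name: str, zone: str) -> bool:
    """Return True if `name` equals `zone` or is somewhere below it."""
    zone_labels = labels(zone)
    if not zone_labels:
        # The root zone contains every name.
        return True
    return labels(name)[-len(zone_labels):] == zone_labels


def glue_comes_from_parent(ns_name: str, parent_zone: str) -> bool:
    """Simplified rule from the explanation above (hypothetical helper).

    If the name server's name lives anywhere inside the parent zone,
    the parent's servers are the ones that must supply its address,
    regardless of which zone "logically" owns that name.
    """
    return is_within(ns_name, parent_zone)


# The cloud name servers live under .org, so .org must serve their glue.
print(glue_comes_from_parent("ns0.openstack.eqiad1.wikimediacloud.org.", "org."))
```

By this rule the glue for ns0/1.openstack.eqiad1.wikimediacloud.org. comes from the .org servers, even though the wikimediacloud.org zone itself is served by ns0/1/2.wikimedia.org.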
This means that changing operations/dns.git is not good enough for updates to those specific addresses. This is what .org servers had until today:
ns0.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.135
ns1.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.11
For comparison, this is what the zone files served by ns0/1/2.wikimedia.org had, again before all of the cloudlb maintenance started:
ns0.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.148 ; cloudservices1005
ns1.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.11 ; cloudservices1004
You may notice the record for ns0 is different. 208.80.154.135 has pointed to gerrit1003 since March (according to Netbox). So only one of the two name servers that we had in the glue records has been working in the first place.
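This kind of drift is easy to check for mechanically. A minimal sketch, using the two sets of records quoted above as hard-coded data (a real check would have to query the .org servers and our own authoritative servers; the function name is made up):

```python
# Glue records as held by the .org servers until today.
GLUE = {
    "ns0.openstack.eqiad1.wikimediacloud.org.": "208.80.154.135",
    "ns1.openstack.eqiad1.wikimediacloud.org.": "208.80.154.11",
}

# Records from the wikimediacloud.org zone files before the cloudlb work.
ZONE = {
    "ns0.openstack.eqiad1.wikimediacloud.org.": "208.80.154.148",
    "ns1.openstack.eqiad1.wikimediacloud.org.": "208.80.154.11",
}


def find_mismatches(glue: dict[str, str], zone: dict[str, str]) -> dict:
    """Return {name: (glue_address, zone_address)} for every name that differs.

    A name present on only one side shows up with None for the other side.
    """
    return {
        name: (glue.get(name), zone.get(name))
        for name in sorted(set(glue) | set(zone))
        if glue.get(name) != zone.get(name)
    }


for name, (glue_addr, zone_addr) in find_mismatches(GLUE, ZONE).items():
    print(f"{name} glue={glue_addr} zone={zone_addr}")
```

With the data above, only ns0 is reported, which matches the stale 208.80.154.135 glue record.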
The tricky part here is that different resolvers seem to be using different sources for the nsX.openstack records, presumably due to caches at various levels.
BREAKAGE
As a part of the cloudlb introduction, the AuthDNS addresses are being moved to VIPs (185.15.56.162 and 185.15.56.163). cloudservices1006, the first node in the new setup, is now serving the new ns1 address (.163). ns1.openstack was changed in the wikimediacloud.org zone files, but the glue records in .org remained unchanged.
However, the old ns1 address was the only working glue record. So taking down cloudservices1004 (the old ns1) broke clients that were using the glue information. While it seems like things kept working for the majority of people, several people did ask about these issues, so there was some impact.
FIXES SO FAR
Two things were done this evening to fix the immediate issues:
* First, Rob H from the dc-ops team (one of the few people who can make changes at our domain registrar) sent a message asking for the glue data in the .org registry to be updated to match the current state of the wikimediacloud.org zone files. https://phabricator.wikimedia.org/T346177#9161417
* Second, Cathal applied some network-level hacks to make the old ns1 address answer queries again. https://phabricator.wikimedia.org/T346177#9161474
NEXT STEPS
I think the steps to complete this migration without any further user impact are roughly the following:
1. Make cloudservices1006 also answer queries for 185.15.56.162 (new ns0).
2. Update both the wikimediacloud.org zone file and the .org glue records to reference .162 as the ns0 record.
3. Wait for all of the DNS TTLs for ns0 to expire.
4. Revert the routing hacks for 208.80.154.11. Also remove the Netbox record for it.
5. Move cloudservices1005 to the cloudlb network setup.
6. Move .162 (new ns0) from cloudservices1006 to 1005.
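For the "wait for the TTLs to expire" step, the waiting period is set by the longest TTL involved, since a cache that fetched the old record just before the change can keep it for that long. A rough sketch of the arithmetic; the timestamp and TTL values below are placeholders I made up, not the real ones, and the result is only a lower bound because not every resolver honours TTLs strictly:

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_revert(changed_at: datetime, ttls_seconds: list[int]) -> datetime:
    """Earliest time at which every well-behaved cache has dropped the old record.

    `changed_at` is when the last copy of the old record was replaced
    (zone file or glue, whichever was updated later).
    """
    return changed_at + timedelta(seconds=max(ttls_seconds))


# Placeholder values, for illustration only.
changed_at = datetime(2023, 9, 13, 20, 0, tzinfo=timezone.utc)
ttls = [3600, 86400]  # hypothetical zone TTL and glue TTL, in seconds

print(earliest_safe_revert(changed_at, ttls).isoformat())
```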
Taavi