There have been some user-facing DNS issues today. DNS is confusing and
I can't claim that I fully understand everything here, but here's the
best explanation/summary I have at the moment.
BACKGROUND
First, contrary to what we previously thought,
ns0/1.openstack.eqiad1.wikimediacloud.org. have glue records stored in
the .org registry. This is how Brandon explained it to me:
<taavi> bblack: but I still don't follow why that needs to be in
markmonitor. the affected domains use
ns0/1.openstack.eqiad1.wikimediacloud.org as the auth dns servers, and
wikimediacloud.org uses ns0/1/2.wikimedia.org
<bblack> taavi: topranks: delegation of NS authority flows down the
namespace tree, not the tree of which domains "depend" on which in the
logical sense, that's why the markmonitor part matters.
<bblack> if you start from zero knowledge (cold dns cache), you start at
the root servers to find the .org servers, you ask the org servers about
<whatever>.org, and if the NS record is /also/ anywhere within .org,
even a different <some-other-thing>.org, then the .org nameservers must
serve the glue address
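As a rough illustration of the rule Brandon describes (this is not how any
real resolver or registry is implemented), the check boils down to asking
whether a name server's host name sits inside the parent zone. The helper
names below are made up for this sketch, and ns.example.net is a
placeholder name.

```python
def normalize(name: str) -> str:
    """Lowercase a DNS name and make sure it ends with the root dot."""
    name = name.lower()
    return name if name.endswith(".") else name + "."


def is_within(name: str, zone: str) -> bool:
    """True if `name` equals `zone` or sits below it in the namespace tree."""
    name, zone = normalize(name), normalize(zone)
    if zone == ".":
        return True
    return name == zone or name.endswith("." + zone)


def needs_glue_from(parent_zone: str, ns_names: list[str]) -> list[str]:
    """NS host names whose addresses the parent zone's servers have to supply."""
    return [ns for ns in ns_names if is_within(ns, parent_zone)]


# NS names taken from the thread.
ns_set = [
    "ns0.openstack.eqiad1.wikimediacloud.org",
    "ns1.openstack.eqiad1.wikimediacloud.org",
]
print(needs_glue_from("org", ns_set))

# A name server outside .org (placeholder name) needs no glue from .org.
print(needs_glue_from("org", ["ns.example.net"]))
```

Both nsX.openstack names live under .org, so by this rule the .org servers
are the ones handing out their addresses on a cold cache.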
This means that changing operations/dns.git alone is not enough to
update those specific addresses; the glue records at the registry have
to be updated as well. This is what the .org servers had until today:
ns0.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.135
ns1.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.11
For comparison, this is what the wikimediacloud.org zone files served by
ns0/1/2.wikimedia.org had before all of the cloudlb maintenance started:
ns0.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.148 ; cloudservices1005
ns1.openstack.eqiad1.wikimediacloud.org. IN A 208.80.154.11 ; cloudservices1004
You may notice that the record for ns0 is different. 208.80.154.135 has
pointed to gerrit1003 since March (according to Netbox), so only one of
the two name servers in the glue records was working to begin with.
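A mismatch like this is easy to spot mechanically once both data sets are
side by side. Here is a minimal sketch, with the addresses copied by hand
from the records quoted above; the function name is made up, and nothing
here queries real DNS.

```python
# Glue addresses as held by the .org servers (from the records quoted above).
glue = {
    "ns0.openstack.eqiad1.wikimediacloud.org.": "208.80.154.135",
    "ns1.openstack.eqiad1.wikimediacloud.org.": "208.80.154.11",
}

# Addresses in the wikimediacloud.org zone files (also from above).
zone = {
    "ns0.openstack.eqiad1.wikimediacloud.org.": "208.80.154.148",
    "ns1.openstack.eqiad1.wikimediacloud.org.": "208.80.154.11",
}


def find_mismatches(glue, zone):
    """Return {name: (glue_addr, zone_addr)} for names where the two disagree."""
    names = sorted(set(glue) | set(zone))
    return {
        name: (glue.get(name), zone.get(name))
        for name in names
        if glue.get(name) != zone.get(name)
    }


for name, (glue_addr, zone_addr) in find_mismatches(glue, zone).items():
    print(f"{name} glue={glue_addr} zone={zone_addr}")
```

With the data above, only the ns0 entry is reported.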
The tricky part here is that different resolvers seem to be using
different sources for the nsX.openstack records, presumably due to
caches at various levels.
BREAKAGE
As a part of the cloudlb introduction, the AuthDNS addresses are being
moved to VIPs (185.15.56.162 and 185.15.56.163). cloudservices1006, the
first node in the new setup, is now serving the new ns1 address (.163).
ns1.openstack was changed in the wikimediacloud.org zone files, but the
glue records in .org remained unchanged.
However, the old ns1 address was the only working glue record, so taking
down cloudservices1004 (the old ns1) broke clients that were relying on
the glue information. Things seem to have kept working for the majority
of people, but several people came to ask about these issues, so there
was some impact.
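To make the failure mode concrete, here is a small model of which
published addresses actually had a DNS server behind them before and
after the change. It is a sketch under two assumptions that are mine:
that cloudservices1005 kept answering on 208.80.154.148 throughout, and
that 208.80.154.135 (gerrit1003) never answered DNS queries. The function
and variable names are made up.

```python
# Addresses published by each source (ns0 first, then ns1).
glue_records = ["208.80.154.135", "208.80.154.11"]   # .org glue, unchanged
zone_before = ["208.80.154.148", "208.80.154.11"]    # zone files, before
zone_after = ["208.80.154.148", "185.15.56.163"]     # zone files, ns1 moved

# Addresses with a working authoritative server behind them.
# ASSUMPTION: cloudservices1005 (.148) stayed up during the change.
answering_before = {"208.80.154.148", "208.80.154.11"}
answering_after = {"208.80.154.148", "185.15.56.163"}


def usable(addresses, answering):
    """Subset of published addresses that would actually answer queries."""
    return [addr for addr in addresses if addr in answering]


stages = [
    ("glue, before", glue_records, answering_before),
    ("zone, before", zone_before, answering_before),
    ("glue, after", glue_records, answering_after),
    ("zone, after", zone_after, answering_after),
]
for label, addresses, answering in stages:
    working = usable(addresses, answering)
    print(f"{label}: {len(working)} of {len(addresses)} usable")
```

In this model, clients relying on the glue go from one usable server to
none, while clients relying on the zone data keep two.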
FIXES SO FAR
Two things were done this evening to fix the immediate issues:
* First, Rob H from the dc-ops team (one of the few people who can make
changes at our domain registrar) sent a message asking for the glue data
in the .org registry to be updated to match the current state of the
wikimediacloud.org zone files.
https://phabricator.wikimedia.org/T346177#9161417
* Second, Cathal applied some network-level hacks to make the old ns1
address answer queries again.
https://phabricator.wikimedia.org/T346177#9161474
NEXT STEPS
I think the steps to complete this migration without any further user
impact are roughly the following:
1. Make cloudservices1006 also answer queries for 185.15.56.162 (new ns0).
2. Update both the wikimediacloud.org zone file and the .org glue
records to reference .162 as the ns0 record.
3. Wait for all of the DNS TTLs for ns0 to expire.
4. Revert the routing hacks for 208.80.154.11. Also remove the Netbox
record for it.
5. Move cloudservices1005 to the cloudlb network setup.
6. Move .162 (new ns0) from cloudservices1006 to 1005.
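For step 3, the waiting time can be worked out from when each copy of the
old record was replaced and the TTL it was published with. The sketch
below only shows the arithmetic: the timestamps and TTL values are
placeholders, not the real ones, and it assumes caches honour TTLs.

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_revert(changes):
    """Given (change_time, ttl_seconds) for every place the old address was
    published, return the time after which a TTL-respecting cache should no
    longer hold it."""
    return max(when + timedelta(seconds=ttl) for when, ttl in changes)


# PLACEHOLDER values for illustration only; check the real TTLs and
# change times before relying on this.
zone_change = (datetime(2000, 1, 1, 0, 0, tzinfo=timezone.utc), 3600)
glue_change = (datetime(2000, 1, 1, 12, 0, tzinfo=timezone.utc), 86400)

print(earliest_safe_revert([zone_change, glue_change]).isoformat())
```

The later of the two expiry times wins, so a slow glue update at the
registry can dominate the wait even if the zone file TTL is short.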
Taavi