[Labs-l] Volunteers wanted to opt in to a new DNS system

Antoine Musso hashar+wmf at free.fr
Tue Apr 7 14:20:36 UTC 2015


On 03/04/15 19:38, Andrew Bogott wrote:
<snip>
> Additionally, I would appreciate it if a few projects would volunteer to
> be early adopters.  If you're interested in trying it out, please
> respond to this email so that I know who's trying, and then go to your
> 'configure instance' pages and clear the 'use_dnsmasq' setting.  If your
> instance is using role::puppet::self, you'll also need to sign a new
> puppet cert, like this:
>
> $ sudo puppet cert sign <hostname>.<projectname>.eqiad.wmflabs
>
> In addition to being more reliable, the new DNS system will also support
> names that include the project name, like
> 'util-abogott.testlabs.eqiad.wmflabs'.  The old naming scheme is still
> supported, but many services will be gradually moving over to the new
> scheme to avoid ambiguity between projects.
>
> After a few weeks of testing I'll start to migrate everything to the new
> server if things look good.  Let me know how things go.
>
> Thanks!
>
> -Andrew
>
>
> [1] The new system uses openstack-designate to create dns entries which
> are subsequently served by a powerdns server running on
> labs-ns2.wikimedia.org

Hello,

The 'integration' labs project was switched to the new DNS by mistake,
which caused a partial outage on CI.


The 'use_dnsmasq' setting (which is set to true on instances) was renamed
to 'use_dnsmasq_server' when support for hiera was added with:
https://gerrit.wikimedia.org/r/#/c/202278/

That immediately caused the puppet clients in the integration project to
switch to the new DNS resolver, which caused two major issues:


A) all puppet clients suddenly refused connections, due to the certname
being based on the hostname instead of the ec2id

B) Jenkins jobs hitting the beta cluster all failed because the
resolution of *.beta.wmflabs.org DNS entries yields the public instance
IP, which is not reachable.
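
To illustrate issue B, here is a minimal Python sketch that checks whether
the address a name resolves to is a private one. The function names and
the sample addresses are assumptions made for illustration; they are not
taken from the beta cluster configuration.

```python
import ipaddress
import socket


def is_private_address(address: str) -> bool:
    """Return True if the address lies in a private (non-public) range."""
    return ipaddress.ip_address(address).is_private


def resolves_to_private(hostname: str) -> bool:
    """Resolve hostname with the system resolver and classify the answer."""
    return is_private_address(socket.gethostbyname(hostname))
```

Running resolves_to_private() from an instance against one of the
*.beta.wmflabs.org names would show which kind of address the resolver in
use hands back.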


I have filed https://phabricator.wikimedia.org/T95273 which contains
the work I did to revert to the previous state. Namely:

* Have hiera set both use_dnsmasq and use_dnsmasq_server 
https://wikitech.wikimedia.org/w/index.php?title=Hiera:Integration&diff=152484&oldid=152033

* Manually reinstall / fix the configuration on the puppetmaster, since
files had gone wild.
* Fight with cert regeneration on the puppetmaster
* Manually fix /etc/resolv.conf on all instances and restart nscd to
clear the DNS cache
* Delete and invalidate the certs of all clients and re-sign them.
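
The /etc/resolv.conf step can be sketched in Python as a pure text
rewrite. This is only an illustration under assumptions: the helper name
set_nameserver is hypothetical, and the nameserver address to restore is
whatever the instance used before the switch (the value in the usage note
is a placeholder).

```python
def set_nameserver(resolv_conf: str, nameserver: str) -> str:
    """Return resolv.conf content with every nameserver line replaced by one."""
    kept = [
        line
        for line in resolv_conf.splitlines()
        if not line.strip().startswith("nameserver")
    ]
    # Append the single resolver we want the instance to use.
    kept.append(f"nameserver {nameserver}")
    return "\n".join(kept) + "\n"
```

For example, set_nameserver(text, "10.0.0.2") keeps the 'domain' and
'search' lines of the file and leaves a single 'nameserver 10.0.0.2' line.
The cache still has to be cleared afterwards by restarting nscd.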

The beta cluster, which uses a puppetmaster, was luckily not affected.
No idea why, though.


I have filed https://phabricator.wikimedia.org/T95288 to have Designate
yield different answers based on the client (a feature known as split
horizon).
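
As a rough sketch of what split horizon means, the Python below picks an
answer according to the network the client is querying from. It is not how
Designate or powerdns implement it; the record name, the addresses and the
internal network range are all hypothetical values chosen for
illustration.

```python
import ipaddress

# Hypothetical values, for illustration only.
INTERNAL_NETWORK = ipaddress.ip_network("10.0.0.0/8")
RECORDS = {
    "example.beta.wmflabs.org": {
        "internal": "10.0.0.5",      # answer for clients inside the network
        "external": "198.51.100.7",  # answer for everyone else
    },
}


def resolve(name: str, client_ip: str) -> str:
    """Return the answer for name that matches the client's location."""
    inside = ipaddress.ip_address(client_ip) in INTERNAL_NETWORK
    view = "internal" if inside else "external"
    return RECORDS[name][view]
```

With such a setup, a CI job running on an internal address would get the
internal IP for a beta cluster name, while outside clients would keep
getting the public one.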


-- 
Antoine Musso









