Executive summary:
I messed with ldap today. Gerrit handles ldap differently from all other services, so it broke and it took several ops several hours to sort out what was happening. Everything is working again now.
Details:
As part of an elaborate post-Tampa ballet[1] I moved the ldap servers from virt1000 and virt0 to ldap-eqiad (aka neptunium for the time being) and ldap-codfw (aka labcontrol2001). This change was made this morning via puppet[2].
Much to my delight, labs handled the change gracefully and without any service interruptions.
Wikitech suffered a brief outage because I neglected to note that it depends on an ldap server name in the mediawiki config. I hotfixed that on virt1000 and also submitted a proper patch[3] for review. With that change wikitech returned to normal, although (as usual) caches are broken and many users will have to log out and in again to get all the labs features they're used to.
With the change in ldap server, Gerrit logins went down and stayed down. At various times Marc, Rob, Brandon and I were all involved in troubleshooting. Several changes were made to the ldap setup cluster-wide[4][5]. These changes are probably correct, but they did Gerrit no good (and getting them applied without Gerrit was no walk in the park).
After a great many more blind alleys, Marc noted that we typically handle ldap certificate validation by specifying a root cert in ldap.conf, and that is not the Proper Debian Way. Apparently we've just been lucky so far that most of our ldap services use ldap.conf rather than the systemwide ca-certificates system. The right solution is to drop trusted certs into /usr/local/share/ca-certificates and then regenerate /etc/ssl/certs/ca-certificates.crt by running update-ca-certificates. Marc did this on ytterbium (the Gerrit host) and Gerrit immediately started working again.
Remaining tasks are:
1) Puppetize Marc's hotfix[6]
2) (Maybe) totally refactor how we use ldap everywhere so that it conforms to Debian standards.
3) Document all the services that rely on ldap so the next time someone (me, probably) messes with it, they know what to watch for[7]
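For anyone who wants to see the shape of the fix before the puppet change lands, here is a minimal sketch of the two steps described above: copy the cert into the local trust directory, then run update-ca-certificates. It is not Marc's actual hotfix or the puppet patch. The helper name install_ca_cert, its arguments and the example file name are all made up for illustration. One detail I am sure of: update-ca-certificates only picks up files in that directory whose names end in .crt.

```python
import shutil
import subprocess
from pathlib import Path

# Debian's directory for locally trusted CA certs.
# update-ca-certificates only considers files here that end in ".crt".
LOCAL_CA_DIR = Path("/usr/local/share/ca-certificates")


def install_ca_cert(pem_path, ca_dir=LOCAL_CA_DIR, run=subprocess.run):
    """Copy a PEM-encoded CA cert into ca_dir, then rebuild the system bundle.

    Hypothetical helper: the name, arguments and defaults are illustrative
    and are not taken from the actual hotfix or puppet change.
    """
    src = Path(pem_path)
    # Force a ".crt" suffix so update-ca-certificates will see the file.
    dest = Path(ca_dir) / (src.stem + ".crt")
    shutil.copyfile(src, dest)
    # Rebuilds the system CA bundle; needs root on a real host.
    run(["update-ca-certificates"], check=True)
    return dest
```

On a real host this would be run as root, e.g. install_ca_cert("example-root.pem"), where example-root.pem is a placeholder name. The ca_dir and run parameters exist only so the steps can be tried against a scratch directory without touching the system trust store.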
Many thanks to Marc, Rob and Brandon for joining in when I called out for help with this problem.
[1] https://wikitech.wikimedia.org/wiki/Ldap_rename
[2] https://gerrit.wikimedia.org/r/#/c/162689/
[3] https://gerrit.wikimedia.org/r/#/c/163189/
[4] https://gerrit.wikimedia.org/r/#/c/163183/
[5] https://gerrit.wikimedia.org/r/#/c/163194/
[6] https://gerrit.wikimedia.org/r/163222
[7] https://wikitech.wikimedia.org/wiki/LDAP
wikitech-l@lists.wikimedia.org