On Wed, Mar 13, 2013 at 9:24 PM, Jay Ashworth jra@baylink.com wrote:
----- Original Message -----
From: "Ryan Lane" rlane32@gmail.com
Hey, Ryan; did you see, perhaps on outages-discussion, the after action report from Microsoft about how their Azure SSL cert expiration screwup happened?
What's the relevance here?
"Does ops have a procedure for avoiding unexpected SSL cert expirations, and does this affect it in any way other than making it easier to implement?", I would think...
We didn't have a certificate expiration. We replaced all individual certificates, delivered by different top level domains, with a single unified certificate. This change was to fix certificate errors being shown on all non-wikipedia domains for HTTPS mobile users, who were being delivered the *.wikipedia.org certificate for all domains.
The unified certificate was missing 6 Subject Alternative Names: mediawiki.org, *.mediawiki.org, m.mediawiki.org, *.m.mediawiki.org, m.wikipedia.org and *.m.wikipedia.org. Shortly after deploying the certificate we noticed it was bad and reverted the affected services ( mediawiki.org and mobile) back to their individual certificates. The change only affected a small portion of users for a short period of time.
If you notice, I've already mentioned how we'll avoid and more quickly detect problems like this in the future:
"Needless to say I'll be writing a script that can be run against a cert to ensure it's not missing anything. We'll also be adding monitoring to check for invalid certificates for any top level domain."
- Ryan