----- Original Message -----
From: "Ryan Lane" rlane32@gmail.com
On Wed, Mar 13, 2013 at 9:24 PM, Jay Ashworth jra@baylink.com wrote:
----- Original Message -----
From: "Ryan Lane" rlane32@gmail.com
Hey, Ryan; did you see, perhaps on outages-discussion, the after-action report from Microsoft about how their Azure SSL cert expiration screwup happened?
What's the relevance here?
"Does ops have a procedure for avoiding unexpected SSL cert expirations, and does this affect it in any way other than making it easier to implement?", I would think...
We didn't have a certificate expiration. We replaced all individual certificates, delivered by different top level domains, with a single unified certificate. This change was to fix certificate errors being shown on all non-wikipedia domains for HTTPS mobile users, who were being delivered the *.wikipedia.org certificate for all domains.
The unified certificate was missing 6 Subject Alternative Names: mediawiki.org, *.mediawiki.org, m.mediawiki.org, *.m.mediawiki.org, m.wikipedia.org and *.m.wikipedia.org. Shortly after deploying the certificate we noticed it was bad and reverted the affected services (mediawiki.org and mobile) back to their individual certificates. The change only affected a small portion of users for a short period of time.
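A check for this kind of gap might look roughly like the sketch below. It is only an illustration, not the actual ops tooling: the function names are hypothetical, the SAN list is assumed to have been extracted from the certificate already, and the wildcard rule assumed here is the common one where "*" stands in for exactly one leftmost label.

```python
def san_covers(san, hostname):
    """Return True if a single SAN DNS entry covers the given name."""
    san = san.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if san == hostname:
        return True
    # Assumption: a wildcard SAN matches exactly one leftmost label,
    # and a required wildcard name must appear in the SAN list verbatim.
    if san.startswith("*.") and not hostname.startswith("*."):
        head, _, rest = hostname.partition(".")
        return bool(head) and rest == san[2:]
    return False


def missing_names(san_entries, required):
    """Return the required names that no SAN entry covers, in input order."""
    return [
        name
        for name in required
        if not any(san_covers(san, name) for san in san_entries)
    ]


if __name__ == "__main__":
    # Hypothetical SAN list, not the contents of the real certificate.
    sans = ["wikipedia.org", "*.wikipedia.org"]
    wanted = ["en.wikipedia.org", "mediawiki.org", "*.m.wikipedia.org"]
    for name in missing_names(sans, wanted):
        print("missing:", name)
```

Run against a list of every name the services are expected to answer for, an empty result would mean the certificate is safe to deploy on that front.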
If you notice, I've already mentioned how we'll avoid and more quickly detect problems like this in the future:
"Needless to say I'll be writing a script that can be run against a cert to ensure it's not missing anything. We'll also be adding monitoring to check for invalid certificates for any top level domain."
I don't really think it was necessary to be this defensive, do you?
Well, clearly, you do. My apologies for trying to be helpful in making sure you saw an analysis with useful information in it.
Cheers, -- jra