On Fri, Mar 7, 2014 at 2:54 PM, Tyler Romeo <tylerromeo@gmail.com> wrote:
On Fri, Mar 7, 2014 at 5:39 PM, George Herbert <george.herbert@gmail.com> wrote:
With all due respect; hell, yes, development comes in second to operational stability.
This is not disrespecting development, which is extremely important by any measure. But we're running a top-10 worldwide website, a key worldwide information resource for humanity as a whole. We cannot cripple development to try and maximize stability, but stability has to be the priority. Any large website's teams will have the same attitude.
I've had operational outages reach the top of everyone's news source/feed/newspaper/broadcast. This is an exceptionally unpleasant experience.
If you really think stability is top priority, then you cannot possibly think that the current deployment process is sane.
Developers shouldn't be blocked on deployment or operations. Development is expensive and things will break either way. It's good to assume things will break and:
1. Have a simple way to revert
2. Put tests in for common errors
3. Have post-mortems where information is kept for historical purposes and bugs are created to track action items that come from them
Right now you are placing the responsibility on the developers to make sure the site is stable, because any change they merge might break production since it is automatically sent out. If anything, that gives the appearance that the operations team doesn't care about stability and would rather wait until things break and then revert them.
Yes! This is a _good_ thing. Developers should feel responsible for what they build. It shouldn't be operations' job to make sure the site is stable for code changes. Things should go more in this direction, in fact.
I'm not totally sure what you mean by "it's automatically sent out", though. Deploys are manual.
It is the responsibility of the operations team to ensure stability. Having to revert something because that's the only way production will be stable is not a proper workflow.
It's the responsibility of the operations team to ensure stability at the infrastructure level, not at the application level. It's sane to expect to revert things, because things will break no matter what. Mean time to recovery is just as important as mean time between failures, if not more so.
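To put rough numbers on the recovery-time point, here is a minimal sketch using the usual steady-state availability formula, MTBF / (MTBF + MTTR), and treating MTBF as the mean uptime between failures. The hour figures are made up purely for illustration; they are not measurements from any real site.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the site is up.

    Assumes MTBF means mean uptime between failures and MTTR means
    mean time to recover from one failure.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, chosen only to illustrate the comparison.
baseline = availability(mtbf_hours=100.0, mttr_hours=2.0)
fewer_failures = availability(mtbf_hours=200.0, mttr_hours=2.0)   # failures half as often
faster_recovery = availability(mtbf_hours=100.0, mttr_hours=1.0)  # recovery twice as fast

print(f"baseline:        {baseline:.4f}")
print(f"fewer failures:  {fewer_failures:.4f}")
print(f"faster recovery: {faster_recovery:.4f}")
```

Under this formula, halving recovery time buys the same availability as doubling the time between failures, and being able to revert quickly is often the cheaper of the two to achieve.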
- Ryan