On Fri, Mar 7, 2014 at 2:54 PM, Tyler Romeo <tylerromeo(a)gmail.com> wrote:
On Fri, Mar 7, 2014 at 5:39 PM, George Herbert <george.herbert(a)gmail.com> wrote:
With all due respect; hell, yes, development comes in second to
operational stability.
This is not disrespecting development, which is extremely important by
any measure. But we're running a top-10 worldwide website, a key worldwide
information resource for humanity as a whole. We cannot cripple
development to try and maximize stability, but stability has to be
priority 1. Any large website's teams will have the same attitude.
I've had operational outages reach the top of everyone's news
source/feed/newspaper/broadcast. This is an exceptionally unpleasant
experience.
If you really think stability is top priority, then you cannot possibly
think that the current deployment process is sane.
Developers shouldn't be blocked on deployment or operations. Development is
expensive and things will break either way. It's good to assume things will
break and:
1. Have a simple way to revert
2. Put tests in for common errors
3. Have post-mortems where information is kept for historical purposes and
bugs are created to track action items that come from them
Right now you are placing the responsibility on the developers to make sure
the site is stable, because any change they merge might break production
since it is automatically sent out. If anything that gives the appearance
that the operations team doesn't care about stability, and would rather
wait until things break and revert them.
Yes! This is a _good_ thing. Developers should feel responsible for what
they build. It shouldn't be operations' job to make sure the site stays
stable through code changes. Things should go more in this direction, in fact.
I'm not totally sure what you mean by "it's automatically sent out",
though. Deploys are manual.
It is the responsibility of the operations team to ensure stability. Having
to revert something because that's the only way production will be stable
is not a proper workflow.
It's the responsibility of the operations team to ensure stability at the
infrastructure level, not at the application level. It's sane to expect to
revert things because things will break no matter what. Mean time to
recovery is at least as important as mean time between failures.
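
To put a rough number on that point: under the usual steady-state model,
availability = MTBF / (MTBF + MTTR). The sketch below uses made-up figures
(200 hours between failures, 2 hours to recover; these are assumptions for
illustration, not measurements from any real site) to show that halving
recovery time buys the same availability as doubling the time between
failures.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers, chosen only to illustrate the trade-off.
baseline = availability(mtbf_hours=200.0, mttr_hours=2.0)
faster_recovery = availability(mtbf_hours=200.0, mttr_hours=1.0)  # halve MTTR
fewer_failures = availability(mtbf_hours=400.0, mttr_hours=2.0)   # double MTBF

print(f"baseline:        {baseline:.4%}")
print(f"halve MTTR:      {faster_recovery:.4%}")
print(f"double MTBF:     {fewer_failures:.4%}")
```

In this model the last two figures are identical, so a quick, reliable
revert path is worth as much as making failures half as frequent.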
- Ryan