Re: [Wikitech-l] Gerrit Commit Wars

7 Mar 2014

On Fri, Mar 7, 2014 at 2:54 PM, Tyler Romeo <tylerromeo@gmail.com> wrote:
...
On Fri, Mar 7, 2014 at 5:39 PM, George Herbert <george.herbert@gmail.com> wrote:
...
With all due respect; hell, yes, development comes in second to operational
stability.
This is not disrespecting development, which is extremely important by any
measure.  But we're running a top-10 worldwide website, a key worldwide
information resource for humanity as a whole.  We cannot cripple
development to try and maximize stability, but stability has to be priority
...

Any large website's teams will have the same attitude.

I've had operational outages reach the top of everyone's news
source/feed/newspaper/broadcast.  This is an exceptionally unpleasant
experience.
If you really think stability is top priority, then you cannot possibly
think that the current deployment process is sane.
Developers shouldn't be blocked on deployment or operations. Development is
expensive and things will break either way. It's good to assume things will
break and:
1. Have a simple way to revert
2. Put tests in for common errors
3. Have post-mortems where information is kept for historical purposes and
bugs are created to track action items that come from them
...
Right now you are placing the responsibility on the developers to make sure
the site is stable, because any change they merge might break production
since it is automatically sent out. If anything that gives the appearance
that the operations team doesn't care about stability, and would rather
wait until things break and revert them.
Yes! This is a _good_ thing. Developers should feel responsible for what
they build. It shouldn't be operations' job to make sure the site is
stable for code changes. Things should go more in this direction, in fact.
I'm not totally sure what you mean by "it's automatically sent out",
though. Deploys are manual.
...
It is the responsibility of the operations team to ensure stability. Having
to revert something because that's the only way production will be stable
is not a proper workflow.
It's the responsibility of the operations team to ensure stability at the
infrastructure level, not at the application level. It's sane to expect to
revert things because things will break no matter what. Mean time to
recovery is just as important as, or more important than, mean time between
failures.
- Ryan
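
To put rough numbers on the point about mean time to recovery (MTTR) versus mean time between failures (MTBF), here is a minimal sketch using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The function name and all of the hour figures are hypothetical and chosen only for illustration; they are not measurements from any real deployment.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, for illustration only.
baseline = availability(mtbf_hours=200.0, mttr_hours=2.0)

# Option A: failures happen half as often (MTBF doubled).
fewer_failures = availability(mtbf_hours=400.0, mttr_hours=2.0)

# Option B: recovery is twice as fast (MTTR halved), e.g. via a quick revert.
faster_recovery = availability(mtbf_hours=200.0, mttr_hours=1.0)

print(f"baseline:        {baseline:.4%}")
print(f"fewer failures:  {fewer_failures:.4%}")
print(f"faster recovery: {faster_recovery:.4%}")
```

Under this simple model, halving MTTR yields the same availability as doubling MTBF, which is one way to read the argument that a fast, simple revert path matters at least as much as preventing every failure.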
