Re: [Wikitech-l] Gerrit Commit Wars

8 Mar 2014

On Fri, Mar 7, 2014 at 2:54 PM, Tyler Romeo <tylerromeo(a)gmail.com> wrote:

...
  On Fri, Mar 7, 2014 at 5:39 PM, George Herbert <george.herbert(a)gmail.com> wrote:

    With all due respect; hell, yes, development comes in second to
    operational stability.

    This is not disrespecting development, which is extremely important by
    any measure. But we're running a top-10 worldwide website, a key
    worldwide information resource for humanity as a whole. We cannot
    cripple development to try and maximize stability, but stability has to
    be priority 1. Any large website's teams will have the same attitude.

    I've had operational outages reach the top of everyone's news
    source/feed/newspaper/broadcast. This is an exceptionally unpleasant
    experience.

  If you really think stability is top priority, then you cannot possibly
  think that the current deployment process is sane.

Developers shouldn't be blocked on deployment or operations. Development is
expensive and things will break either way. It's good to assume things will
break and:

1. Have a simple way to revert
2. Put tests in for common errors
3. Have post-mortems where information is kept for historical purposes and
bugs are created to track action items that come from them

...
  Right now you are placing the responsibility on the developers to make
  sure the site is stable, because any change they merge might break
  production since it is automatically sent out. If anything that gives the
  appearance that the operations team doesn't care about stability, and
  would rather wait until things break and revert them.

Yes! This is a _good_ thing. Developers should feel responsible for what
they build. It shouldn't be operations' job to make sure the site is
stable for code changes. Things should go more in this direction, in fact.

I'm not totally sure what you mean by "it's automatically sent out",
though. Deploys are manual.

...
  It is the responsibility of the operations team to ensure stability.
  Having to revert something because that's the only way production will be
  stable is not a proper workflow.

It's the responsibility of the operations team to ensure stability at the
infrastructure level, not at the application level. It's sane to expect to
revert things because things will break no matter what. Mean time to
recovery is as important as, or more important than, mean time between
failures.

- Ryan
