On 10/21/10 4:04 PM, Aryeh Gregor wrote:
> On Thu, Oct 21, 2010 at 6:31 PM, Neil Kandalgaonkar <neilk(a)wikimedia.org> wrote:
> > For what it's worth, I'm influenced by my former job at Flickr, where
> > the practice was to deploy several times *per day*, directly from trunk.
> > That may be more extreme than we want but be aware there are people who
> > are doing it successfully -- it just takes a few extra development
> > practices.
>
> Personally, I think it would be awesome if we could migrate to this
> level of deployment frequency eventually. I imagine that
> comprehensive automated test suites are a major part of making this
> reliable.

Nope. Automated tests help a lot with this approach but Flickr doesn't
have much better tests than MediaWiki does.
We *should* have better tests, but I would just say that it is not
required for us to have a great test suite before doing this.

> To the extent you can share any details about how stuff
> works at Flickr, what long-term changes are necessary for this to be
> practical?

Flickr engineers have already talked a lot about this in public. See
references below.
The main insight here is that branching is a bad way for a website to
manage change. We do not have an install base that's out there in the
world, like shrink-wrapped software, where we issue patches on CD. For a
website, we control the entire install base.[1]
What we need are ways of managing change across our server clusters, or
managing incremental feature and infrastructure upgrades. This leads to
"branching in code".
Doing things the Flickr way entirely would require:
1 - A "feature flag" system, for "branching in code". The point is to
start developing a new feature with it being turned off by default for
most environments and without succumbing to branching and merging
misery. In other words, day one of a new feature looks like this:
    if ( $wgFeature['MyNewThing'] ) {
        /* ... new code ... */
    } else {
        /* ... old code ... */
    }
Of course if you're fixing bugs there's no need to hide that behind a
feature flag.
2 - Every developer with commit access is thinking about deployment onto
a cluster of machines all the time. Committing to the repository means
you are asserting this will work in production. (This is the hard part
for us, I think, but maybe not insurmountable.)
3 - One can deploy with a single button press (and there is a system
recording what changes were deployed and why, for ops' convenience).
4 - When there's trouble, new deploys can be blocked centrally, and then
ops can revert to a previous version with a single button press.
5 - Developers are good about "cleaning up" code that was previously
protected by feature flags once the behaviour is standard. (HINT: this
is the part Flickr doesn't talk about in public... but as an open source
project with more visible dirty laundry, perhaps we can do better.)
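
To make "branching in code" from item 1 a bit more concrete, here is a minimal sketch of a flag lookup with per-environment overrides. It is written in Python purely for illustration, not PHP, and it is not how Flickr or MediaWiki actually implement this. The names DEFAULT_FLAGS, ENVIRONMENT_OVERRIDES, is_enabled and render_page, and the "staging" and "production" environment names, are all hypothetical.

```python
# Hypothetical flag registry: every new feature starts switched off.
DEFAULT_FLAGS = {"MyNewThing": False}

# Hypothetical per-environment overrides; the environment names are
# illustrative assumptions, not a description of any real cluster.
ENVIRONMENT_OVERRIDES = {
    "production": {},
    "staging": {"MyNewThing": True},
}


def is_enabled(flag, environment):
    """Return True if `flag` is on in `environment`; unknown flags are off."""
    overrides = ENVIRONMENT_OVERRIDES.get(environment, {})
    return overrides.get(flag, DEFAULT_FLAGS.get(flag, False))


def render_page(environment):
    """Branch in code: both paths live on trunk, the flag picks one."""
    if is_enabled("MyNewThing", environment):
        return "new code path"
    return "old code path"
```

With this shape, turning a feature on somewhere is a configuration change rather than a merge, and the cleanup in item 5 amounts to deleting the flag entry and the "old code path" branch once the new behaviour is standard.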
This system does result in more "oops" moments. But the point is to make
those easy to recover from, and to have a culture where people aren't
blamed too much for them. The point is not to build a system that tries to
ensure that deploy branches can be tested to be almost perfect. The real
problems are always things that nobody anticipated anyway.
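
As a rough illustration of items 3 and 4 (recorded deploys, a central block, and a one-step revert), here is a small in-memory sketch in Python. It only models the bookkeeping; a real tool would also have to push code to the cluster. The DeployLog class and its method names are hypothetical and not taken from Flickr's tooling.

```python
class DeployLog:
    """Toy model of a deploy record with a central block and one-step revert."""

    def __init__(self):
        self.live = []       # stack of revisions that have been live
        self.log = []        # audit trail of (action, revision, reason)
        self.blocked = False

    def deploy(self, revision, reason):
        """Record a deploy, unless deploys are currently blocked."""
        if self.blocked:
            raise RuntimeError("deploys are blocked")
        self.live.append(revision)
        self.log.append(("deploy", revision, reason))

    def block(self):
        """Centrally stop new deploys while there is trouble."""
        self.blocked = True

    def unblock(self):
        self.blocked = False

    def current(self):
        """Return the live revision, or None if nothing was deployed."""
        return self.live[-1] if self.live else None

    def revert(self):
        """Go back to the previous revision; allowed even while blocked."""
        if len(self.live) < 2:
            raise RuntimeError("no previous deploy to revert to")
        bad = self.live.pop()
        self.log.append(("revert", bad, "reverted to " + self.live[-1]))
        return self.live[-1]
```

The design choice worth noting is that revert stays available while deploys are blocked, so ops can stop the bleeding first and sort out the cause afterwards.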
NOTES
[1] For the purposes of this argument, I am ignoring MediaWiki as a
deliverable and only thinking about the project websites.
REFERENCES
Here's the most concise presentation:
"Always Ship Trunk: Managing Change In Complex Websites" by Paul Hammond
http://www.paulhammond.org/2010/06/trunk/alwaysshiptrunk.pdf
And a longer talk about all this from Paul Hammond and John Allspaw:
"10+ Deploys Per Day: Dev/Ops Cooperation at Flickr"
http://velocityconference.blip.tv/file/2284377/
Blog post about the Feature Flag system by Ross Harmes:
"Flipping out"
http://code.flickr.com/blog/2009/12/02/flipping-out/
--
Neil Kandalgaonkar ( ) <neilk(a)wikimedia.org>