On 10/21/10 4:04 PM, Aryeh Gregor wrote:
> On Thu, Oct 21, 2010 at 6:31 PM, Neil Kandalgaonkar <neilk@wikimedia.org> wrote:
>> For what it's worth, I'm influenced by my former job at Flickr, where the practice was to deploy several times *per day*, directly from trunk. That may be more extreme than we want, but be aware there are people who are doing it successfully -- it just takes a few extra development practices.
> Personally, I think it would be awesome if we could migrate to this level of deployment frequency eventually. I imagine that comprehensive automated test suites are a major part of making this reliable.
Nope. Automated tests help a lot with this approach but Flickr doesn't have much better tests than MediaWiki does.
We *should* have better tests, but I would just say that it is not required for us to have a great test suite before doing this.
> To the extent you can share any details about how stuff works at Flickr, what long-term changes are necessary for this to be practical?
Flickr engineers have already talked a lot about this in public. See references below.
The main insight here is that branching is a bad way for a website to manage change. We do not have an install base that's out there in the world, like shrink-wrapped software, where we issue patches on CD. For a website, we control the entire install base.[1]
What we need are ways of managing change across our server clusters, or managing incremental feature and infrastructure upgrades. This leads to "branching in code".
Doing things the Flickr way entirely would require:
1 - A "feature flag" system, for "branching in code". The point is to start developing a new feature with it turned off by default in most environments, without succumbing to branching-and-merging misery. In other words, day one of a new feature looks like this:
    if ( $wgFeature['MyNewThing'] ) {
        /* ... new code ... */
    } else {
        /* ... old code ... */
    }
Of course if you're fixing bugs there's no need to hide that behind a feature flag.
2 - Every developer with commit access is thinking about deployment onto a cluster of machines all the time. Committing to the repository means you are asserting that this will work in production. (This is the hard part for us, I think, but maybe not insurmountable.)
3 - One can deploy with a single button press (and there is a system recording what changes were deployed and why, for ops' convenience).
4 - When there's trouble, new deploys can be blocked centrally, and then ops can revert to a previous version with a single button press.
5 - Developers are good about "cleaning up" code that was previously protected by feature flags once the behaviour is standard. (HINT: this is the part Flickr doesn't talk about in public... but as an open source project with more visible dirty laundry, perhaps we can do better.)
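To make items 1 and 5 a bit more concrete, here is a minimal sketch of per-environment flags in PHP. Only `$wgFeature` comes from the snippet above; `$wgFeatureOverrides`, `isFeatureEnabled()`, `renderThing()`, and the environment names are hypothetical, made up for illustration, and are not existing MediaWiki settings or Flickr's actual implementation.

```php
<?php
// Hypothetical sketch: none of these names are real MediaWiki configuration.

// Defaults: a new feature starts out switched off everywhere.
$wgFeature = [
    'MyNewThing' => false,
];

// Per-environment overrides (environment names are assumptions).
$wgFeatureOverrides = [
    'dev'        => [ 'MyNewThing' => true ],
    'production' => [],
];

/**
 * Return whether a feature is enabled in the given environment.
 * An override wins over the default; an unknown flag counts as off,
 * so a misspelled flag name falls back to the old code path.
 */
function isFeatureEnabled( string $name, string $env, array $defaults, array $overrides ): bool {
    $envFlags = $overrides[$env] ?? [];
    return (bool)( $envFlags[$name] ?? $defaults[$name] ?? false );
}

// "Branching in code": both paths live in trunk, the flag picks one.
function renderThing( string $env, array $defaults, array $overrides ): string {
    if ( isFeatureEnabled( 'MyNewThing', $env, $defaults, $overrides ) ) {
        return 'new code path';
    }
    return 'old code path';
}

echo renderThing( 'dev', $wgFeature, $wgFeatureOverrides ), PHP_EOL;        // new code path
echo renderThing( 'production', $wgFeature, $wgFeatureOverrides ), PHP_EOL; // old code path
```

Under this sketch, the cleanup in item 5 would mean deleting the `else` path, the `isFeatureEnabled()` check, and the `'MyNewThing'` entries once the feature is on in every environment.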
This system does result in more "oops" moments. But the point is to make those easy to recover from, and to have a culture where people aren't blamed too much for them, rather than to build a system that tries to ensure deploy branches are tested to near perfection. The real problems are always things that nobody anticipated anyway.
NOTES
[1] For the purposes of this argument, I am ignoring MediaWiki as a deliverable and thinking only about the project websites.
REFERENCES
Here's the most concise presentation: "Always Ship Trunk: Managing Change In Complex Websites" by Paul Hammond http://www.paulhammond.org/2010/06/trunk/alwaysshiptrunk.pdf
And a longer talk about all this from Paul Hammond and John Allspaw: "10+ Deploys Per Day: Dev/Ops Cooperation at Flickr" http://velocityconference.blip.tv/file/2284377/
Blog post about the feature flag system by Ross Harmes: "Flipping out" http://code.flickr.com/blog/2009/12/02/flipping-out/