[X-posting to ops as this discussion is relevant there too]
On Wed, Feb 17, 2016 at 5:53 PM, Erik Bernhardson ebernhardson@wikimedia.org wrote:
On Feb 17, 2016 1:50 AM, "Guillaume Lederrey" glederrey@wikimedia.org wrote:
Hello team!
== Versioning ==
** my belief ** anything deployed must have a version number
** what happens at WMF **
- deployments on labs are pretty much free-form: cherry-pick whatever you want on the puppetmaster
- deployments on prod seem to have version numbers, at least for mediawiki code; puppet code is deployed directly from the production branch (rough sketch of both flows below)
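To make the contrast concrete, here is a rough sketch of the two flows as I understand them; hostnames and paths are illustrative, not the real ones:

    # Labs / beta: free-form, cherry-pick unmerged changes directly on the project puppetmaster
    ssh deployment-puppetmaster.example.wmflabs     # illustrative hostname
    cd /var/lib/git/operations/puppet               # illustrative path
    git fetch origin
    git cherry-pick <sha-of-unmerged-change>        # pick whatever you want to test

    # Production: no cherry-picks; the puppetmaster only fast-forwards along the production branch
    cd /var/lib/git/operations/puppet
    git fetch origin
    git merge --ff-only origin/production           # only reviewed, merged changes land here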
** comments **
Having clear version numbers implies making a conscious decision to create a version, potentially with appropriate checks on the content of that version and additional testing. It allows a clear separation between creating a version and promoting it to production. Not having versions everywhere allows more flexibility and puts the responsibility for making the right choices more on the people than on the process. That is probably a good thing if you have smart enough people (and WMF seems to have a pretty smart crowd).
Having a shared git repository on deployment-puppetmaster scares the hell out of me! I'm so used to preparing anything I want to push locally and then just applying a specific tag / version...
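For comparison, the tag-based workflow described above looks roughly like this (version numbers and remotes are made up for illustration):

    # Locally: decide what goes into the release, then cut an explicit version
    git tag -a v1.42.0 -m "release 1.42.0"
    git push origin v1.42.0

    # On the target: deploy exactly that version, nothing more, nothing less
    git fetch --tags origin
    git checkout v1.42.0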
Puppet being unversioned certainly makes it different from the rest of our deployments. I think ops gets away with this by having relatively few people committing code. It also has to do with the careful nature of puppet deployments: puppet is typically deployed one patch at a time. I think this helps with understanding what just broke everything, compared to a big release with many disparate changes.
Puppet is _always_ deployed one patch at a time, except in very special cases, and I do think that's a very good thing for operations. There are a few reasons why:
1) Minimize change risk/surface: given we're a very high traffic website with a mildly complex architecture, you can't realistically expect to validate a large set of changes without throwing live traffic at them. I've seen ops teams working with stricter change management strategies, and the risk of *big troubles* has always been higher.
2) Speed of deployment: we're a very small team for the amount of things we're doing in parallel. We can't seriously expect to keep up the pace with stricter change management (as in, deploying a new version of our puppet code N times a week after rigorous testing and picking the changes that make the cut).
3) Keeping changes independent: since the puppet repo is large and includes all of production, having changes to independent systems tied together is a recipe for disaster: rolling back one change would mean rolling back all of them, frustrating a lot of people and probably requiring coordination with other teams (a sketch of reverting a single change independently follows below). You could just revert the affected change and make a new point release, but then I completely fail to see how having releases does us any good.
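To illustrate point 3: in the one-patch-at-a-time model, undoing a single bad change is just one revert, with no effect on unrelated changes. The hash and the Gerrit-style push target below are placeholders, not actual values:

    # Find the offending change on the production branch
    git log --oneline -5 origin/production
    # abc1234  role::cache: tweak backend timeouts   <- the bad one (placeholder)

    # Revert only that change; everything merged before or after it stays in place
    git revert abc1234
    git push origin HEAD:refs/for/production   # send the revert for review (Gerrit-style, illustrative)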
About cherry-picks in beta: the problem is not cherry-picking (I think it's a reasonable way to test things); persistent cherry-picking to monkey-patch problems is. I think if we follow the flow of:
- writing a patch
- testing it on beta with a cherry-pick
- getting it merged on ops/puppet and into production
and all of this happens within a week, that would be a decent compromise.
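A minimal sketch of that cycle, assuming the beta puppetmaster carries local cherry-picks on top of the upstream branch (hostnames and SHAs are placeholders):

    # 1) Write the patch and push it for review as usual

    # 2) Test it on beta by cherry-picking it onto the beta puppetmaster
    git fetch origin
    git cherry-pick <sha-of-your-change>
    sudo puppet agent --test      # then run puppet on an affected beta host and verify

    # 3) Once the patch is merged in ops/puppet, drop the local cherry-pick
    git fetch origin
    git rebase origin/production  # the now-merged pick becomes empty and is dropped,
                                  # so nothing lingers as a long-lived monkey patch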
- I still have not found a global architecture schema (something like a high-level component or deployment diagram). But then, I have never seen any company that actually has one...
Pretty sure one doesn't exist :(
Luca (the new analytics opsen) has started to work on https://wikitech.wikimedia.org/wiki/File:Infrastructure_overview.png
I asked him to share the sources for it so that everyone can improve it.
Also, if you need some oral history, just ask opsens and we'll be happy to give you an overview of how things work :)
Cheers,
Giuseppe