Optimizing the deployment train schedule


Erik Moeller

19 Oct 2013

12:26 a.m.

Hi folks,

after speaking to a few folks, I'd like to check in on the WMF deployment train schedule overall, and see if there are ways to optimize it.

(Note: In the below I refer to "test wikis" vs. "production wikis", generously including mediawiki.org as a test wiki. I realize that our "test wikis", with the exception of Labs wikis, run on the production cluster.)

== Current practice ==

* On Thursdays we increase the release counter, and deploy the latest release to test wikis and the previous one to Wikipedias.
* On Mondays we deploy the latest release to non-Wikipedias.

== Problems with this approach ==

* We only have bits of Thursday and all of Friday to resolve issues that are surfaced in the test wikis prior to the Monday rollout to the first production wikis.
* Having two stages of release also increases the cognitive load on developers in understanding when their code hits production wikis, which arguably increases the risk of negative impact of a deploy going unnoticed.

== Advantages of this approach ==

* Commons serves just about enough traffic to sometimes act as a useful canary for performance/scaling issues that will later appear in production.
* Developers have some post-deployment time to fix issues highly specific to the non-Wikipedia wikis (e.g. extensions & gadgets only deployed there) rather than being distracted by firefighting on Wikipedia.

== Some options ==

Option A: Change nothing. I've not heard from enough folks to see if the problems above are widely perceived to _be_ problems. If the consensus is that current practice, for now, is the best possible approach, obviously we should stick with it.

Option B: No Monday deploy. This would mean we'd have to improve our testing process to catch issues affecting the non-Wikipedia wikis before they hit production. I personally think getting rid of the Monday deploy could create some _desirable_ pain that would act as a forcing function to improve pre-release test practices, rather than using production wikis to test. At the same time, we'd have a full week to work out the kinks we find in testing before they hit any production wiki, and could have a more systematic process of backing out changes if needed prior to deployment.

Option C: Shift Monday deploys to Tuesday. This would at least give us an additional work day to fix issues that have occurred in testing before they hit prod. I personally don't think this goes far enough, but might be a useful tweak to make if option B seems too problematic.

Are there other ways to optimize / issues I'm missing or misrepresenting above?

Thanks,
Erik

--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation
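A rough illustration of the arithmetic behind the three options (a sketch only, not part of Erik's message): it counts the full working days that fall between the Thursday deploy to test wikis and the first deploy to a production wiki. The function name and the example date are assumptions chosen for illustration; 17 October 2013 is simply a Thursday close to the date of this thread.

```python
from datetime import date, timedelta


def working_days_between(start: date, end: date) -> int:
    """Count Monday-to-Friday days strictly between start and end."""
    days = 0
    current = start + timedelta(days=1)
    while current < end:
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            days += 1
        current += timedelta(days=1)
    return days


# Hypothetical example week: the Thursday deploy to test wikis.
thursday = date(2013, 10, 17)

first_production_deploy = {
    "Option A (Monday)": thursday + timedelta(days=4),
    "Option B (next Thursday)": thursday + timedelta(days=7),
    "Option C (Tuesday)": thursday + timedelta(days=5),
}

if __name__ == "__main__":
    for option, day in first_production_deploy.items():
        print(option, working_days_between(thursday, day))
```

The count deliberately ignores the remainder of the deploy Thursday itself ("bits of Thursday"), so it understates the available time slightly for every option by the same amount.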


James Forrester

19 Oct

1:18 a.m.

On 18 October 2013 15:26, Erik Moeller <erik(a)wikimedia.org> wrote:

...

Hi folks, after speaking to a few folks, I'd like to check in on the WMF deployment train schedule overall, and see if there are ways to optimize it.

[Snip]

I think Option B is a good option, and agree that it's a good thing that it forces us to have more discipline in the code that goes out in terms of testing/scaling/happiness, rather than spotting issues in production and using the "sister projects" as guinea pigs.

I'd note that this is effectively what we've had with VisualEditor since the beginning of deployment train releases in May last year (before the switch to weekly releases): because VE isn't deployed to any of the "sister projects", we go live each Thursday to phase 0, and with the previous version to phase 2; no wikis that get new code on Monday currently have VE enabled.

J.
--
James D. Forrester
Product Manager, VisualEditor
Wikimedia Foundation, Inc.
jforrester(a)wikimedia.org | @jdforrester

Rob Lanphier

2:39 a.m.

Hi Erik, I'm not a fan of removing one of the stages of our current deployments. More inline: On Fri, Oct 18, 2013 at 3:26 PM, Erik Moeller <erik(a)wikimedia.org> wrote:

...

Option B: No Monday deploy. This would mean we'd have to improve our testing process to catch issues affecting the non-Wikipedia wikis before they hit production. I personally think getting rid of the Monday deploy could create some _desirable_ pain that would act as a forcing function to improve pre-release test practices, rather than using production wikis to test. At the same time, we'd have a full week to work out the kinks we find in testing before they hit any production wiki, and could have a more systematic process of backing out changes if needed prior to deployment.

The Monday deploy is where we catch load-based issues in a way that's not absolutely crushing. The cumulative traffic of the wikis is approximately 10% of our overall traffic, which is large enough to notice load-based problems, but small enough to make the difference between "hmm, we seem to have a load issue" and "oh crap, we just brought down the site".

We also generally discover many more issues through getting it in front of more people, but not foisting it on everyone. It's not great that there are bugs that some people have to suffer through, but it's better than making all people suffer through them. We can change the mix of wikis so that it's not always the same set that's part of the pilot group (and maybe one day in the glorious future be able to do mixed versioning on a per-wiki basis so that people could opt-in), but I'd rather not foist everything on everyone at once.

Finally, another advantage of staging things this way is that we get some time to focus on non-Wikipedia sister project bugs before we deploy to Wikipedia. There are often project-specific bugs, and our test infrastructure isn't *nearly* built out enough to catch even the majority of them. If we deploy to all projects at once, we get hit with all of the bugs at once.

...

Option C: Shift Monday deploys to Tuesday. This would at least give us an additional work day to fix issues that have occurred in testing before they hit prod. I personally don't think this goes far enough, but might be a useful tweak to make if option B seems too problematic.

I like this option. U.S. Holidays (and holidays observed by a significant chunk of key WMF employees) often fall on Monday, which means we often have to reschedule these for Tuesday anyway. Rob

MZMcBride

3:41 a.m.

Rob Lanphier wrote:

...

I like this option. U.S. Holidays (and holidays observed by a significant chunk of key WMF employees) often fall on Monday, which means we often have to reschedule these for Tuesday anyway.

I agree. Though everyone to have commented here so far (myself included) don't deploy code or help fix the bugs that arise after (not directly, anyway). I'd be most interested to hear from Sam, Arthur, Max, Greg, et al. (the people on <https://wikitech.wikimedia.org/wiki/Deployments>) about how the deployments process(es) are working. MZMcBride

Chris Steipp

5:56 a.m.

On Fri, Oct 18, 2013 at 6:41 PM, MZMcBride <z(a)mzmcbride.com> wrote:

...

Rob Lanphier wrote:

I like this option. U.S. Holidays (and holidays observed by a significant chunk of key WMF employees) often fall on Monday, which means we often have to reschedule these for Tuesday anyway.

I deploy, both extensions and "the train" in rare cases. I would definitely vote for A or C. Although I'd like to think that option B would force better pre-cluster testing, I think a lot of that "desirable" pain would be entirely focused on the person doing the deploy (or the rest of the platform team, who get pulled in when things go really bad), and not on the developer/team who caused the issue.

...

_______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Arthur Richards

21 Oct

7:26 p.m.

On Fri, Oct 18, 2013 at 6:41 PM, MZMcBride <z(a)mzmcbride.com> wrote:

...

Though everyone to have commented here so far (myself included) don't deploy code or help fix the bugs that arise after (not directly, anyway). I'd be most interested to hear from Sam, Arthur, Max, Greg, et al. (the people on <https://wikitech.wikimedia.org/wiki/Deployments>) about how the deployments process(es) are working.

Personally, I do not have a strong opinion about this yet. The mobile web team just got on the deployment train two weeks ago (previously we managed our own weekly deployments that went out cluster-wide), so it feels too early for me to have a sense of what works well/doesn't as we're still working out some internal kinks and getting used to the new rhythm.

That said, 'option c' seems really sensible to me - it would be nice to have the extra working day to address issues that cropped up on the testwikis before pushing changes out to the non-wikipedia wikis.

--
Arthur Richards
Software Engineer, Mobile
[[User:Awjrichards]]
IRC: awjr
+1-415-839-6885 x6687

Jon Robson

10:36 p.m.

With mobile having just joined it, the only feedback I can give so far is that it is confusing knowing what is where. I'm not quite sure how to reduce that confusion yet, other than having a Gerrit page which tells me what is deployed everywhere, so I can check out the state of mediawiki.org or en.wiki when debugging issues.

In case it is useful, I drew a tube map after a quick chat with Greg to describe the deployment train process: https://commons.wikimedia.org/wiki/File:Deployment_train_tube_map_for_Media…

On Mon, Oct 21, 2013 at 10:26 AM, Arthur Richards <arichards(a)wikimedia.org> wrote:

...

On Fri, Oct 18, 2013 at 6:41 PM, MZMcBride <z(a)mzmcbride.com> wrote:

-- Jon Robson http://jonrobson.me.uk @rakugojon

C. Scott Ananian

7 Nov

6:27 p.m.

On Mon, Oct 21, 2013 at 4:36 PM, Jon Robson <jdlrobson(a)gmail.com> wrote:

...

It seems to me that having a gerrit (or other) page somewhere which lists exactly what is currently deployed where (and when the next scheduled deploy is) is a prerequisite for all of the more aggressive "let's mix up the set of wikis in early deploy" suggestions. --scott -- (http://cscott.net)

Jon Robson

6:30 p.m.

C. Scott Ananian

6:46 p.m.

Even us technical folks are often ignorant of the deeper ways of ops. When I last fixed a long-standing bug in the PHP parser with the potential to cause regressions in existing wikitext, it was not exactly trivial to keep track of where the code was currently live (and exactly when it went live) -- complicated by the fact that I was convinced that the code *wasn't* actually in production, despite all evidence to the contrary, because HTML Tidy was turned on in production and hid the beneficial effects of my patch. That's just anecdotal evidence of the fact that making deployment/version info as obvious as possible can be useful even for "ordinary bug fixers". --scott On Thu, Nov 7, 2013 at 12:30 PM, Jon Robson <jdlrobson(a)gmail.com> wrote:

...

That would indeed be useful, C. Scott. Actually, in my experience the people who seem to care most about what is currently deployed where are product owners and designers, who are not usually technical. It would be good to give them an easy way to look this up, as I spend a lot of time debugging why something is not live...

-- (http://cscott.net)

Greg Grossmeier

7:07 p.m.

...

On Mon, Oct 21, 2013 at 4:36 PM, Jon Robson <jdlrobson(a)gmail.com> wrote:

Best we have right now is a combination of:

* The wmfXX release notes pages (autogenerated with love by Reedy). eg: https://www.mediawiki.org/wiki/MediaWiki_1.23/wmf2
* The "Included In" dropdown in Gerrit. eg, go to https://gerrit.wikimedia.org/r/#/c/93980/, click "Included in", and see a list of "Master" and "wmf/1.23wmf3".

Now, you need to correlate 1.23wmf3 and: https://www.mediawiki.org/wiki/MediaWiki_1.23/Roadmap#Schedule_for_the_depl… (which is updated by hand by mostly Reedy and sometimes me)

Yeah, not elegant at all. Who wants to devote some time to making a nice purty dashboard for this info? :)

Greg

...

--scott -- (http://cscott.net)

-- | Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E | | identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |
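To make the manual correlation Greg describes a little more concrete, here is a minimal sketch of the lookup a dashboard would have to perform. It is not an existing tool: the `DEPLOYED` snapshot, the group names, and the function name are all hypothetical, and a real version would need to be fed from the roadmap page or the deployment configuration rather than a hard-coded table.

```python
# Hypothetical snapshot of which wmf branch each deployment group is running.
# In practice this is the information kept by hand on the roadmap page.
DEPLOYED = {
    "test wikis": "1.23wmf3",
    "non-Wikipedias": "1.23wmf3",
    "Wikipedias": "1.23wmf2",
}


def groups_running(included_in):
    """Return the groups whose deployed branch contains the change.

    `included_in` is the list of names shown by Gerrit's "Included in"
    dropdown, e.g. ["Master", "wmf/1.23wmf3"].
    """
    # Strip the "wmf/" prefix so names compare against the DEPLOYED values.
    branches = {name.split("/", 1)[-1] for name in included_in}
    return [group for group, branch in DEPLOYED.items() if branch in branches]


if __name__ == "__main__":
    print(groups_running(["Master", "wmf/1.23wmf3"]))
```

With the snapshot above, a change listed under "wmf/1.23wmf3" is reported as live on the test wikis and non-Wikipedias but not yet on the Wikipedias, which is the question product owners and bug fixers keep asking in this thread.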

Quim Gil

7:12 p.m.

On 11/07/2013 10:07 AM, Greg Grossmeier wrote:

...

Who wants to devote some time to making a nice purty dashboard for this info? :)

It's probably a bit late for this northernHemisphere(Winter), but... You can trust that someone will take it from here in the next 6 months, or you can write one paragraph and a related enhancement request in Bugzilla, and post it at https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects Yes, I know you know but, you know... :) -- Quim Gil Technical Contributor Coordinator @ Wikimedia Foundation http://www.mediawiki.org/wiki/User:Qgil

Greg Grossmeier

8:41 p.m.

New subject: Dashboard for code deployments; where, when, what? (was: Optimizing the deployment train schedule)

...

You can trust that someone will take it from here in the next 6 months, or you can write one paragraph and a related enhancement request in Bugzilla, and post it at https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects

Wasn't sure where to put it there, and I wanted it to exist as a project that anyone (not just mentees) could work on, so I created: https://www.mediawiki.org/wiki/Wikimedia_Release_%26_QA_Team/Wishlist Feel free to include wherever :) Greg -- | Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E | | identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |

MZMcBride

8 Nov

1:11 a.m.

New subject: Dashboard for code deployments; where, when, what? (was: Optimizing the deployment train schedule)

Greg Grossmeier wrote:

...

Both this reply and your previous reply give the impression that you have no interest in directly working on a code deployment dashboard yourself. Perhaps a stupid question, but why is that? Sam seems to have the deployments to the wikis under control and the Marks seem to have the third-party releases under control. This seems like a good project for you. MZMcBride

Greg Grossmeier

6:11 a.m.

New subject: Dashboard for code deployments; where, when, what? (was: Optimizing the deployment train schedule)

...

Both this reply and your previous reply give the impression that you have no interest in directly working on a code deployment dashboard yourself. Perhaps a stupid question, but why is that?

I have interest, just right now not much time. I would love to just do it but realistically won't get anything significant any time soon.

...

Sam seems to have the deployments to the wikis under control and the Marks seem to have the third-party releases under control. This seems like a good project for you.

Along with everything else, yeah. ;) Thanks for the tip. :)

I'm also looking into reusing something like the Etsy pushbot for helping with the [[wikitech:Lightning deploys]] when they get busy.

But this was already on my 'task list' (I use taskwarrior):

    greg@x200s:~$ task list
    ID Proj       Pri Due        Age Description
    23 wmf.releng M   11/13/2013 1d  design what I want from a dashboard

That's referring to a dashboard for a number of things; not just where the code is when.

Greg

--
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg                A18D 1138 8E47 FAC8 1C7D |

Antoine Musso

9:31 a.m.

Le 07/11/13 18:27, C. Scott Ananian a écrit :

...

Whenever someone comes with a script, I will be happy to integrate it so it is continuously generated whenever a change is merged in wmf branches. The resulting output can be hosted at https://integration.wikimedia.org/dashboard/ -- Antoine "hashar" Musso

Brian Wolff

19 Oct

7:51 p.m.

On 2013-10-18 9:40 PM, "Rob Lanphier" <robla(a)wikimedia.org> wrote:

...

> Hi Erik, I'm not a fan of removing one of the stages of our current deployments. More inline:
>
> On Fri, Oct 18, 2013 at 3:26 PM, Erik Moeller <erik(a)wikimedia.org> wrote:
> > Option B: No Monday deploy. This would mean we'd have to improve our testing process to catch issues affecting the non-Wikipedia wikis before they hit production. I personally think getting rid of the Monday deploy could create some _desirable_ pain that would act as a forcing function to improve pre-release test practices, rather than using production wikis to test. At the same time, we'd have a full week to work out the kinks we find in testing before they hit any production wiki, and could have a more systematic process of backing out changes if needed prior to deployment.
>
> ... maybe one day in the glorious future be able to do mixed versioning on a per-wiki basis so that people could opt-in), but I'd rather not foist everything on everyone at once.
>
> Finally, another advantage of staging things this way is that we get some time to focus on non-Wikipedia sister project bugs before we deploy to Wikipedia. There are often project-specific bugs, and our test infrastructure isn't *nearly* built out enough to catch even the majority of them. If we deploy to all projects at once, we get hit with all of the bugs at once.
>
> > Option C: Shift Monday deploys to Tuesday. This would at least give us an additional work day to fix issues that have occurred in testing before they hit prod. I personally don't think this goes far enough, but might be a useful tweak to make if option B seems too problematic.
>
> I like this option. U.S. Holidays (and holidays observed by a significant chunk of key WMF employees) often fall on Monday, which means we often have to reschedule these for Tuesday anyway.
>
> Rob

Tuesdays are also nice as that gives a day for bugs filed by a user on a weekend to be found/triaged by someone, and the correct person notified before the next stage of deploy.

As a user I have vague memories of the site going down much more often in the past due to performance issues, which doesn't seem to happen anymore with the split-up deploy. Our ability to do effective load testing prior to a deploy is essentially zero other than reading code afaik, and I have yet to hear any proposals to change that. I don't think the pain points caused would actually get fixed. (Ok, I guess comparing profiling data of the testwikis before and after deploy carefully can reveal performance issues, but I still think one has to actually test with high load to see the high load issues.)

-bawolff

Antoine Musso

9:35 p.m.

Le 19/10/13 00:26, Erik Moeller a écrit :

...

Are there other ways to optimize / issues I'm missing or misrepresenting above?

Hello,

As a summary, we deploy a new release in three stages spread over a one-week window, with the last stage of the previous window occurring on the same day as the first stage of the next window.

The three stages are:

1) test wikis (ie mediawiki)
2) non-wikipedias
3) wikipedias

The stages are scheduled as:

Thursday: window 1 stage 1
Monday: window 1 stage 2
Thursday+7: window 1 stage 3, window 2 stage 1
Monday: window 2 stage 2
...

What about doing all three stages the same day? We could take advantage of our 18 hours presence from Europe to San Francisco. Hence we could go with something like:

8:00 UTC (1am PST): deploy on test wikis (Europe folks)
16:00 UTC (9am PST): deploy non wikipedias (Europe, East Coast + SF)
20:00 UTC (1pm PST): deploy on wikipedias (East Coast + SF)

European folks would catch issues appearing on test wikis, the non wikipedias could be done with Europe+SF and the wikipedias by SF. We also have ops coverage on all that time frame.

With such a system, we could keep deploying on Thursdays and Mondays, though we will deploy two releases per week.

Evil plan: deploy automatically on merge. But we are not ready yet :-]

--
Antoine "hashar" Musso
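The overlapping windows in Antoine's summary of the current schedule can be hard to picture, so here is a small sketch that generates the calendar for a few consecutive windows. The day offsets come straight from the summary above (Thursday, Monday, Thursday+7); the function name, the tuple layout, and the starting date are assumptions made for illustration.

```python
from datetime import date, timedelta

# Days after a window's first Thursday on which each stage is deployed:
# stage 1 = test wikis, stage 2 = non-wikipedias, stage 3 = wikipedias.
STAGE_OFFSETS = {1: 0, 2: 4, 3: 7}


def schedule(first_thursday, windows):
    """Return sorted (date, window, stage) tuples for consecutive windows."""
    events = []
    for index in range(windows):
        start = first_thursday + timedelta(weeks=index)
        for stage, offset in STAGE_OFFSETS.items():
            events.append((start + timedelta(days=offset), index + 1, stage))
    return sorted(events)


if __name__ == "__main__":
    # Hypothetical starting point: Thursday 17 October 2013.
    for day, window, stage in schedule(date(2013, 10, 17), 2):
        print(day.strftime("%a %Y-%m-%d"), f"window {window} stage {stage}")
```

The output shows stage 3 of one window and stage 1 of the next landing on the same Thursday, which is the two-versions-in-flight situation several people in this thread describe as confusing.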

Chris McMahon

21 Oct

3:26 p.m.

On Sat, Oct 19, 2013 at 1:35 PM, Antoine Musso <hashar+wmf(a)free.fr> wrote:

...

Le 19/10/13 00:26, Erik Moeller a écrit :

Are there other ways to optimize / issues I'm missing or misrepresenting above?

Evil plan: deploy automatically on merge. But we are not ready yet :-]

We're not ready-- except in the beta cluster we are. The earlier that changes are merged to the master branch, the more time we have for scrutiny of those changes in beta labs, and the deployment there is in fact all automated and hands-off. I still occasionally see code being merged to master very shortly before being deployed, which means that beta gets updated at about the same time as the test wikis, which occasionally causes surprises. -Chris

Greg Grossmeier

19 Oct

11:43 p.m.

Hi there, tldr; I like a modified Option C, but also propose a very different Option D that I think would also be good, either now or as the next next step. <quote name="Erik Moeller" date="2013-10-18" time="15:26:16 -0700"> [snip overview of problem, combined with Robla's and you get a good picture of the issues.]

...

== Some options == Option A: Change nothing. I've not heard from enough folks to see if the problems above are widely perceived to _be_ problems. If the consensus is that current practice, for now, is the best possible approach, obviously we should stick with it.

I think this is a non-option, honestly. The current schedule has issues that can be resolved; let's try to resolve them.

...

Due to the concerns raised by Robla (and I, when in person), I'm not sure this is the right way to go next. It might be an option later when our cycle is a matter of a day or two, but not now with the week-long cycle.

...

I like this option as a next step, but with a caveat/suggestion: we mix up the wikis in stage 0, 1, and 2. And, we should be open to changing the mix more frequently and based on community feedback (I know some are actually willing/wanting to join the fun of being earlier in the cycle...).

Until we have a way to gradually increase the % of users who are using the new wmf *cross wiki*, our only option is doing things per wiki, which gives you two conceptual options: a test/production split, and that's it, or a tiered system like the 3-tier one we have now.

I have two suggestions; a safe one and a less safe one (where 'safe' being 'easy to sell to people'):

1) the safe one: We move Monday's deploy to Tuesday. Let some wikis move into phase 1 from phase 2, and some move from phase 1 to phase 2 (but probably keep phase 0 the same unless some community is as crazy as mw.org's ;) ). This will give more agency to communities on their placement in the cycle while still giving us a more thorough load test on Tuesday after blatant issues are found on Thur/Fri.

2) the less safe one (Option D): We have a four-tiered system. tier0 on Mon, tier1 on Tue, tier2 on Wed, tier3 on Thurs, on Friday we rest (er, merge into master for Monday).

Ideal breakdown of user load (of total cross cluster) would be something like:

tier0: 5% (5% total)
tier1: 20% (25% total)
tier2: 30% (55% total)
tier3: 45% (100%)

This gives us: increasing load, with more measurable moments in time.

What I mean by that is: With Ori's awesome new work (and planned work), we'll be able to make more sense of performance/load pre/post a deploy. We already look at 500s and similar logs, but those are lumped in the 'apparent bugs' that are found right after a deploy (along with obvious "this button went missing" things). With only a 3-tier system, where the first tier is so small that it is hard to tell signal from noise in pre/post deploy performance data, we still only get one chance to test load (tier1, non-wikipedias now) before going everywhere and potentially having downtime. I argue/theorize that with 3 deploys before we get to everywhere, we would be better able to spot performance issues.

Now, we probably can't do that idealized load distribution I lay out above. See: http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjectsOriginal.htm for the breakdown per project type. Also (for the Wikipedias' breakdown): http://stats.wikimedia.org/EN/TablesPageViewsMonthlyOriginalCombined.htm

<insert time where Greg goes off to sift through data>

Ok, I'm going to have to sit down with this data on Monday (this current naptime session won't be long enough) and come back with a proposed distribution.

Simply: I'll try to hit the above idealized breakdown, but with these restrictions:

A) ENWP in tier3 (which is 44% by itself, using Sept'13 data);
B) for tiers 1 and 2, get a mix of project types (ie: include WPs, wikibooks, wiktionaries, etc in both); and
C) tier0 being only testwikis (and mw.org). But leave this open for others to join, if desired.

Other benefits of Option D:

* gets us accustomed to more frequent deploys.
* will provide some of that beneficial pain Erik mentions (which is something I want as well, but only if intelligently planned pain)
* Is easier to conceptually understand (a growing release each week, with Fridays off). We'd of course have a page per tier with the current list of wikis in that tier (shouldn't change all that often) so people can answer "is X language project on the new release yet?".
* Obvious next step towards continuous from here is 2 day cycles twice a week, which is basically Option B on steroids.

== CONCLUSION! ==

If Option D doesn't sit well with people, let's go with a modified Option C.

Ok, wall of text is sufficiently long...

Greg

--
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg                A18D 1138 8E47 FAC8 1C7D |
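The running totals in the idealized Option D breakdown can be checked with a few lines, and the same sketch gives a rough comparison with the current three-stage setup. The Option D shares are the ones Greg lists; the `CURRENT` figures are an assumption built from Rob's "approximately 10%" estimate for the non-Wikipedia stage, with the test wikis treated as negligible, so they are only indicative.

```python
from itertools import accumulate

# Per-tier share of total traffic (percent), as proposed for Option D.
OPTION_D = {"tier0": 5, "tier1": 20, "tier2": 30, "tier3": 45}

# Assumed stand-in for today's three stages: ~10% for non-Wikipedias
# (Rob's estimate), test wikis treated as negligible, the rest Wikipedias.
CURRENT = {"test wikis": 0, "non-Wikipedias": 10, "Wikipedias": 90}


def cumulative_load(shares):
    """Map each stage to the total traffic share exposed once it is deployed."""
    return dict(zip(shares, accumulate(shares.values())))


def largest_step(shares):
    """Largest single increase in exposed traffic between consecutive stages."""
    return max(shares.values())


if __name__ == "__main__":
    print(cumulative_load(OPTION_D))
    print(largest_step(OPTION_D), largest_step(CURRENT))
```

Under these assumptions the biggest single jump in exposed traffic drops from roughly 90 percentage points to 45, which is one way to read the "more measurable moments in time" argument.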

Greg Grossmeier

7 Nov

5:47 a.m.

...

tldr; I like a modified Option C, but also propose a very different Option D that I think would also be good, either now or as the next next step.

This Monday is a US Holiday, so no deploys that day. Seems like a reasonable week to start on the Option C modification (ie: move Monday's deploy to Tuesday). Let's do that. I'd still like to move around the wikis in the various groups/phases, but that can wait (and will need to, as we need to see which ones want to move where). Greg -- | Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E | | identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |


wikitech-l@lists.wikimedia.org


participants (13)

Antoine Musso
Arthur Richards
Brian Wolff
C. Scott Ananian
Chris McMahon
Chris Steipp
Erik Moeller
Greg Grossmeier
James Forrester
Jon Robson
MZMcBride
Quim Gil
Rob Lanphier