One thing that impressed me when I started working with WMF is that reverting in production is as safe here as in any production environment I have ever seen. In the 20 months or so I've been here, I remember only one change that left behind corrupt data in prod. That change was made by a volunteer; the bug manifested in beta labs, but we failed to recognize its importance; and then the code change was merged on Thanksgiving Day by someone not on the team affected by the change. It was one of those perfect-storm sorts of problems.
We're good at reverting.
On Thu, Oct 31, 2013 at 12:26 PM, Toby Negrin tnegrin@wikimedia.org wrote:
How easy is it to roll back production changes? Is this something that can be done consistently and easily with our current tools? At other high-traffic sites I've worked at, this has been an important component of production engineering.
-Toby
On Wed, Oct 30, 2013 at 6:12 PM, Greg Grossmeier greg@wikimedia.org wrote:
First: Thanks for responding to this and writing it up.
<quote name="Yuri Astrakhan" date="2013-10-31" time="04:53:44 +0400">
> == Recomendations ==
> * Allow a bit more time between deployments and observe fatalmonitor before
>   and after
Agreed.
I put a ton of blame on myself for not slowing down the cadence of LD slots when a bunch of people are trying to get in on the same day.
For future LDs I am going to explicitly ask everyone to do what Yuri suggests (monitor fatals after your deploy) before saying that you're done. 5 minutes post-deploy of watching the fatalmonitor isn't unreasonable, I don't think.
Relatedly, I think we should reassess the Lightning Deploys :)
Not necessarily to get rid of them (probably not), but:
1) how many deploys can go in one LD? How many do we *want* to go?
2) from 1, is the length of the LD long enough/too long?
3) LD management is still pretty high-communication ("Alright, who's in line? Who's up next? Are you done yet?") There are basic tools that can help with this (Etsy has an IRC "pushbot" that manages the queue mostly automatically, for instance); I'll look into those/test them.
4) probably more, aka: your thoughts?
Greg
PS: graph of the fatals attached, just for completeness' sake.
--
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg                A18D 1138 8E47 FAC8 1C7D |
Engineering mailing list Engineering@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/engineering