First: Thanks for responding to this and writing it up.
<quote name="Yuri Astrakhan" date="2013-10-31" time="04:53:44 +0400">
== Recommendations ==
- Allow a bit more time between deployments and observe fatalmonitor before
and after
</quote>
Agreed.
I put a ton of blame on myself for not slowing down the cadence of LD slots when a bunch of people are trying to get in on the same day.
For future LDs I am going to explicitly ask everyone to do what Yuri suggests (monitor fatals after your deploy) before saying that you're done. Watching fatalmonitor for 5 minutes post-deploy isn't unreasonable, I think.
Relatedly, I think we should reassess the Lightning Deploys :)
Not necessarily to get rid of them (probably not), but:
1) How many deploys can go in one LD? How many do we *want* to go?
2) Following from 1, is the LD window long enough, or too long?
3) LD management is still pretty high-communication ("Alright, who's in line? Who's up next? Are you done yet?"). There are basic tools that can help with this (Etsy has an IRC "pushbot" that manages the queue mostly automatically, for instance); I'll look into those/test them.
4) probably more, aka: your thoughts?
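To make points 1 and 3 a bit more concrete, here is a rough sketch of the bookkeeping a queue bot could take over. This is hypothetical Python, not Etsy's pushbot and not anything we run today: the `DeployQueue` class, its method names, and the `max_deploys` cap are all made up for illustration, and there is no IRC wiring at all.

```python
from collections import deque


class DeployQueue:
    """Hypothetical sketch of a Lightning Deploy queue (not Etsy's pushbot)."""

    def __init__(self, max_deploys):
        self.max_deploys = max_deploys  # assumed cap on deploys per LD window
        self.waiting = deque()          # people in line, in join order
        self.current = None             # whoever is deploying right now
        self.finished = []              # people who have said they're done

    def join(self, nick):
        """Add nick to the line; refuse if the window is already full."""
        if nick == self.current or nick in self.waiting:
            return f"{nick}: you're already in line"
        taken = len(self.waiting) + len(self.finished) + (self.current is not None)
        if taken >= self.max_deploys:
            return f"{nick}: this window is full"
        self.waiting.append(nick)
        if self.current is None:
            return self._advance()
        return f"{nick}: you're number {len(self.waiting)} in line"

    def done(self, nick, watched_fatals):
        """Release the slot, but only after the deployer watched fatalmonitor."""
        if nick != self.current:
            return f"{nick}: it's not your turn"
        if not watched_fatals:
            return f"{nick}: please watch fatalmonitor for 5 minutes first"
        self.finished.append(nick)
        self.current = None
        return self._advance()

    def _advance(self):
        """Hand the slot to the next person in line, if any."""
        if not self.waiting:
            return "queue is empty"
        self.current = self.waiting.popleft()
        return f"{self.current}: you're up"
```

The idea is that the "who's in line / who's next / are you done yet" chatter becomes `join` and `done` calls, and the per-window cap from point 1 is a single number we can argue about.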
Greg
PS: graph of the fatals attached, just for completeness' sake.
wikitech-l@lists.wikimedia.org