Hello all,
This message is for those of you who do deployments to the WMF cluster.
On the [[How to deploy code]] wikitech page, there is a section on Testing your live code: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Test_your_code_live
That's a pretty basic overview of it and it could be greatly improved with information like:
- How to monitor specific parts of the cluster that are relevant to what you deployed
- What general monitoring should be looked at after you deploy
I know many of you already do much of this after you deploy, but the lack of documentation on *how* to do it was a recurring theme in the initial interviews I did with engineering teams when I first started. https://wikitech.wikimedia.org/wiki/Deployments/Features_Process/General_Fee...
== "The Ask" ==
I'm asking you ("you" being those of you who have experience doing post-deploy monitoring) to please add more documentation to this section of the How to deploy code page: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Test_your_code_live
I expect people from both engineering and ops will have feedback here.
Also, if you don't know how to monitor/log things post-deploy but have specific questions, please ask here so that someone who does know can answer on the wiki.
Thanks,
Greg
On Tuesday, April 23, 2013 at 1:06 PM, Greg Grossmeier wrote:
Hello all,
This message is for those of you who do deployments to the WMF cluster.
On the [[How to deploy code]] wikitech page, there is a section on Testing your live code: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Test_your_code_live
That's a pretty basic overview of it and it could be greatly improved with information like:
- How to monitor specific parts of the cluster that are relevant to what you deployed
- What general monitoring should be looked at after you deploy
MediaWiki exceptions / fatals are plotted in Ganglia now, though somewhat awkwardly under node vanadium.eqiad.wmnet (where they're getting tallied) rather than the node on which the error originated. I think the way it's done now deserves another thought (maybe this ought to go in Graphite instead?), but at the same time it is sufficiently intelligible to be of _some_ use.
The most useful view is the last two hours' worth of exceptions and misc. fatals (evergreen link):
http://ganglia.wikimedia.org/latest/graph.php?r=2hr&z=xlarge&title=M...
(The m is 'milli', so the current peaks correspond to one exception / fatal every 6-10 seconds.)
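For anyone who wants to check that arithmetic, here is a minimal sketch of the conversion. It assumes the graph's y-axis is a rate in events per second, so a reading of "100m" means 0.1 events per second. The function name and the sample readings (100m and 167m) are made up for illustration; they are simply the values that would correspond to the 6-10 second range mentioned above, not numbers read off the actual graph.

```python
def seconds_between_events(milli_rate: float) -> float:
    """Convert a rate given in milli-events per second (the 'm' suffix)
    into the average number of seconds between events.

    Assumption: the graph plots events per second; this helper is
    hypothetical and not part of Ganglia.
    """
    if milli_rate <= 0:
        raise ValueError("rate must be positive")
    events_per_second = milli_rate / 1000.0  # 'm' = milli = 1/1000
    return 1.0 / events_per_second


if __name__ == "__main__":
    # Illustrative readings only, chosen to match the 6-10 second range.
    for reading in (100.0, 167.0):
        interval = seconds_between_events(reading)
        print(f"{reading:.0f}m/s -> one event every {interval:.1f} s")
```

In other words, the smaller the "m" value on the graph, the longer the gap between exceptions.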
I'll add it to the post-deployment instructions if people find it useful.
-- Ori Livneh
<quote name="Ori Livneh" date="2013-04-23" time="15:23:49 -0700">
I'll add it to the post-deployment instructions if people find it useful.
Just to be explicit: Please do! ;-)
Greg
On Tue, Apr 23, 2013 at 3:23 PM, Ori Livneh ori@wikimedia.org wrote:
I'll add [ganglia exception graphing] to the post-deployment instructions if people find it useful.
Seasoned deployment professionals, please edit the updated https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Test_and_monitor_your_live_code
I think we can express best practices better than "stay on IRC in case people yell 'Fire!'" (though that's a good start :)
Cheers,
--
=S Page
software engineer on Editor Engagement Experiments
Hello all,
<quote name="Ori Livneh" date="2013-04-23" time="15:23:49 -0700">
I'll add it to the post-deployment instructions if people find it useful.
Ori added that, and I believe S Page added some more info to that section.
https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Test_and_monitor_your...
How does it look? Anyone here have any corrections and/or additions that aren't represented there yet?
Thanks,
Greg
wikitech-l@lists.wikimedia.org