Re: [Wikitech-l] [Engineering] Testing and monitoring deployed code

24 Apr 2013


      On Tuesday, April 23, 2013 at 1:06 PM, Greg Grossmeier wrote:
...
Hello all,
This message is for those of you who do deployments to the WMF cluster.
On the [[How to deploy code]] wikitech page, there is a section on
Testing your live code:
https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Test_your_code_live
That's a pretty basic overview of it and it could be greatly improved
with information like:

* How to monitor specific parts of the cluster that are relevant to what
  you deployed
* What general monitoring should be looked at after you deploy

MediaWiki exceptions / fatals are plotted in Ganglia now, though somewhat awkwardly under node vanadium.eqiad.wmnet (where they're getting tallied) rather than under the node on which the error originated. The current setup deserves another thought (maybe this ought to go in Graphite instead?), but it is sufficiently intelligible to be of _some_ use, I think.
The most useful view is the last two hours' worth of exceptions and misc. fatals (evergreen link):
http://ganglia.wikimedia.org/latest/graph.php?r=2hr&z=xlarge&title=M...
(The m is 'milli', so the current peaks correspond to one exception / fatal every 6-10 seconds.)
I'll add it to the post-deployment instructions if people find it useful.
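
To make the 'milli' arithmetic concrete, here is a minimal Python sketch. It assumes the graph's y-axis is a rate in events per second with an "m" (milli) suffix, which is what the 6-10 second figure above implies. The function name and the sample readings (100m and 167m) are illustrative only, not values taken from the graph.

```python
def seconds_between_events(rate_milli_per_sec: float) -> float:
    """Average gap in seconds between events, given a rate in milli-events/sec.

    Assumption: the y-axis value is events per second scaled by 1/1000,
    so a reading of "100m" means 0.1 events per second.
    """
    if rate_milli_per_sec <= 0:
        raise ValueError("rate must be positive")
    return 1000.0 / rate_milli_per_sec


if __name__ == "__main__":
    # Hypothetical y-axis readings, chosen to bracket the 6-10 second range.
    for reading in (100, 167):
        gap = seconds_between_events(reading)
        print(f"{reading}m/s is about one exception / fatal every {gap:.1f} s")
```

Read the other way around, a peak of 1000m would mean one exception / fatal per second.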
--
Ori Livneh
