Hello,
it is probably worth explaining why we have had this headless-chicken run lately with all these new servers, and why we didn't do this slowly but surely before all the slowdowns hit us.
We use Ganglia to understand cluster capacity, and the main overview lives at: http://ganglia.wikimedia.org/pmtpa/?gw=fwd&gs=Wikimedia%40http%3A%2F%2Fg...
One of the things that distorted our understanding was that the aggregate graph didn't exclude servers that were out of rotation for one reason or another, so alongside the highly-loaded servers the average calculations had 0-load hosts in the mix - which showed up as quite a bit of white space in the aggregate. Fixing that immediately revealed a somewhat worse situation than it had appeared before (though per-host statistics already showed the problem).
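To put rough numbers on that effect (the figures below are invented, purely to illustrate how averaging depooled boxes into the aggregate hides the real load):

# Toy numbers, not real cluster data: per-host CPU utilization in percent.
in_rotation = [85, 90, 88, 92, 87]    # busy apaches
out_of_rotation = [0, 0, 0]           # depooled boxes still reporting 0 load

naive_avg = sum(in_rotation + out_of_rotation) / float(len(in_rotation + out_of_rotation))
real_avg = sum(in_rotation) / float(len(in_rotation))

print("aggregate with depooled hosts: %.1f%%" % naive_avg)   # ~55%
print("aggregate, in-rotation only:   %.1f%%" % real_avg)    # ~88%

With a handful of depooled hosts in the mix, a cluster that is actually running close to 90% looks like it has plenty of headroom.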
Another issue is that any long-term graph (say monthly or yearly) shows averages that do not properly represent peak-time load. So we do see the average increasing, but it does not look _that_ frightening.
Then... we had a _sharp_ CPU usage increase three weeks ago. That still needs investigation - it could be anything, from some evil metatemplate introduced on a major wiki (maybe {{cite}} stuff changed? :) to some bot hitting slower code paths, to simply bad code.
So, from operations perspective, it would be really nice to have:
a) Long-term data collection of maximum CPU load values (uhm, say, maximum hourly averages).
b) Graphing / long-term data collection for profiling points.
c) Ability to profile template costs/impacts apart from general Parser profiling.
d) An alarm when we hit something above a threshold. We have to notice that within an hour, not within days. :) (A rough sketch for (a) and (d) follows below.)
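For (a) and (d), something along these lines would already be a start. This is only a rough sketch: it assumes gmetad is answering its XML dump on port 8651 of the Ganglia host (hostname, threshold and polling interval below are placeholders), and that hosts report the stock cpu_user / cpu_system metrics. It just keeps the worst cluster-wide CPU value seen per hour and yells when a sample crosses the threshold.

# Rough sketch only - placeholders everywhere, not production code.
import socket
import time
from xml.dom import minidom

GMETAD = ('ganglia.wikimedia.org', 8651)   # placeholder host/port
THRESHOLD = 80.0                           # alarm above this many % CPU
POLL = 60                                  # seconds between samples

def fetch_xml():
    # gmetad dumps the whole cluster state as XML and then closes the socket.
    s = socket.create_connection(GMETAD)
    chunks = []
    while True:
        data = s.recv(8192)
        if not data:
            break
        chunks.append(data)
    s.close()
    return b''.join(chunks)

def cluster_cpu(xml):
    # Average cpu_user + cpu_system over all hosts that report CPU metrics.
    doc = minidom.parseString(xml)
    total, hosts = 0.0, 0
    for host in doc.getElementsByTagName('HOST'):
        cpu, seen = 0.0, False
        for m in host.getElementsByTagName('METRIC'):
            if m.getAttribute('NAME') in ('cpu_user', 'cpu_system'):
                cpu += float(m.getAttribute('VAL'))
                seen = True
        if seen:
            total += cpu
            hosts += 1
    return total / hosts if hosts else 0.0

hourly_max = 0.0
hour = time.strftime('%Y-%m-%d %H')
while True:
    cpu = cluster_cpu(fetch_xml())
    hourly_max = max(hourly_max, cpu)
    if cpu > THRESHOLD:
        print(time.strftime('%Y-%m-%d %H:%M:%S'),
              'ALARM: cluster CPU at %.1f%%' % cpu)
    now = time.strftime('%Y-%m-%d %H')
    if now != hour:                        # hour rolled over, record the peak
        print(hour, 'hourly max CPU: %.1f%%' % hourly_max)
        hour, hourly_max = now, 0.0
    time.sleep(POLL)

Persisting those hourly maxima somewhere (RRD, or even a flat file) would give us exactly the long-term peak view that the average-based graphs are hiding.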
Or of course, being more attentive helps too.
Cheers,