Hello,
it is probably worth explaining why we have had this headless-chicken run lately with all these new servers, and why we didn't do this slowly but surely before all the slowdowns hit us.
We use Ganglia to understand cluster capacity, and the main overview lives at: http://ganglia.wikimedia.org/pmtpa/?gw=fwd&gs=Wikimedia%40http%3A%2F%2Fg...
One of the things that distorted our understanding was that the aggregate graph didn't exclude servers that were out of rotation for one reason or another, so alongside the highly-loaded servers the average calculations had 0-load hosts in the mix - which showed up as quite a bit of white space in the aggregate. Fixing that immediately revealed a somewhat worse situation than it had appeared before (though per-host statistics already showed the problem).
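To put rough numbers on that effect (the figures below are invented, purely to illustrate how averaging depooled boxes into the aggregate hides the real load):

# Toy numbers, not real cluster data: per-host CPU utilization in percent.
in_rotation = [85, 90, 88, 92, 87]    # busy apaches
out_of_rotation = [0, 0, 0]           # depooled boxes still reporting 0 load

naive_avg = sum(in_rotation + out_of_rotation) / float(len(in_rotation + out_of_rotation))
real_avg = sum(in_rotation) / float(len(in_rotation))

print("aggregate with depooled hosts: %.1f%%" % naive_avg)   # ~55%
print("aggregate, in-rotation only:   %.1f%%" % real_avg)    # ~88%

With a handful of depooled hosts in the mix, a cluster that is actually running close to 90% looks like it has plenty of headroom.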
Another issue is that any long-term graph (say monthly or yearly) shows averages that do not properly represent peak-time load. So we do see the average increasing, but it does not look _that_ frightening.
Then... we had a _sharp_ CPU usage increase three weeks ago. That still needs investigation - it could be anything, from some evil metatemplate introduced on a major wiki (maybe {{cite}} stuff changed? :) to some bot hitting slower code paths, to simply bad code.
So, from operations perspective, it would be really nice to have:
a) Long-term data collection of maximum CPU load values (uhm, say, maximum hourly averages).
b) Graphing / long-term data collection for profiling points.
c) Ability to profile template costs/impacts apart from general Parser profiling.
d) An alarm when we hit something above a threshold. We have to notice that within an hour, not within days. :) (A rough sketch for (a) and (d) follows below.)
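For (a) and (d), something along these lines would already be a start. This is only a rough sketch: it assumes gmetad is answering its XML dump on port 8651 of the Ganglia host (hostname, threshold and polling interval below are placeholders), and that hosts report the stock cpu_user / cpu_system metrics. It just keeps the worst cluster-wide CPU value seen per hour and yells when a sample crosses the threshold.

# Rough sketch only - placeholders everywhere, not production code.
import socket
import time
from xml.dom import minidom

GMETAD = ('ganglia.wikimedia.org', 8651)   # placeholder host/port
THRESHOLD = 80.0                           # alarm above this many % CPU
POLL = 60                                  # seconds between samples

def fetch_xml():
    # gmetad dumps the whole cluster state as XML and then closes the socket.
    s = socket.create_connection(GMETAD)
    chunks = []
    while True:
        data = s.recv(8192)
        if not data:
            break
        chunks.append(data)
    s.close()
    return b''.join(chunks)

def cluster_cpu(xml):
    # Average cpu_user + cpu_system over all hosts that report CPU metrics.
    doc = minidom.parseString(xml)
    total, hosts = 0.0, 0
    for host in doc.getElementsByTagName('HOST'):
        cpu, seen = 0.0, False
        for m in host.getElementsByTagName('METRIC'):
            if m.getAttribute('NAME') in ('cpu_user', 'cpu_system'):
                cpu += float(m.getAttribute('VAL'))
                seen = True
        if seen:
            total += cpu
            hosts += 1
    return total / hosts if hosts else 0.0

hourly_max = 0.0
hour = time.strftime('%Y-%m-%d %H')
while True:
    cpu = cluster_cpu(fetch_xml())
    hourly_max = max(hourly_max, cpu)
    if cpu > THRESHOLD:
        print(time.strftime('%Y-%m-%d %H:%M:%S'),
              'ALARM: cluster CPU at %.1f%%' % cpu)
    now = time.strftime('%Y-%m-%d %H')
    if now != hour:                        # hour rolled over, record the peak
        print(hour, 'hourly max CPU: %.1f%%' % hourly_max)
        hour, hourly_max = now, 0.0
    time.sleep(POLL)

Persisting those hourly maxima somewhere (RRD, or even a flat file) would give us exactly the long-term peak view that the average-based graphs are hiding.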
Or of course, being more attentive helps too.
Cheers,