Hi Gilles,
Thanks for digging up all these graphs. This is thorough work and truly excellent preparation, kudos!
I agree that we seem to be doing okay so far.
On Fri, May 02, 2014 at 11:38:29AM +0200, Gilles Dubuc wrote:
Are these the right graphs to look at to see if these APIs aren't going nuts and won't take down the servers when we release to bigger wikis?
On a related note, is this the right dashboard for API servers? http://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&m=cpu_repor...
Yes, these are the right graphs, and the Ganglia cluster "API Application servers eqiad" is indeed the one to monitor. From that group, the most interesting metrics would be ap_rps (Apache requests per second) and ap_busy_workers: http://ganglia.wikimedia.org/latest/stacked.php?m=ap_rps&c=API%20applica...
The API is served from the main Varnish clusters ("Text caches eqiad/esams/ulsfo"), so you wouldn't have a separate group to monitor there, and the data will include a lot of noise from other traffic. The frontend.client_req and varnish.client_req metrics would be the ones to monitor there.
Also, considering the nature of the feature and the need for newly generated thumbs (AIUI), we should watch carefully: a) Swift, in particular rps, b) Imagescalers, in particular rps, c) Front/back Upload Varnishes. All of these are in Ganglia's Media Storage view: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&v...
Finally, this falls a bit outside of ops, but it ties closely to the discussion about cached API responses, as it involves the (lack of) CDN for these requests: we should assess the effect that the feature has on frontend metrics, NavigationTiming and such. Gdash has a dashboard with some high-level graphs for that, but I don't think they are going to be very useful. My understanding is that you were also doing some work in this area already, though? I vaguely remember some NavTiming/EventLogging work from the Multimedia team, is this correct?
Thanks, Faidon