Hi Gilles,
Thanks for digging up all these graphs. This is thorough work and truly
excellent preparation, kudos!
I agree that we seem to be doing okay so far.
On Fri, May 02, 2014 at 11:38:29AM +0200, Gilles Dubuc wrote:
> Are these the right graphs to look at to see if these APIs aren't going
> nuts and won't take down the servers when we release to bigger wikis?
> On a related note, is this the right dashboard for API servers?
> http://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&m=cpu_repo…
Yes, these are the right graphs and the Ganglia cluster "API Application
servers eqiad" is the one to monitor indeed. From that group, the most
interesting metrics would be the ap_rps (Apache Requests per Second) and
ap_busy_workers:
http://ganglia.wikimedia.org/latest/stacked.php?m=ap_rps&c=API%20applic…
The API is served from the main Varnish clusters ("Text caches
eqiad/esams/ulsfo"), so there is no separate group to monitor there and
the data will include a lot of noise. The frontend.client_req and
varnish.client_req metrics would be the ones to monitor there.
Also, considering the nature of the feature and the need for newly
generated thumbs (AIUI), we should carefully watch:
a) Swift, in particular rps,
b) Imagescalers, in particular rps,
c) Front/back Upload Varnishes.
All these are at Ganglia's Media Storage view:
https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&…
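To make "watch carefully" a bit more concrete, here is a minimal sketch of
a before/after comparison for metrics such as ap_rps and ap_busy_workers.
Everything in it is an assumption for illustration: the sample values are
made up, the 25% threshold is arbitrary, the function names are
hypothetical, and fetching the data out of Ganglia is not shown.

```python
from statistics import mean


def relative_change(before, after):
    """Return the fractional change of the mean from `before` to `after`."""
    base = mean(before)
    if base == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    return (mean(after) - base) / base


def flag_increases(before, after, threshold=0.25):
    """Return {metric: change} for metrics whose mean rose by more than `threshold`."""
    flagged = {}
    for name in before:
        change = relative_change(before[name], after[name])
        if change > threshold:
            flagged[name] = change
    return flagged


# Made-up samples, purely illustrative (not real Ganglia data).
before = {"ap_rps": [100, 110, 90], "ap_busy_workers": [20, 20, 20]}
after = {"ap_rps": [105, 115, 95], "ap_busy_workers": [30, 30, 30]}

flagged = flag_increases(before, after)
print(flagged)
```

Comparing means over two windows is crude and ignores daily traffic
patterns, so it is only a first filter before looking at the graphs
themselves.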
Finally, this falls a bit outside of ops, but it ties closely to the
discussion about cached API responses, as it involves the (lack of) CDN
for these requests: we should assess the effect that the feature has on
frontend metrics, NavigationTiming and such. Gdash has a dashboard with
some high-level graphs for that, but I don't think they are going to be
very useful. My understanding is that you were also doing some work in
this area already, though? I vaguely remember some
NavTiming/EventLogging work from the Multimedia team, is this correct?
Thanks,
Faidon