Hello!
While looking at the Elasticsearch dashboard on Grafana [1], I noticed
weekly spikes in response times from codfw. My guess is that they are
related to the weekly update of page rank.
More details:
We see fairly large spikes on the overall 95th percentile for codfw
(from a usual ~300 ms to ~1-1.5 s). Those spikes are more visible on
codfw than on eqiad because codfw has less overall traffic, so the
indexing load stands out more relative to reads. So far, no problem:
the graphs look bad, but this can be explained and does not show user
impact.
We also see weekly spikes on the 75th percentile of more-like queries
(from a usual ~200-300 ms to ~300-400 ms). More-like queries are the
only queries sent to codfw. This is not yet worrisome, but it is
probably something we should keep an eye on and improve before it
becomes an issue.
I have mostly no idea how those page rank updates work. Would it be
possible to throttle the index updates from those jobs? Or could we
run them more often, so that each update is smaller and has less
impact?
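To make the throttling idea concrete, here is a minimal Python sketch.
It is not based on how the page rank jobs actually work: the function
name throttled_batches and its parameters are hypothetical, and the
real update job may be structured very differently. It only shows the
general shape of capping how many update batches are sent per second.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def throttled_batches(
    items: Iterable[T],
    batch_size: int,
    max_batches_per_sec: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[List[T]]:
    """Yield fixed-size batches, at most max_batches_per_sec per second.

    Hypothetical helper: the caller would send each yielded batch to the
    search cluster (for example as one bulk update request).
    """
    min_interval = 1.0 / max_batches_per_sec
    last_sent = None  # time at which the previous batch was released

    def wait_for_slot() -> None:
        nonlocal last_sent
        if last_sent is not None:
            remaining = min_interval - (clock() - last_sent)
            if remaining > 0:
                sleep(remaining)
        last_sent = clock()

    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            wait_for_slot()
            yield batch
            batch = []
    if batch:
        # Release the final, possibly smaller, batch.
        wait_for_slot()
        yield batch
```

With something like this, the update job would spread its writes over a
longer period instead of sending them as fast as it can, trading a
longer update window for lower peak load.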
Ideas welcome...
Guillaume
[1] https://grafana-admin.wikimedia.org/dashboard/db/elasticsearch-percentiles
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation