Hello!
While looking at the Elasticsearch dashboard on Grafana [1], I noticed that we have weekly spikes in response times from codfw. My guess is that this is related to the weekly update of page rank.
More details:
We see fairly large spikes on the overall 95%-ile for codfw (from a usual ~300[ms] to ~1-1.5[s]). Those spikes are more visible on codfw than on eqiad because codfw gets less overall traffic, so indexing makes up a larger share of the activity relative to reads. So far, no problem: the graphs look bad, but this can be explained and does not indicate user impact.
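To illustrate why the same indexing load shows up more on a quieter cluster, here is a toy calculation. The request counts are made up for the example, and the two latency values are just the rough figures quoted above, so treat it as a sketch of the reasoning rather than a model of our clusters.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def p95_mixed(read_count, indexing_count, read_ms=300, indexing_ms=1500):
    """95%-ile when fast reads and slow indexing requests share one graph."""
    latencies = [read_ms] * read_count + [indexing_ms] * indexing_count
    return percentile(latencies, 95)


# Hypothetical counts: the same 100 slow indexing requests on both clusters.
busy = p95_mixed(read_count=9900, indexing_count=100)   # indexing is 1% of requests
quiet = p95_mixed(read_count=900, indexing_count=100)   # indexing is 10% of requests

print(busy)   # 300
print(quiet)  # 1500
```

With identical indexing work, the 95%-ile stays at the read latency on the busy cluster and jumps to the indexing latency on the quiet one, even though no individual read got slower.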
We also see weekly spikes on the 75%-ile of more-like queries (from a usual ~200-300[ms] to 300-400[ms]). More-like queries are the only queries sent to codfw. This is not yet worrisome, but is probably something we should keep an eye on and improve before it starts to be an issue.
I have almost no idea how those page rank updates work. Would it be possible to throttle the index updates from those jobs? Or could we increase the frequency of those updates to reduce the impact of each one?
Ideas welcome...
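To make the throttling idea a bit more concrete, here is a rough sketch of what I have in mind. It is not how the page rank jobs actually work (I do not know that): `send_batch` is a hypothetical stand-in for whatever pushes updates to the index, and the batch size and rate cap are arbitrary parameters.

```python
import time


def throttled_update(updates, batch_size, max_docs_per_second, send_batch,
                     sleep=time.sleep, clock=time.monotonic):
    """Send updates in batches, pausing so the average rate stays under the cap.

    send_batch is a hypothetical callable that writes one batch to the index.
    sleep and clock are injectable so the pacing can be checked without waiting.
    """
    min_interval = batch_size / max_docs_per_second  # seconds budgeted per batch
    sent = 0
    for start in range(0, len(updates), batch_size):
        began = clock()
        batch = updates[start:start + batch_size]
        send_batch(batch)
        sent += len(batch)
        elapsed = clock() - began
        if elapsed < min_interval:
            # The batch finished early: wait out the rest of its time budget.
            sleep(min_interval - elapsed)
    return sent
```

For example, `throttled_update(docs, batch_size=500, max_docs_per_second=1000, send_batch=push)` would send at most one batch of 500 every half second, where `docs` and `push` are again placeholders.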
Guillaume
[1] https://grafana-admin.wikimedia.org/dashboard/db/elasticsearch-percentiles