Hello,
Labsdb1011 has recovered, and I have repooled it.
Labsdb1010 is lagging a bit behind, but I am going to repool it with its
normal weight and keep the query killer at 1800 seconds until it fully
recovers from helping labsdb1011.
Manuel.
On Wed, Sep 30, 2020 at 7:27 AM Manuel Arostegui <marostegui(a)wikimedia.org>
wrote:
Hello,
This is a heads up about the current situation with s4 (commons) and
labsdb.
There's been more activity lately on s4, and that has made labsdb1011
(analytics role) start lagging behind.
https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=6&orgId=1&…
I tried reducing its weight a couple of days ago to help it
recover:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/630392
https://gerrit.wikimedia.org/r/c/operations/puppet/+/630531
https://gerrit.wikimedia.org/r/c/operations/puppet/+/630770
The last change has, as somewhat expected, made labsdb1010 lag:
https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=6&orgId=1&…
I am going to decrease the pt-kill query time from 3600 to 1800 seconds to
see if that helps labsdb1010 hold the fort a bit.
There's not much else we can do at the moment, but please keep these
issues in mind if people complain about lag on s4 (commons) on the
analytics role.
The web role is doing fine (labsdb1009 isn't lagging).
Manuel.