Hello, 

Labsdb1011 has recovered, and I have repooled it.
Labsdb1010 is lagging behind a bit, but I am going to repool it with its normal weight and keep the query killer at 1800 seconds until it fully recovers from helping labsdb1011.

Manuel.

On Wed, Sep 30, 2020 at 7:27 AM Manuel Arostegui <marostegui@wikimedia.org> wrote:
Hello, 

This is a heads up about the current situation with s4 (commons) and labsdb.

There's been more activity lately on s4, and that has made labsdb1011 (analytics role) start lagging behind.
https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=6&orgId=1&from=now-7d&to=now&var-server=labsdb1011&var-port=9104

I tried to reduce its weight a couple of days ago to help it recover:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/630392
https://gerrit.wikimedia.org/r/c/operations/puppet/+/630531
https://gerrit.wikimedia.org/r/c/operations/puppet/+/630770

The last change has (as somewhat expected) made labsdb1010 lag:
https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=6&orgId=1&from=now-7d&to=now&var-server=labsdb1010&var-port=9104

I am going to decrease the pt-kill query time from 3600 to 1800 seconds to see if that helps labsdb1010 hold the fort a bit.

There's not much else we can do at the moment, so please keep these issues in mind if people complain about lag on s4 (commons) on the analytics role.
The web role is doing fine (labsdb1009 isn't lagging).

Manuel.