Hello,
Both hosts are back in sync.
Manuel.
On Thu, Oct 1, 2020 at 7:19 AM Manuel Arostegui marostegui@wikimedia.org wrote:
Hello,
Labsdb1011 has recovered and I have repooled it. Labsdb1010 is still lagging a bit, but I am going to repool it with its normal weight and keep the query killer at 1800 seconds until it fully recovers from helping labsdb1011.
Manuel.
On Wed, Sep 30, 2020 at 7:27 AM Manuel Arostegui marostegui@wikimedia.org wrote:
Hello,
This is a heads up about the current situation with s4 (commons) and labsdb.
There's been more activity lately on s4, and that has made labsdb1011 (analytics role) start lagging behind.
https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=6&orgId=1&...
I tried to reduce its weight a couple of days ago to help it recover:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/630392
https://gerrit.wikimedia.org/r/c/operations/puppet/+/630531
https://gerrit.wikimedia.org/r/c/operations/puppet/+/630770
The last change has, as somewhat expected, made labsdb1010 lag:
https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=6&orgId=1&...
I am going to decrease the pt-kill query time from 3600 to 1800 seconds to see if that helps labsdb1010 hold the fort a bit.
There's not much else we can do at the moment, so please keep these issues in mind if people complain about lag on s4 (commons) on the analytics role. The web role is doing fine (labsdb1009 isn't lagging).
Manuel.