Just wanted to confirm that we're seeing these mails, and we appreciate you keeping us in the loop on the changes. This helps us respond to user feedback or concerns.

On Tue, Oct 6, 2020 at 1:41 AM Manuel Arostegui <marostegui@wikimedia.org> wrote:

I have repooled labsdb1011, but I am sure it will lag again, so I am leaving the weights this way:

Web service:
labsdb1009: 2
labsdb1010: 1

Analytics:
labsdb1011: 1
labsdb1010: 1

So labsdb1010 will serve less on web service and will help equally on analytics.
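
To make the effect of these weights concrete, here is a minimal sketch of the arithmetic. It assumes the first pair of weights is the web service pool and the second is the analytics pool, and that traffic is split in proportion to the weights (a common load-balancer behaviour, not confirmed in this thread). The function name `traffic_share` is hypothetical.

```python
def traffic_share(weights):
    """Return each host's expected fraction of traffic in a weighted pool."""
    total = sum(weights.values())
    return {host: weight / total for host, weight in weights.items()}


# Assumption: first pair is the web service pool, second pair is analytics.
web = traffic_share({"labsdb1009": 2, "labsdb1010": 1})
analytics = traffic_share({"labsdb1011": 1, "labsdb1010": 1})

for pool_name, shares in (("web", web), ("analytics", analytics)):
    for host, share in shares.items():
        print(f"{pool_name:9} {host}: {share:.0%}")
```

Under those assumptions, labsdb1010 would take roughly a third of web traffic and half of analytics traffic.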

WMCS, I am not sure if you are receiving any of these emails (as they are sent to your admin list, maybe I am being moderated?), but any thoughts on all this?


On Mon, Oct 5, 2020 at 7:51 AM Manuel Arostegui <marostegui@wikimedia.org> wrote:

labsdb1011 kept lagging behind during the weekend. I have depooled it, and once it is back in sync I will reshuffle the weights again so that labsdb1010 helps more on analytics than on web service.


On Fri, Oct 2, 2020 at 3:16 PM Manuel Arostegui <marostegui@wikimedia.org> wrote:

I have pushed https://gerrit.wikimedia.org/r/c/operations/puppet/+/631768 because labsdb1011 is starting to lag again on s4. There were some heavy queries there; let's see how it goes during the weekend.

On Fri, Oct 2, 2020 at 8:00 AM Manuel Arostegui <marostegui@wikimedia.org> wrote:

Both hosts are back in sync


On Thu, Oct 1, 2020 at 7:19 AM Manuel Arostegui <marostegui@wikimedia.org> wrote:

labsdb1011 has recovered and I have repooled it.
labsdb1010 is lagging a bit behind, but I am going to repool it with its normal weight and keep the query killer at 1800 seconds until it fully recovers from helping labsdb1011.


On Wed, Sep 30, 2020 at 7:27 AM Manuel Arostegui <marostegui@wikimedia.org> wrote:

This is a heads up about the current situation with s4 (commons) and labsdb.

There's been more activity lately on s4, and that has made labsdb1011 (analytics role) start lagging behind.

I tried to reduce its weight a couple of days ago to help it recover:

The last change has (as somewhat expected) made labsdb1010 lag:

I am going to decrease the pt-kill query time from 3600 to 1800 seconds to see if that helps labsdb1010 hold the fort a bit.

There's not much else we can do at the moment, but please keep all these issues in mind if people complain about lag on s4 (commons) on the analytics role.
The web role is doing fine (labsdb1009 isn't lagging).

Cloud-admin mailing list

Nicholas Skaggs
Engineering Manager, Cloud Services
Wikimedia Foundation