New subject: Update 1 labsdb host to buster and 10.4

30 Mar 2020

Thanks Bryan - sorry for not answering faster, but it looks like you only
replied to cloud-admin and I am not there :-)
Today in our 1:1 this subject came up and Jaime forwarded me the mail, as
he is in cloud-admin hehe.
Answers inline!
On Mon, Mar 30, 2020 at 1:02 PM Jaime Crespo jcrespo@wikimedia.org wrote:
...
---------- Forwarded message ---------
From: Bryan Davis bd808@wikimedia.org
Date: Tue, Mar 24, 2020 at 3:56 PM
Subject: Re: [Cloud-admin] Update 1 labsdb host to buster and 10.4
To: Cloud Services administration and infrastructure discussion
cloud-admin@lists.wikimedia.org
On Tue, Mar 24, 2020 at 2:36 AM Manuel Arostegui
marostegui@wikimedia.org wrote:
...
So far we have upgraded normal single-instance hosts and multi-instance
hosts (2 mysqld processes). We still need to upgrade a multisource (labsdb)
host, to make sure 10.4 works fine or to learn what might need work
(mysqld-exporter https://phabricator.wikimedia.org/T247290 or whatever);
it is better to know in advance.
...
10.4 also fixes some bugs that are hitting labsdb hosts specifically:

Grants race condition: https://jira.mariadb.org/browse/MDEV-14732
GTID works on multisource: https://jira.mariadb.org/browse/MDEV-12012

This is one of the early bugs we filed with MariaDB almost 3 years ago, and
it looks like it is now working, even though it requires some work on the
master's side. My last tests are looking good, and if we can enable GTID
on labsdb hosts that would be a BIG improvement towards avoiding corruption
during a crash.
These all sound like good things. And thank you very much, seriously,
for the effort you have been putting into thinking about and caring
for the wiki replicas.
You are welcome! :-)
...
...
So, any objections to reimaging labsdb1011 with Buster and 10.4? (/srv won't
be formatted, so we don't have to rebuild that host.)
Any idea what the rollback plan would look like if it turns out that
10.4 and multisource do not work well together? Would
it be less risky to do labsdb1012 first and see how it works there?
The rollback plan is basically: reimage back to Stretch and reclone from
labsdb1012.
The idea to use labsdb1011 is to actually test 10.4 in this very unique
environment (lots of heavy queries).
labsdb1012 is barely used, and only for a few days each month, so it
wouldn't be representative. Also, the rollback is easier if we can use
labsdb1012 as the source, as it is normally fine to stop it for 24h (as long
as it is not during the few days it is used at the start of each month), so
no user impact there.
Stopping a wiki replica, on the other hand, means we have to put more
pressure on the other 2 hosts for the time it is stopped.
Does this make sense and answer your question?
Thank you!
Manuel.