Upcoming: Delay in new wiki replicas on s5


Manuel Arostegui

16 Nov 2017

12:48 a.m.

Hello Cloud Admins!

As part of https://phabricator.wikimedia.org/T174569 we have to alter some big tables. One of them is logging, which on wikidata, for instance, takes around 8h to alter. That is the shard I am currently working on.

Because of the nature of the change (some columns being added) and ROW-based replication (which is what we use in sanitariums), this change needs to go through replication (from sanitarium, or their masters, to the labs servers).

This will obviously generate lag, but if it is not done that way it will break replication until the column is added on the labs hosts, which is less desirable than replication lag.

I am planning to run the alter on the sanitarium host in s5, probably tomorrow or Monday (I will notify you when I start it). That means there will be a few hours of lag on the labs servers for the s5 instance. It will also affect s1 and s3, because we are using the same replication thread for those shards too, which is a FIXME we have pending.

s2, s4, s6 and s7 will remain unaffected as they have their own replication thread.
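
To make the shard grouping above concrete, here is a minimal sketch of how one might look up which shards will lag when a given shard is altered. The grouping (s1, s3 and s5 sharing one thread; s2, s4, s6 and s7 each on their own) comes from this message; the dictionary, the thread labels and the function name are hypothetical and do not reflect any real configuration.

```python
# Shard grouping as described in the message; the thread labels are
# hypothetical names chosen for this sketch, not real configuration.
REPLICATION_THREADS = {
    "shared": {"s1", "s3", "s5"},
    "s2": {"s2"},
    "s4": {"s4"},
    "s6": {"s6"},
    "s7": {"s7"},
}


def lagging_shards(altered_shard):
    """Return every shard that shares a replication thread with altered_shard."""
    for shards in REPLICATION_THREADS.values():
        if altered_shard in shards:
            return sorted(shards)
    raise KeyError(f"unknown shard: {altered_shard}")


print(lagging_shards("s5"))  # ['s1', 's3', 's5']
print(lagging_shards("s4"))  # ['s4']
```

So an alter on s5 is expected to delay s1 and s3 as well, while an alter on s4 only delays s4.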

Should you have any questions, let me know!

Thanks,
Manuel.


Bryan Davis

16 Nov

1:39 a.m.

Should we send a message to cloud-announce about this, or just be ready to tell people that the lag is a known issue due to production schema changes?

Bryan

--
Bryan Davis
Wikimedia Foundation
bd808@wikimedia.org
[[m:User:BDavis_(WMF)]]
Manager, Cloud Services
Boise, ID USA
irc: bd808
v:415.839.6885 x6855

Manuel Arostegui

1:45 a.m.

I don't think it is necessary to send an announcement about it; it is just maintenance. I would suggest you just point people to that task so they can see when the other shards will be done too :-)

Manuel.

Manuel Arostegui

29 Nov

8:53 p.m.

Hey Cloud Team,

I am now running this schema change on s3, for all the wikis (around 900). I have throttled it a bit, and it has been running for an hour without any significant delay on the new replicas. labsdb1003 is delayed a bit, but it usually is lately, so I don't think it is related to this change. This should take another 15h or so to finish completely.

Cheers,
Manuel.
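
The message does not say how the run was throttled. As an illustration only, here is a minimal sketch of one common approach: alter one wiki at a time and pause while replica lag is above a threshold. Every name here (`run_throttled`, `alter_wiki`, `current_lag`, the 60-second threshold and the 5-second pause) is an assumption made for this sketch, not the tooling actually used.

```python
import time


def run_throttled(wikis, alter_wiki, current_lag,
                  max_lag=60.0, pause=5.0, sleep=time.sleep):
    """Alter one wiki at a time, waiting while replica lag exceeds max_lag.

    alter_wiki(wiki) and current_lag() are caller-supplied callables
    (hypothetical here); current_lag() returns the lag in seconds.
    """
    done = []
    for wiki in wikis:
        # Hold off until the replicas have caught up enough.
        while current_lag() > max_lag:
            sleep(pause)
        alter_wiki(wiki)
        done.append(wiki)
    return done
```

With around 900 wikis, a loop of this shape trades total run time for bounded lag: the change takes longer overall, but replicas are never pushed far behind by a burst of consecutive alters.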

Manuel Arostegui

7 Dec

3:38 p.m.

Hello,

I will be running this schema change on s2 on Monday. Expect some replication delay for s2 on the replicas.

Manuel.

Manuel Arostegui

13 Dec

12:03 a.m.

Hello!

It is time for s4. I will be doing it tomorrow on the sanitarium master. There will be around 3h of delay, as the logging table is quite big and takes around 2-3h to ALTER.

Manuel.

Manuel Arostegui

18 Dec

11:54 p.m.

Hello again!

I will be altering s1 tomorrow, early in the European morning. Expect some delay on labs!

Manuel.

Manuel Arostegui

3 Jan

9:49 p.m.

Happy new year!

Tomorrow I will deploy this change on s7, so expect some delay there.

Thanks,
Manuel.

Manuel Arostegui

10 Jan

8:23 p.m.

Hello,

I will alter s5 tomorrow. Expect some delay there.

Manuel Arostegui

17 Jan

3:27 p.m.

Hello,

The last shard, s8, will be altered tomorrow, so expect several hours of lag, as the wikidatawiki.logging table is quite big.


cloud-admin@lists.wikimedia.org
