I think we should split EventLogging off from the other m2 clients (OTRS and some minor players), for several reasons:
- Backfilling causes replication lag. Using faster out-of-band replication for EL is easy because it is all simple bulk-INSERT statements, but the same does not apply to the other clients; they need different approaches. (A rough sketch of the EL backfill follows this list.)
- Master disk space. Even with the data purging discussed at the MW Summit, I would feel better if EL had more headroom than it does currently, and zero possibility of unexpected spikes in disk activity and usage affecting other services.
- EL is the service most sensitive to connection dropouts. Ori and Nuria have recently been tweaking SQLAlchemy, but future connection problems like those seen last week would be easier to debug without the risk of affecting other services.
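To make the backfilling point concrete, here is roughly what an out-of-band EL backfill amounts to. This is a sketch only: the schema name, columns, credentials and batch size are placeholders, not the real tooling.

    # Sketch of a batched bulk-INSERT backfill; table name, columns and
    # credentials are made up for illustration.
    import pymysql

    def backfill(rows, batch_size=1000):
        conn = pymysql.connect(host="m4-master.eqiad.wmnet",
                               user="eventlog", password="...", db="log")
        sql = ("INSERT INTO SomeSchema_1234567 (uuid, timestamp, event) "
               "VALUES (%s, %s, %s)")
        try:
            with conn.cursor() as cur:
                for i in range(0, len(rows), batch_size):
                    cur.executemany(sql, rows[i:i + batch_size])
                    conn.commit()  # commit per batch keeps individual transactions small
        finally:
            conn.close()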
I am therefore arranging to promote the current m2 slave db1046 to master of an m4 cluster tuned for EL, including backfilling. Analytics-store, s1-analytics-slave, and the new CODFW server will simply switch to replicate from the new master.
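For anyone curious, the repoint on each replica is essentially a STOP SLAVE / CHANGE MASTER TO / START SLAVE against the new master. A rough sketch follows; the replica hostname, binlog coordinates, replication user and password are placeholders, and the real procedure involves more safety checks.

    # Illustrative only: repoint one replica at the new m4 master.
    # The hostname, coordinates and credentials below are placeholders.
    import pymysql

    replica = pymysql.connect(host="analytics-store.eqiad.wmnet",
                              user="root", password="...")
    with replica.cursor() as cur:
        cur.execute("STOP SLAVE")
        cur.execute(
            "CHANGE MASTER TO MASTER_HOST='m4-master.eqiad.wmnet', "
            "MASTER_USER='repl', MASTER_PASSWORD='...', "
            "MASTER_LOG_FILE='db1046-bin.000001', MASTER_LOG_POS=4"
        )
        cur.execute("START SLAVE")
    replica.close()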
For switchover of writes, we'll need to coordinate an EL consumer restart to use a new CNAME of m4-master.eqiad.wmnet and allow vanadium the relevant network access, and then presumably do a little backfilling. When would be a reasonable time within the next fortnight or so?
Sean
For switchover of writes, we'll need to coordinate an EL consumer restart
to use a new CNAME of m4-master.eqiad.wmnet
This is a configuration change in the EL config plus a small downtime and a restart (easy). I am not sure how users/passwords are set up in the config, so cc-ing Otto to keep him in the loop.
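Roughly, the consumer-side change should just be rebuilding the SQLAlchemy engine against the new CNAME; something like the following, where the URL shape, credentials and pool settings are assumptions rather than the actual EL config.

    # Hypothetical sketch of the consumer DSN change; the real EL config
    # may store this differently.
    from sqlalchemy import create_engine

    engine = create_engine(
        "mysql://eventlog:PASSWORD@m4-master.eqiad.wmnet/log",
        pool_recycle=3600,  # recycle idle connections before the server drops
                            # them; the sort of knob relevant to the recent
                            # connection dropouts
    )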
allow vanadium the relevant network access, and then presumably do a
little backfilling.
Vanadium network access is something that I imagine ops needs to handle, as I doubt we will have permission to do a network change.
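Once that access is granted, a quick sanity check from vanadium could be as simple as the following, assuming the standard MySQL port (3306).

    # Trivial reachability check from vanadium to the new master; assumes
    # the default MySQL port. Purely illustrative.
    import socket

    sock = socket.create_connection(("m4-master.eqiad.wmnet", 3306), timeout=5)
    print("m4-master reachable from", sock.getsockname()[0])
    sock.close()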
When would be a reasonable time within the next fortnight or so?
I think next week would work, once backfilling for the past outages is over, if that works for you.
Thanks,
Nuria
It seems to me that we should do the vanadium hardware upgrade at the same time, if we're going to have downtime anyway. Can we bring the new beefier box online? Last time I talked to Ori, I understood such a box had already been set aside for this by ops.
On Wed, Feb 18, 2015 at 7:43 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
It seems to me that we should do the Vanadium hardware upgrade at the same time, if we're going to have down time anyway. Can we bring the new beefier box online? Last time I talked to Ori I understood such a box was already set aside for this by ops.
Oooh, I didn't (and don't) know about this. Fair enough; one chunk of scheduled downtime is fine with me, though I don't know how long it will need to be if we include vanadium. If we do only the database split, it will be five minutes.
Looks like Ori is working on replacing vanadium. There is a ticket created to that effect: https://phabricator.wikimedia.org/T90363
However, I am not sure the best way to proceed is to change two things at the same time (the db and vanadium). But since that is purely an ops task, you guys know what's best.
Sorry - this is my bad for not tying these threads together.
I saw that Dan suggested we replace vanadium at the same time we move the master. I've been concerned about EL capacity for a while now and it seemed like a good chance to take some downtime and fix both issues.
At the very least we should coordinate the process. What do we need in order to do this?
Nuria,
Toby asked me on Friday if I could help, and I gave a weak "ummmmm...... sure". I haven't begun working on it or anything. Announcing it to the list is Toby's way of making sure I don't back out. It's why he gets the big Director salary. :) Let's coordinate on Monday?
Best, Ori
Coordination on Monday sounds good.
On Sun, Feb 22, 2015 at 1:20 PM, Nuria Ruiz nuria@wikimedia.org wrote:
Coordination on Monday sounds good.
Did you guys come to any conclusion about vanadium?
On Thu, 2015-02-26 at 11:25 +1000, Sean Pringle wrote:
Did you guys come to any conclusion about vanadium?
Could someone answer springle's question? Or has it already been answered in another communication channel?
andre
Did you guys come to any conclusion about vanadium?
Sorry about missing this. Ori has requested two EL hosts; those were granted two weeks ago, and now it's in our court to replace vanadium with one of them.
The replacement of the EL DB master box has already taken place.
Ticket for box upgrade is here: https://phabricator.wikimedia.org/T90363
On Tue, Feb 17, 2015 at 2:59 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Vanadium network access is something that I imagine ops needs to handle, as I doubt we will have permission to do a network change.
Yep.
When would be a reasonable time within the next fortnight or so?
I think next week would work, once backfilling for the past outages is over, if that works for you.
Sounds good. We'll wait for backfilling to complete.