I think we should split EventLogging off from the other m2 clients (OTRS and some minor players), for several reasons:
- Backfilling causes replication lag. Using faster out-of-band replication for EL is easy because it is all simple bulk-INSERT statements, but the same does not apply to the other clients; they need different approaches. (A rough sketch of the EL backfill follows this list.)
- Master disk space. Even with the data purging discussed at the MW Summit, I would feel better if EL had more headroom than it does currently, and zero possibility of unexpected spikes in disk activity and usage affecting other services.
- EL is the service most sensitive to connection dropouts. Ori and Nuria have recently been tweaking SQLAlchemy, but future connection problems like those seen last week would be easier to debug without the risk of affecting other services.
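To make the backfilling point concrete, here is roughly what an out-of-band EL backfill amounts to. This is a sketch only: the schema name, columns, credentials and batch size are placeholders, not the real tooling.

    # Sketch of a batched bulk-INSERT backfill; table name, columns and
    # credentials are made up for illustration.
    import pymysql

    def backfill(rows, batch_size=1000):
        conn = pymysql.connect(host="m4-master.eqiad.wmnet",
                               user="eventlog", password="...", db="log")
        sql = ("INSERT INTO SomeSchema_1234567 (uuid, timestamp, event) "
               "VALUES (%s, %s, %s)")
        try:
            with conn.cursor() as cur:
                for i in range(0, len(rows), batch_size):
                    cur.executemany(sql, rows[i:i + batch_size])
                    conn.commit()  # commit per batch keeps individual transactions small
        finally:
            conn.close()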
I am therefore arranging to promote the current m2 slave db1046 to master of an m4 cluster tuned for EL, including backfilling. Analytics-store, s1-analytics-slave, and the new CODFW server will simply switch to replicate from the new master.
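For anyone curious, the repoint on each replica is essentially a STOP SLAVE / CHANGE MASTER TO / START SLAVE against the new master. A rough sketch follows; the replica hostname, binlog coordinates, replication user and password are placeholders, and the real procedure involves more safety checks.

    # Illustrative only: repoint one replica at the new m4 master.
    # The hostname, coordinates and credentials below are placeholders.
    import pymysql

    replica = pymysql.connect(host="analytics-store.eqiad.wmnet",
                              user="root", password="...")
    with replica.cursor() as cur:
        cur.execute("STOP SLAVE")
        cur.execute(
            "CHANGE MASTER TO MASTER_HOST='m4-master.eqiad.wmnet', "
            "MASTER_USER='repl', MASTER_PASSWORD='...', "
            "MASTER_LOG_FILE='db1046-bin.000001', MASTER_LOG_POS=4"
        )
        cur.execute("START SLAVE")
    replica.close()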
For switchover of writes, we'll need to coordinate an EL consumer restart to use a new CNAME of m4-master.eqiad.wmnet and allow vanadium the relevant network access, and then presumably do a little backfilling. When would be a reasonable time within the next fortnight or so?
Sean
For switchover of writes, we'll need to coordinate an EL consumer restart
to use a new CNAME of m4-master.eqiad.wmnet
This is a configuration change in the EL config plus a small downtime and a restart (easy). I am not sure how users/passwords are set up in the config, so cc-ing Otto to keep him in the loop.
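Roughly, the consumer-side change should just be rebuilding the SQLAlchemy engine against the new CNAME; something like the following, where the URL shape, credentials and pool settings are assumptions rather than the actual EL config.

    # Hypothetical sketch of the consumer DSN change; the real EL config
    # may store this differently.
    from sqlalchemy import create_engine

    engine = create_engine(
        "mysql://eventlog:PASSWORD@m4-master.eqiad.wmnet/log",
        pool_recycle=3600,  # recycle idle connections before the server drops
                            # them; the sort of knob relevant to the recent
                            # connection dropouts
    )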
allow vanadium the relevant network access, and then presumably do a
little backfilling.
Vanadium network access is something that I imagine ops needs to handle, as I doubt we will have permission to do a network change.
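Once that access is granted, a quick sanity check from vanadium could be as simple as the following, assuming the standard MySQL port (3306).

    # Trivial reachability check from vanadium to the new master; assumes
    # the default MySQL port. Purely illustrative.
    import socket

    sock = socket.create_connection(("m4-master.eqiad.wmnet", 3306), timeout=5)
    print("m4-master reachable from", sock.getsockname()[0])
    sock.close()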
When would be a reasonable time within the next fortnight or so?
I think next week would work, once backfilling for the past outages is over, if that works for you.
Thanks,
Nuria
It seems to me that we should do the vanadium hardware upgrade at the same time, if we're going to have downtime anyway. Can we bring the new beefier box online? Last time I talked to Ori, I understood such a box had already been set aside for this by ops.
On Wed, Feb 18, 2015 at 7:43 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
It seems to me that we should do the Vanadium hardware upgrade at the same time, if we're going to have down time anyway. Can we bring the new beefier box online? Last time I talked to Ori I understood such a box was already set aside for this by ops.
Oooh, I didn't (and don't) know about this. Fair enough; one chunk of scheduled downtime is fine with me, though I don't know how long it will need to be if we include vanadium. If we do only the database split, it will be five minutes.
Looks like Ori is working on replacing vanadium. There is a ticket created to that effect: https://phabricator.wikimedia.org/T90363
However, I am not sure the best way to proceed is to change two things at the same time (the db and vanadium). But since that is purely an ops task, you guys know what's best.
Sorry - this is my bad for not tying these threads together.
I saw that Dan suggested we replace vanadium at the same time we move the master. I've been concerned about EL capacity for a while now and it seemed like a good chance to take some downtime and fix both issues.
At the very least we should coordinate the process. What do we need in order to do this?
Nuria,
Toby asked me on Friday if I could help, and I gave a weak "ummmmm...... sure". I haven't begun working on it or anything. Announcing it to the list is Toby's way of making sure I don't back out. It's why he gets the big Director salary. :) Let's coordinate on Monday?
Best, Ori
Coordination on Monday sounds good.
On Sun, Feb 22, 2015 at 1:20 PM, Nuria Ruiz nuria@wikimedia.org wrote:
Coordination on Monday sounds good.
Did you guys come to any conclusion about vanadium?
On Thu, 2015-02-26 at 11:25 +1000, Sean Pringle wrote:
Did you guys come to any conclusion about vanadium?
Could someone answer springle's question? Or has it already been answered in another communication channel?
andre
Did you guys come to any conclusion about vanadium?
Sorry about missing this. Ori has requested two EL hosts; those were granted two weeks ago, and now it's in our court to replace vanadium with one of them.
The replacement of the EL DB master box has already taken place.
Ticket for box upgrade is here: https://phabricator.wikimedia.org/T90363
On Tue, Feb 17, 2015 at 2:59 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Vanadium network access is something that I imagine ops needs to handle, as I doubt we will have permission to do a network change.
Yep.
When would be a reasonable time within the next fortnight or so?
I think next week would work, once backfilling for the past outages is over, if that works for you.
Sounds good. We'll wait for backfilling to complete.