On Mon, Feb 16, 2015 at 11:59 AM, Nuria Ruiz <nuria@wikimedia.org> wrote:

>For switchover of writes, we'll need to coordinate an EL consumer restart to use a new CNAME of m4-master.eqiad.wmnet
This is a configuration change to the EL config plus a short downtime and a restart (easy). I am not sure how users/passwords are set up in the config, so cc-ing Otto to keep him in the loop.
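
To make the scope of that change concrete, here is a minimal sketch, assuming the consumer is pointed at the database through a SQLAlchemy-style URL. That is an assumption; the actual EL config layout is not shown in this thread. The helper name `with_new_host`, the old hostname, and the credentials are all hypothetical; only `m4-master.eqiad.wmnet` comes from the proposal.

```python
from urllib.parse import urlsplit, urlunsplit


def with_new_host(db_url, new_host):
    """Return db_url with only the hostname swapped; credentials, port,
    database name and query options are preserved.

    Hypothetical helper for illustration; does not handle IPv6 literals.
    """
    parts = urlsplit(db_url)
    # netloc looks like "user:password@host:port"; split off the userinfo first.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    # Separate the old host from an optional ":port" suffix.
    _old_host, colon, port = hostport.partition(":")
    netloc = f"{userinfo}{at}{new_host}{colon}{port}"
    return urlunsplit(parts._replace(netloc=netloc))


# Placeholder URL: the old host and credentials are made up.
old_url = "mysql://user:secret@old-master.example:3306/log?charset=utf8"
new_url = with_new_host(old_url, "m4-master.eqiad.wmnet")
```

If the config looks anything like this, the switchover on the EL side is a one-line edit followed by a consumer restart; the user/password part stays as it is unless the new master has different grants.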

>allow vanadium the relevant network access, and then presumably do a little backfilling.
Vanadium network access is something that I imagine ops needs to handle, as I doubt we have the permissions to make a network change.

>When would be a reasonable time within the next fortnight or so?
I think next week would work, once backfilling for the past outages is over, if that works for you.

Thanks,

Nuria

On Sun, Feb 15, 2015 at 8:07 PM, Sean Pringle <springle@wikimedia.org> wrote:
I think we should split up Eventlogging and the other m2 clients (OTRS and some minor players). Several reasons:

- Backfilling causes replication lag. Using faster out-of-band replication for EL is easy because it is all simple bulk-INSERT statements, but the same does not apply for the other clients. They need different approaches.

- Master disk space. Even with the data purging discussed at the MW Summit, I would feel better if EL had more headroom than it does currently, and zero possibility of unexpected spikes in disk activity and usage affecting other services.

- EL is the service most sensitive to connection dropouts. Recently Ori and Nuria have been tweaking SQLAlchemy, but future connection problems like those seen last week would be easier to debug without the risk of affecting other services.

I am therefore arranging to promote the current m2 slave db1046 to master of an m4 cluster tuned for EL, including backfilling. Analytics-store, s1-analytics-slave, and the new CODFW server will simply switch to replicate from the new master.

For switchover of writes, we'll need to coordinate an EL consumer restart to use a new CNAME of m4-master.eqiad.wmnet and allow vanadium the relevant network access, and then presumably do a little backfilling. When would be a reasonable time within the next fortnight or so?

Sean

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
