I think we should split up Eventlogging and the other m2 clients (OTRS and
some minor players). Several reasons:
- Backfilling causes replication lag. Using faster out-of-band replication
for EL is easy because it is all simple bulk-INSERT statements, but the
same does not apply to the other clients. They need different approaches.
- Master disk space. Even with the data purging discussed at the MW Summit,
I would feel better if EL had more headroom than it does currently, and
zero possibility of unexpected spikes in disk activity and usage affecting
other services.
- EL is the service most sensitive to connection dropouts. Recently Ori and
Nuria have been tweaking SQLAlchemy, but future connection problems like
those seen last week would be easier to debug without having to risk
affecting other services.
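To illustrate the first point, here is a minimal sketch (not the actual
tooling) of why append-only, bulk-INSERT data is easy to copy out of band:
find the highest key the target already has, then copy newer rows across in
batches. The function name, the table and column names, and the use of
sqlite3 as a stand-in for the real database servers are all assumptions made
for illustration only.

```python
import sqlite3


def copy_new_rows(src, dst, table, columns, key="id", batch_size=1000):
    """Copy rows whose key exceeds the target's current maximum key.

    Hypothetical helper: assumes an append-only table with a monotonically
    increasing integer key, and that `key` is one of `columns`.
    Returns the number of rows copied.
    """
    cols = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    key_pos = columns.index(key)

    # Resume from wherever the target left off (0 if the table is empty).
    last = dst.execute(
        f"SELECT COALESCE(MAX({key}), 0) FROM {table}"
    ).fetchone()[0]

    copied = 0
    while True:
        rows = src.execute(
            f"SELECT {cols} FROM {table} WHERE {key} > ? ORDER BY {key} LIMIT ?",
            (last, batch_size),
        ).fetchall()
        if not rows:
            return copied
        # One batch per transaction keeps each write on the target short.
        dst.executemany(
            f"INSERT INTO {table} ({cols}) VALUES ({marks})", rows
        )
        dst.commit()
        last = rows[-1][key_pos]
        copied += len(rows)
```

This only works because the rows are never updated after insertion. Clients
that issue UPDATEs and DELETEs cannot be copied this way, which is why they
need a different approach.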
I am therefore arranging to promote the current m2 slave db1046 to master
of an m4 cluster tuned for EL, including backfilling. Analytics-store,
s1-analytics-slave, and the new CODFW server will simply switch to
replicating from the new master.
For switchover of writes, we'll need to coordinate an EL consumer restart
to use a new CNAME of m4-master.eqiad.wmnet and allow vanadium the relevant
network access, and then presumably do a little backfilling. When would be
a reasonable time within the next fortnight or so?
Sean