[Updating only the Analytics list]
I forgot to update this email thread last week. The Event Logging master
database switch went fine but as reported the maintenance window affected
the Eventlogging schema graphs in the Eventlogging Schema dashboard. For
example, this is what the Popups schema looked like:
The gaps are not related to data loss in MySQL or data inconsistency,
because those graphs only show Kafka throughput metrics. A quick refresher
on how the events flow:
Browser --> Varnish cache layer (text/upload) --> Varnishkafka (running on
the caching hosts) --> Kafka cluster <---> Eventlogging ---> MySQL
databases (Eventlogging Master)
I completely stopped the Eventlogging service while switching the master
database, so its Kafka consumer metrics reflected this by dropping to
zero (and spiking when EL was started again). The event timestamps
are set by Varnishkafka, so this action did not affect the final data.
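To illustrate why the graphs show a gap without any data being lost, here
is a toy Python sketch. It is not the actual Eventlogging code: the
function and variable names are made up, and it assumes that Kafka keeps
the unread events while the consumer is stopped and that the consumer
resumes from where it left off.

```python
from collections import deque


def simulate(n_ticks, outage):
    """Toy model: one event is produced per tick; the consumer is
    stopped during the ticks listed in `outage` (hypothetical names)."""
    log = deque()            # stands in for the Kafka topic (keeps unread events)
    stored = []              # stands in for rows written to the master database
    consumed_per_tick = []   # stands in for the consumer throughput graph
    for tick in range(n_ticks):
        # The timestamp is assigned upstream, before the event reaches Kafka.
        log.append({"id": tick, "timestamp": tick})
        consumed = 0
        if tick not in outage:
            # Consumer is running: drain everything that is waiting.
            while log:
                stored.append(log.popleft())
                consumed += 1
        consumed_per_tick.append(consumed)
    return stored, consumed_per_tick


stored, rate = simulate(8, outage=range(3, 6))
print(rate)                               # throughput: zero during the stop, then a spike
print([e["timestamp"] for e in stored])   # every event is stored with its original timestamp
```

The throughput list drops to zero during the stop and spikes afterwards,
while the stored events are complete and keep their original timestamps.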
This maintenance raised a few questions in
, apologies for the
trouble and the time wasted :(
Good news is that the master database was switched without any data loss
and we are now using a more powerful host!
2017-11-14 18:59 GMT+01:00 Luca Toscano <ltoscano(a)wikimedia.org>:
The Analytics team needs to do the following maintenance operations:
1) migrate the Event-Logging master db ('log', currently on db1046) to the
new host db1107 (T156844). This should happen on *Wed Nov 15th (EU
morning)*, and it should be transparent to all the Event Logging users.
The only drawback that might be observed is a delay in getting the latest
records on the analytics db replicas (db1108, db1047, dbstore1002).
2) Reboot thorium and all the stat boxes for Linux kernel updates.
- Thorium hosts all the analytics websites like pivot.wikimedia.org
, etc., and will be rebooted
on *Wed Nov 15th (EU morning)*; the websites' downtime should be minimal
(in the range of minutes).
- stat boxes (stat1004, stat1005, stat1006) are usually running a lot of
screen/tmux sessions with various data crunching activities, so I'll try to
follow up with all the users currently running something on them to verify
if I can proceed or not. I'd tentatively schedule the reboots on *Thu Nov
16th (EU morning)*, but please follow up with me asap if this needs to be
Thanks in advance and sorry for the trouble!
Luca (on behalf of the Analytics team)