[Updating only the Analytics list]

Hi everybody,

I forgot to update this email thread last week. The Event Logging master database switch went fine but, as reported, the maintenance window affected the Eventlogging schema graphs in the Eventlogging Schema dashboard. For example, this is what the Popups schema graph looked like:

https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&from=1510704000000&to=1510790399000&var-schema=Popups

The gaps are not related to data loss in MySQL or to data inconsistency, because those graphs only show Kafka throughput metrics. A quick refresher on how the events flow:

Browser --> Varnish cache layer (text/upload) --> Varnishkafka (running on the caching hosts) --> Kafka cluster <--> Eventlogging --> MySQL databases (Eventlogging Master)

I completely stopped the Eventlogging service while switching the master database, so its Kafka consumer metrics reflected this: they dropped to zero and then spiked when EL was restarted and caught up. The event timestamps are set by Varnishkafka, so this action did not affect the final data quality.
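
To make the gap-then-spike pattern concrete, here is a toy Python sketch. It is not Eventlogging code: the function name, the tick-based model, and the numbers are all made up for illustration. It assumes a producer that stamps each event when it is produced (as Varnishkafka does) and a consumer that is paused for a few ticks and then drains the backlog.

```python
from collections import Counter, deque


def simulate(produced_per_tick, paused_ticks, max_per_tick):
    """Toy model of a paused consumer (hypothetical, for illustration only).

    Returns (consumed_per_tick, stored_events). Each stored event is a
    (production_tick, sequence_number) pair, i.e. it keeps the timestamp
    assigned by the producer, not the time it was consumed.
    """
    backlog = deque()          # stands in for the Kafka topic
    consumed_per_tick = []     # what a consumer throughput graph would show
    stored = []                # stands in for the rows written to the database

    for tick, count in enumerate(produced_per_tick):
        # The producer keeps writing, stamping events at production time.
        for seq in range(count):
            backlog.append((tick, seq))

        if tick in paused_ticks:
            # Consumer stopped: throughput is zero, events wait in the backlog.
            consumed_per_tick.append(0)
            continue

        # Consumer running: drain up to max_per_tick events.
        batch = 0
        while backlog and batch < max_per_tick:
            stored.append(backlog.popleft())
            batch += 1
        consumed_per_tick.append(batch)

    return consumed_per_tick, stored


throughput, rows = simulate([10] * 8, paused_ticks={2, 3, 4}, max_per_tick=25)
print(throughput)  # [10, 10, 0, 0, 0, 25, 25, 10]
print(Counter(tick for tick, _ in rows))  # 10 events for every production tick
```

The throughput series shows the hole and the catch-up spike, while the stored rows still contain every event with its original production tick, which is why the graphs looked broken even though nothing was lost.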

This maintenance raised a few questions in https://phabricator.wikimedia.org/T179914#3764603. Apologies for the trouble and the time wasted :(

The good news is that the master database was switched without any data loss and we are now using a more powerful host!

Thanks! 

Luca

2017-11-14 18:59 GMT+01:00 Luca Toscano <ltoscano@wikimedia.org>:
Hi everybody,

The Analytics team needs to do the following maintenance operations:

1) Migrate the Event-Logging master db ('log', currently on db1046) to the new host db1107 (T156844). This should happen on Wed Nov 15th (EU morning), and it should be transparent to all the Event Logging users. The only drawback that might be observed is a delay in getting the latest records on the analytics db replicas (db1108, db1047, dbstore1002).

2) Reboot thorium and all the stat boxes for Linux kernel updates. 

- Thorium hosts all the analytics websites like pivot.wikimedia.org, yarn.wikimedia.org, analytics.wikimedia.org, etc., and will be rebooted on Wed Nov 15th (EU morning). The websites' downtime should be minimal (a matter of minutes).
- stat boxes (stat1004, stat1005, stat1006) are usually running a lot of screen/tmux sessions with various data crunching activities, so I'll try to follow up with all the users currently running something on them to verify whether I can proceed. I'd tentatively schedule the reboots for Thu Nov 16th (EU morning), but please follow up with me asap if this needs to be postponed.

Thanks in advance and sorry for the trouble!

Luca (on behalf of the Analytics team)