Hello,
This is just a quick message to let you know that we made some
changes today to the monitoring configuration of many of the Data
Platform Engineering servers. This may affect you if you participate
in Ops Week for Data Engineering and friends.
By default, all alert notifications from Icinga and Prometheus will now go to data-platform-alerts@wikimedia.org instead of data-engineering-alerts@lists.wikimedia.org.
We are working to route alert emails (and IRC pings) to the most appropriate team, principally so that we don't overload the person on Ops Week with messages that would be better handled by the Data Platform SREs.
Alerts for any scheduled tasks related to data pipelines and services critical for data processing will still be sent to the data-engineering-alerts@lists.wikimedia.org list. That covers Airflow jobs, Refine tasks, Gobblin, Sqoop, Varnishkafka, Eventlogging, etc.
We haven't made any changes to the monitoring/notification
settings of the Search and Query Services servers
(Elasticsearch/WDQS/WCQS, etc.), nor have we made any changes to the
Dumps servers. The change mainly affects the analytics systems and
the rest of the Data Engineering team's infrastructure.
Please do let us know if you have any queries or concerns about
this change, or if anything doesn't look right to you.
You can reach us on Slack at #data-engineering-collab or
#data-platform-sre, on IRC at #wikimedia-analytics or
#wikimedia-data-platform, or by email at
data-platform-engineering@wikimedia.org.
Kind regards,
Ben
Ben Tullis (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation