Hello,
This is just a quick message to let you know that we made some changes today to the monitoring configuration of many of the Data Platform Engineering servers. This may affect you if you participate in Ops Week https://wikitech.wikimedia.org/wiki/Data_Engineering/Ops_week for Data Engineering and friends.
By default, all notification alerts from Icinga and Prometheus will now go to data-platform-alerts@wikimedia.org https://groups.google.com/a/wikimedia.org/g/data-platform-alerts instead of data-engineering-alerts@lists.wikimedia.org https://lists.wikimedia.org/hyperkitty/list/data-engineering-alerts@lists.wikimedia.org/
We are working to make sure that alert emails (and IRC pings) are routed to the most appropriate team, principally so that we don't overload the person who is on Ops Week with messages that would be better routed to Data Platform SREs.
Alerts for any scheduled tasks related to data pipelines and services critical for data processing will still be sent to the data-engineering-alerts@lists.wikimedia.org https://lists.wikimedia.org/hyperkitty/list/data-engineering-alerts@lists.wikimedia.org/ list. That covers Airflow jobs, Refine tasks, Gobblin, Sqoop, Varnishkafka, Eventlogging, etc.
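As a rough illustration of the routing described above, here is a minimal Python sketch. It is not the actual Icinga or Prometheus configuration: the function name `alert_recipient` and the lowercase source labels are hypothetical, chosen only to mirror the two lists and the pipeline tools named in this message.

```python
# Illustrative only: the real routing lives in the monitoring configuration,
# not in Python. All names below are assumptions made for this sketch.
DATA_PLATFORM_ALERTS = "data-platform-alerts@wikimedia.org"
DATA_ENGINEERING_ALERTS = "data-engineering-alerts@lists.wikimedia.org"

# Hypothetical labels for the pipeline-related sources named in this message.
PIPELINE_SOURCES = {
    "airflow",
    "refine",
    "gobblin",
    "sqoop",
    "varnishkafka",
    "eventlogging",
}


def alert_recipient(source: str) -> str:
    """Return the list that would receive an alert from the given source."""
    if source.strip().lower() in PIPELINE_SOURCES:
        # Data pipeline alerts stay on the Data Engineering list.
        return DATA_ENGINEERING_ALERTS
    # Everything else now goes to the Data Platform alerts list by default.
    return DATA_PLATFORM_ALERTS
```

In this sketch, `alert_recipient("Airflow")` would return the data-engineering-alerts address, while any source not in the pipeline set falls through to data-platform-alerts.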
We haven't made any changes to the monitoring/notification settings of the Search and Query Services servers (Elasticsearch/WDQS/WCQS etc.), nor have we made any changes to the Dumps servers. The change mainly affects the analytics systems https://wikitech.wikimedia.org/wiki/Analytics/Systems and the rest of the Data Engineering team's infrastructure.
Please do let us know if you have any queries or concerns about this change, or if anything doesn't look right to you.
You can reach us on Slack at #data-engineering-collab or #data-platform-sre, on IRC at #wikimedia-analytics or #wikimedia-data-platform, or by email at data-platform-engineering@wikimedia.org.
Kind regards, Ben
To contextualize this a bit more, this is one of the changes discussed in the Alerts Review proposal https://docs.google.com/document/d/1PQKabMx9qoAKQS6qlHJDs2z2B_Bum_KqLYRaZ1pzXGc/edit. We are still seeking feedback on the proposals, so if you haven't read/responded yet, now is a great time!
Thanks, Ben, for your help moving this forward.
Best,
Brian King
SRE, Data Platform/Search Platform
Wikimedia Foundation
IRC: inflatador
On Feb 9, 2024, at 9:52 AM, Ben Tullis btullis@wikimedia.org wrote:
--
Ben Tullis (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation https://wikimediafoundation.org/