To contextualize this a bit more, this is one of the changes discussed in the Alerts
Review proposal
<https://docs.google.com/document/d/1PQKabMx9qoAKQS6qlHJDs2z2B_Bum_KqLYRaZ1pzXGc/edit>.
We are still seeking feedback on the proposals, so if you haven't read or responded yet,
this is a great time!
Thanks to Ben for his help moving this forward.
Best,
Brian King
SRE, Data Platform/Search Platform
Wikimedia Foundation
IRC: inflatador
On Feb 9, 2024, at 9:52 AM, Ben Tullis
<btullis(a)wikimedia.org> wrote:
Hello,
This is just a quick message to let you know that we made some changes today to the
monitoring configuration of many of the Data Platform Engineering servers. This may affect
you if you participate in Ops Week
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Ops_week> for Data Engineering
and friends.
By default, all notification alerts from Icinga and Prometheus will now go to
data-platform-alerts(a)wikimedia.org
<https://groups.google.com/a/wikimedia.org/g/data-platform-alerts> instead of
data-engineering-alerts(a)lists.wikimedia.org
<https://lists.wikimedia.org/hyperkitty/list/data-engineering-alerts@lists.wikimedia.org/>.
We are working to make sure that we can route any alert emails (and IRC pings) to
the most appropriate team, principally so that we don't overload the person who is on
Ops Week with a lot of messages that would be more appropriately routed to Data Platform
SREs.
Alerts for any scheduled tasks related to data pipelines and services critical for data
processing will still be sent to the data-engineering-alerts(a)lists.wikimedia.org
<https://lists.wikimedia.org/hyperkitty/list/data-engineering-alerts@lists.wikimedia.org/>
list. That covers Airflow jobs, Refine tasks, Gobblin, Sqoop, Varnishkafka, Eventlogging,
etc.
We haven't made any changes to the monitoring/notification settings of the Search and
Query Services servers (Elasticsearch/WDQS/WCQS, etc.), nor have we made any changes to the
Dumps servers. This mainly affects the analytics systems
<https://wikitech.wikimedia.org/wiki/Analytics/Systems> and the rest of the Data
Engineering team's infrastructure.
Please do let us know if you have any queries or concerns about this change, or if
anything doesn't look right to you.
You can reach out on Slack at #data-engineering-collab or #data-platform-sre, on IRC at
#wikimedia-analytics or #wikimedia-data-platform, or by email to
data-platform-engineering(a)wikimedia.org
<mailto:data-platform-engineering@wikimedia.org>.
Kind regards,
Ben
--
Ben Tullis (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>