I think they do serve as a good watermark if they are going off in huge numbers, maybe?
Honestly, shinken alerts were so noisy in the past that I have them go to a folder that I
never check (and may want to change that since puppet alerts are better now). I try to
look at it for a general view of what’s bad in the morning but not much else. If I change
that filter on my email, perhaps I’d feel more strongly.
We used to take certain pages (like high load on a storage server — now nearly useless) as
a sign to go find who is killing NFS or what is wrong. Now, I’m hearing from people that
NFS is becoming quite slow at times, but we have no way to really alert on or fix it. I’m
not sure these alerts are a good measure either, so I suppose I’m not against removing
them.
Maybe I should fix my email filter and start checking them instead, though? (That
is a genuine question to see what people think.)
I worry that in most cases, there’s not much to do at this point until we can replace what
we have in storage and k8s. I’m fixing my email filter anyway since the alerts are less
bad, which I should have done ages ago when y’all fixed the puppet alerts :)
Brooke Storm
Operations Engineer
Wikimedia Cloud Services
bstorm(a)wikimedia.org
IRC: bstorm_
On Feb 4, 2019, at 2:20 AM, Giovanni Tirloni
<gtirloni(a)wikimedia.org> wrote:
Hi,
These emails are causing alert fatigue.
We've tweaked the thresholds high enough to make them rare, but they still occur and
we never take any action (in part because there's nothing feasible to be done until we
change our storage situation and/or most workloads are migrated to Kubernetes where we
could implement better controls).
I'd like to propose we disable these alerts for the time being and re-evaluate our
service level indicators when appropriate.
Giovanni Tirloni
Operations Engineer
Wikimedia Cloud Services
On Mon, Feb 4, 2019, 01:47 shinken <shinken(a)shinken-02.shinken.eqiad.wmflabs> wrote:
Notification Type: RECOVERY
Service: High iowait
Host: tools-exec-1419
Address: 10.68.23.223
State: OK
Date/Time: Mon 04 Feb 03:46:59 UTC 2019
Notes URLs:
Additional Info:
OK: All targets OK
_______________________________________________
Cloud-admin mailing list
Cloud-admin(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/cloud-admin