Hi,
These emails are causing alert fatigue.
We've tweaked the thresholds high enough to make them rare, but they still occur and we never take any action (in part because there's nothing feasible to do until we change our storage situation and/or most workloads are migrated to Kubernetes, where we could implement better controls).
I'd like to propose we disable these alerts for the time being and re-evaluate our service level indicators when appropriate.
Giovanni Tirloni Operations Engineer Wikimedia Cloud Services
On Mon, Feb 4, 2019, 01:47 shinken <shinken@shinken-02.shinken.eqiad.wmflabs> wrote:
Notification Type: RECOVERY
Service: High iowait
Host: tools-exec-1419
Address: 10.68.23.223
State: OK
Date/Time: Mon 04 Feb 03:46:59 UTC 2019
Notes URLs:
Additional Info:
OK: All targets OK
On 2/4/19 11:20 AM, Giovanni Tirloni wrote:
I'd like to propose we disable these alerts for the time being and re-evaluate our service level indicators when appropriate.
Works for me.
I think they do serve as a good watermark if they are going off in huge numbers, maybe? Honestly, shinken alerts were so noisy in the past that I have them go to a folder that I never check (and may want to change that since puppet alerts are better now). I try to look at it for a general view of what’s bad in the morning but not much else. If I change that filter on my email, perhaps I’d feel more strongly.
We used to take certain pages (like high load on a storage server — now nearly useless) as a sign to go find who is killing NFS or what is wrong. Now, I’m hearing from people that NFS is becoming quite slow at times, but we have no way to really alert on or fix it. I’m not sure these alerts are a good measure either, so I suppose I’m not against removing them.
Maybe I should fix my email filter and start checking them, though, instead? <— Which is a genuine question to see what people think.
I worry that in most cases, there’s not much to do at this point until we can replace what we have in storage and k8s. I’m fixing my email filter anyway since the alerts are less bad, which I should have done ages ago when y’all fixed the puppet alerts :)
Brooke Storm Operations Engineer Wikimedia Cloud Services bstorm@wikimedia.org IRC: bstorm_
I have all alert emails set to notify me when they arrive, so I read all of them.
However, I've given up investigating high iowait alerts because it's a never-ending job (stare at iotop/top, try to figure out what might be causing it, maybe kill a process, etc. -- then an hour has passed).
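For illustration, that triage loop is roughly the sketch below (my own Python example, assuming Linux /proc on the exec nodes, not a tool we actually have): it just lists the processes sitting in uninterruptible sleep, which is usually what pushes iowait up.

#!/usr/bin/env python3
# Rough sketch (untested, an assumption of how the manual triage could be
# scripted): list processes currently in uninterruptible sleep (state D),
# which is usually what is behind high iowait. Needs root to see everything.
import os

def d_state_processes():
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open(f'/proc/{pid}/status') as f:
                fields = dict(line.split(':\t', 1) for line in f if ':\t' in line)
            if not fields.get('State', '').startswith('D'):
                continue
            with open(f'/proc/{pid}/cmdline') as f:
                cmd = f.read().replace('\0', ' ').strip() or fields['Name'].strip()
            yield int(pid), cmd
        except (FileNotFoundError, PermissionError):
            continue  # process exited or is not readable; skip it

if __name__ == '__main__':
    for pid, cmd in d_state_processes():
        print(f'{pid:>7}  {cmd}')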
Is there something else we could alert on that would be more meaningful to the end user experience (and allow us to take some action)?
Re: relying on aggregated alert emails to indicate something, that requires human judgment (look at the email folder, read the emails, count how many, decide whether it looks bad, etc.). I'd rather look at a Grafana dashboard (since Prometheus is still collecting iowait) _when_ there's a real problem; we just need to define what a real problem looks like.
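As a strawman for what "a real problem" could look like as a query rather than a Shinken email, something like the sketch below; the Prometheus URL, the node_cpu_seconds_total metric name and the 30%/10-minute threshold are all guesses and would need to match what our Prometheus actually exposes:

#!/usr/bin/env python3
# Strawman check against the Prometheus HTTP API: which instances have
# averaged more than 30% iowait over the last 10 minutes? The endpoint,
# metric name and threshold below are assumptions, not our real config.
import requests

PROMETHEUS = 'http://prometheus.example.org:9090'   # hypothetical URL
QUERY = ('avg by (instance) '
         '(rate(node_cpu_seconds_total{mode="iowait"}[10m])) > 0.30')

resp = requests.get(f'{PROMETHEUS}/api/v1/query', params={'query': QUERY},
                    timeout=10)
resp.raise_for_status()

for result in resp.json()['data']['result']:
    instance = result['metric']['instance']
    value = float(result['value'][1])
    print(f'{instance}: {value:.0%} iowait over the last 10 minutes')

The appeal is that the threshold would live in a single query we can iterate on from Grafana, rather than in Shinken check definitions.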
On Feb 4, 2019, at 10:48 AM, Giovanni Tirloni <gtirloni@wikimedia.org> wrote:
Re: relying on aggregated alert emails to indicate something, that requires human judgment (look at the email folder, read the emails, count how many, decide whether it looks bad, etc.). I'd rather look at a Grafana dashboard (since Prometheus is still collecting iowait) _when_ there's a real problem; we just need to define what a real problem looks like.
Definitely! We used to determine real problems via load numbers on the NFS servers…you know how that ended up. :)
Overall, while I’ve fixed my email filter, I think I’m with you on this. Let’s nix those alerts.