On Thu, Mar 20, 2014 at 1:47 PM, Faidon Liambotis <faidon@wikimedia.org> wrote:

On Thu, Mar 20, 2014 at 03:52:01AM -0700, Ori Livneh wrote:
> The Upstart job for EventLogging is configured to re-spawn the writer, up
> to a certain threshold of failures. Because the writer repeatedly failed to
> connect, it hit the threshold, and was not re-spawned.

This sounds like a bug: a temporary issue (database unavailability, for
whatever reason) resulted in a permanent crash of the service that
needed manual action to restore. This needs to be fixed.
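
To illustrate one possible fix: the writer could retry the database
connection itself, with a growing delay, instead of exiting and using up
Upstart's respawn budget. What follows is only a minimal sketch in
Python, not the actual EventLogging code. The function name
connect_with_backoff, its parameters, and the connect callable passed
to it are all hypothetical.

```python
import time


def connect_with_backoff(connect, base_delay=1.0, max_delay=60.0,
                         max_attempts=None, retry_on=(Exception,),
                         sleep=time.sleep):
    """Call connect() until it succeeds, sleeping between failures.

    The delay doubles after every failed attempt and is capped at
    max_delay. With max_attempts=None the loop retries indefinitely,
    so a transient outage never turns into a permanent exit.
    """
    delay = base_delay
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except retry_on:
            # Give up (re-raise) only if a finite attempt budget is set
            # and has been exhausted.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

With something along these lines, the process stays up during a database
outage and reconnects once the database is back. Whether that is better
than a more lenient respawn limit in the job definition is a judgment
call.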

> This alert was not responded to. I finally got pinged by Tillman, who
> noticed the blog visitor stats report was blank, and by Gilles, who noticed
> image loading performance data was missing.
>
> We have to fix this. The level of maintenance that EventLogging gets is not
> proportional to its usage across the organization. Analytics, I really need
> you to step up your involvement.

I can't comment on the general involvement of analytics in this area,
but I do think that responding to Icinga alerts is primarily a techops
responsibility. We can and should escalate as necessary, and it's
obviously always nice & appreciated to see non-ops people lurking around
in #wikimedia-operations and jumping in on failures, but I don't think
I'd blame anyone else for not reacting to an alert, especially in this
case: anyone could have concluded after a trivial investigation that a
simple restart of the upstart job would fix this (AIUI).

> Finally, I think EventLogging Icinga alerts should have a higher profile,
> and possibly page someone. Issues can usually be debugged using the
> eventloggingctl tool on Vanadium and by inspecting the log files on
> vanadium:/var/log/upstart/eventlogging-*.

We generally try to keep paging to a minimum: first, for our own
sanity :), but more importantly, because if your phone keeps beeping
all day, you become accustomed to it and it becomes easier to ignore
a "site is down" alert.

IMO, pages are for very serious alerts. That doesn't mean that the other
(CRITICAL but non-paging) alerts are meant to be ignored for days. In my
experience, very few opsens actively monitor the Icinga unhandled
services page (let alone fix random issues or even their own issues as
they see them), and I think we can do better than that.

I personally check that page several times a day, as well as the
IRC log, but I do wonder what others do or how they feel about this,
especially as we've agreed to scale up the number of checks (and hence
alerts) that we have.

Faidon

_______________________________________________
Ops mailing list
Ops@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ops