On Thu, Mar 20, 2014 at 03:52:01AM -0700, Ori Livneh wrote:
The Upstart job for EventLogging is configured to re-spawn the writer, up to a certain threshold of failures. Because the writer repeatedly failed to connect, it hit the threshold, and was not re-spawned.
This sounds like a bug: a temporary issue (database unavailability, for whatever reason) should not result in the service staying down permanently and needing manual action to restore. This needs to be fixed.
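For context, the failure mode described above comes from Upstart's respawn limit: once a job dies too many times within a window, Upstart marks it stopped and gives up. A hedged sketch of the relevant stanzas (the file path, job name, and numbers here are assumptions, not the actual config on vanadium):

```
# /etc/init/eventlogging-consumer.conf  (hypothetical path and values)
respawn

# By default Upstart stops respawning after 10 failures in 5 seconds.
# A more generous limit -- or "respawn limit unlimited" on newer Upstart
# versions -- keeps a transient DB outage from killing the job for good:
respawn limit 50 300
```

The trade-off is that an unlimited respawn loop on a genuinely broken job can spam logs, so pairing a generous limit with a working Icinga alert is probably the sane middle ground.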
This alert was not responded to. I finally got pinged by Tillman, who noticed the blog visitor stats report was blank, and by Gilles, who noticed image loading performance data was missing.
We have to fix this. The level of maintenance that EventLogging gets is not proportional to its usage across the organization. Analytics, I really need you to step up your involvement.
I can't comment on the general involvement of Analytics in this area, but I do think that responding to Icinga alerts is primarily a TechOps responsibility. We can and should escalate as necessary, and it's obviously always nice & appreciated to see non-ops people lurking around in #wikimedia-operations and jumping in on failures, but I don't think I'd blame anyone else for not reacting to an alert. Especially in this case: anyone doing even a trivial investigation could have come to the conclusion that a simple restart of the Upstart job would fix this (AIUI).
Finally, I think EventLogging Icinga alerts should have a higher profile, and possibly page someone. Issues can usually be debugged using the eventloggingctl tool on Vanadium and by inspecting the log files on vanadium:/var/log/upstart/eventlogging-*.
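To make that triage path concrete, a session on vanadium might look roughly like the following. The `eventloggingctl` tool and the log directory are from the email above; the exact subcommands shown are assumptions and should be checked against the tool itself:

```shell
# Hypothetical triage session on vanadium; subcommand names are guesses.
eventloggingctl status                          # is the writer running?
tail -n 100 /var/log/upstart/eventlogging-*.log # why did it die?
eventloggingctl restart                         # bring it back up
```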
We generally try to keep paging to a minimum. First, for our personal sanity :), but more importantly because if your phone keeps beeping all day, you become accustomed to it and it becomes easier to ignore a "site is down" alert.
IMO, pages are for very serious alerts. That doesn't mean that the other (CRITICAL but non-paging) alerts are meant to be ignored for days. In my experience, very few opsens actively monitor the Icinga unhandled-services page (let alone fix random issues, or even their own, as they see them), and I think we can do better than that.
I personally check that page several times a day, as well as the IRC log, but I do wonder what others do or how they feel about this, especially as we've agreed to scale up the number of checks (and hence alerts) that we have.
Faidon