On Thu, Mar 20, 2014 at 03:52:01AM -0700, Ori Livneh wrote:
The Upstart job for EventLogging is configured to re-spawn the writer, up to a certain threshold of failures. Because the writer repeatedly failed to connect, it hit the threshold, and was not re-spawned.
This sounds like a bug: a temporary issue (database unavailability, for whatever reason) should not result in the service staying down permanently and needing manual action to restore. This needs to be fixed.
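For context, the failure mode described above comes from Upstart's respawn limit: once a job dies too many times within a window, Upstart marks it stopped and gives up. A hedged sketch of the relevant stanzas (the file path, job name, and numbers here are assumptions, not the actual config on vanadium):

```
# /etc/init/eventlogging-consumer.conf  (hypothetical path and values)
respawn

# By default Upstart stops respawning after 10 failures in 5 seconds.
# A more generous limit -- or "respawn limit unlimited" on newer Upstart
# versions -- keeps a transient DB outage from killing the job for good:
respawn limit 50 300
```

The trade-off is that an unlimited respawn loop on a genuinely broken job can spam logs, so pairing a generous limit with a working Icinga alert is probably the sane middle ground.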
This alert was not responded to. I finally got pinged by Tillman, who noticed the blog visitor stats report was blank, and by Gilles, who noticed image loading performance data was missing.
We have to fix this. The level of maintenance that EventLogging gets is not proportional to its usage across the organization. Analytics, I really need you to step up your involvement.
I can't comment on the general involvement of Analytics in this area, but I do think that responding to Icinga alerts is primarily a TechOps responsibility. We can and should escalate as necessary, and it's obviously always nice & appreciated to see non-ops people lurking around in #wikimedia-operations and jumping in on failures, but I don't think I'd blame anyone else for not reacting to an alert. Especially in this case: anyone doing even a trivial investigation could have come to the conclusion that a simple restart of the Upstart job would fix this (AIUI).
Finally, I think EventLogging Icinga alerts should have a higher profile, and possibly page someone. Issues can usually be debugged using the eventloggingctl tool on Vanadium and by inspecting the log files on vanadium:/var/log/upstart/eventlogging-*.
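To make that triage path concrete, a session on vanadium might look roughly like the following. The `eventloggingctl` tool and the log directory are from the email above; the exact subcommands shown are assumptions and should be checked against the tool itself:

```shell
# Hypothetical triage session on vanadium; subcommand names are guesses.
eventloggingctl status                          # is the writer running?
tail -n 100 /var/log/upstart/eventlogging-*.log # why did it die?
eventloggingctl restart                         # bring it back up
```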
We generally try to keep paging to a minimum. First, for our personal sanity :), but more importantly because if your phone keeps beeping all day, you become accustomed to it and it becomes easier to ignore a "site is down" alert.
IMO, pages are for very serious alerts. That doesn't mean that the other (CRITICAL but non-paging) alerts are meant to be ignored for days. In my experience, very few opsens actively monitor the Icinga unhandled-services page (let alone fix random issues, or even their own, as they see them), and I think we can do better than that.
I personally check that page several times a day, as well as the IRC log, but I do wonder what others do or how they feel about this, especially as we've agreed to scale up the number of checks (and hence alerts) that we have.
Faidon