We will work with Ori to understand what level of effort is required to
support EventLogging. It's likely that Analytics and techops (and Ori) will
need to collaborate on what will need to be done.
Faidon -- I would pull in Andrew, but I'm really concerned about his
workload with the many tasks that need to be done to productize
Kafka/Hadoop. Can you identify another resource who might be able to help
(set up/configure monitoring for example)
On Thu, Mar 20, 2014 at 1:47 PM, Faidon Liambotis <faidon@wikimedia.org> wrote:
On Thu, Mar 20, 2014 at 03:52:01AM -0700, Ori Livneh wrote:
> The Upstart job for EventLogging is configured to re-spawn the writer, up
> to a certain threshold of failures. Because the writer repeatedly failed
> to connect, it hit the threshold, and was not re-spawned.
This sounds like a bug. A temporary issue (database unavailability, for
whatever reason) resulted in a permanent crash of the service, needing
manual action to restore. This needs to be fixed.
> This alert was not responded to. I finally got pinged by Tillman, who
> noticed the blog visitor stats report was blank, and by Gilles, who
> noticed that image loading performance data was missing.
> We have to fix this. The level of maintenance that EventLogging gets is
> proportional to its usage across the organization. Analytics, I really
> need you to step up your involvement.
I can't comment on the general involvement of analytics in this area,
but I do think that responding to Icinga alerts is primarily a techops
responsibility. We can and should escalate as necessary, and it's
obviously always nice & appreciated to see non-ops people lurking around
in #wikimedia-operations and jumping in on failures, but I don't think
I'd blame anyone else for not reacting to an alert. Especially in this
case, anyone doing even a trivial investigation could have come to the
conclusion that a simple restart of the Upstart job would fix it.
> Finally, I think EventLogging Icinga alerts should have a higher profile,
> and possibly page someone. Issues can usually be debugged using the
> eventloggingctl tool on Vanadium and by inspecting the log files on
We generally try to keep paging to a minimum. First, for our personal
sanities :), but more importantly, because if your phone keeps beeping
all day, you become accustomed to it and it will become easier to ignore
a "site is down" alert.
IMO, pages are for very serious alerts. That doesn't mean that the other
(CRITICAL but non-paging) alerts are meant to be ignored for days. In my
experience, very few opsens actively monitor the Icinga unhandled
services page (let alone fix random issues, or even their own, as they
see them), and I think we can do better than that.
I personally check that page several times within my day, as well as the
IRC log, but I do wonder what others do or how they feel about this,
especially as we've agreed to scale up the amount of checks (and hence
alerts) that we have.
Ops mailing list