We will work with Ori to understand what level of effort is required to support EventLogging. It's likely that Analytics and techops (and Ori) will need to collaborate on what will need to be done.

Faidon -- I would pull in Andrew, but I'm really concerned about his workload, given the many tasks that need to be done to productize Kafka/Hadoop. Can you identify another resource who might be able to help (for example, setting up and configuring monitoring)?


On Thu, Mar 20, 2014 at 1:47 PM, Faidon Liambotis <faidon@wikimedia.org> wrote:
On Thu, Mar 20, 2014 at 03:52:01AM -0700, Ori Livneh wrote:
> The Upstart job for EventLogging is configured to re-spawn the writer, up
> to a certain threshold of failures. Because the writer repeatedly failed to
> connect, it hit the threshold, and was not re-spawned.

This sounds like a bug: a temporary issue (database unavailability, for
whatever reason) should not result in a permanent crash of the service
that needs manual action to restore. This needs to be fixed.
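For what it's worth, the failure mode Ori describes matches Upstart's default respawn limit (10 respawns within 5 seconds), which a briefly unavailable database can trip almost immediately. A minimal sketch of a more forgiving job definition follows; the file path, service name, and chosen limits are assumptions for illustration, not the actual EventLogging config:

```
# /etc/init/eventlogging-consumer.conf -- hypothetical path and name
description "EventLogging writer (sketch)"

respawn
# Tolerate up to 10 restarts within a 300-second window before giving
# up; the Upstart default of "10 5" (10 respawns in 5 seconds) trips
# easily when a dependency such as the database is briefly down.
respawn limit 10 300
```

Alternatively (or additionally), the writer itself could retry the database connection with backoff instead of exiting, so that Upstart's respawn logic is never the last line of defense.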

> This alert was not responded to. I finally got pinged by Tillman, who
> noticed the blog visitor stats report was blank, and by Gilles, who noticed
> image loading performance data was missing.
> We have to fix this. The level of maintenance that EventLogging gets is not
> proportional to its usage across the organization. Analytics, I really need
> you to step up your involvement.

I can't comment on the general involvement of analytics in this area,
but I do think that responding to Icinga alerts is primarily a techops
responsibility. We can and should escalate as necessary, and it's
obviously always nice & appreciated to see non-ops people lurking around
in #wikimedia-operations and jumping in on failures, but I don't think
I'd blame anyone else for not reacting to an alert. Especially in this
case, where anyone doing even a trivial investigation could have come to
the conclusion that a simple restart of the Upstart job would fix this.

> Finally, I think EventLogging Icinga alerts should have a higher profile,
> and possibly page someone. Issues can usually be debugged using the
> eventloggingctl tool on vanadium and by inspecting the log files under
> vanadium:/var/log/upstart/eventlogging-*.

We generally try to keep paging to a minimum. First, for our personal
sanities :), but more importantly, because if your phone keeps beeping
all day, you become accustomed to it and it will become easier to ignore
a "site is down" alert.

IMO, pages are for very serious alerts. That doesn't mean that the other
(CRITICAL but non-paging) alerts are meant to be ignored for days. In my
experience, I see very few opsens actively monitoring the Icinga unhandled
services page (let alone fix random issues or even their own issues as
they see them) and I think we can do better than that.

I personally check that page several times a day, as well as the
IRC log, but I do wonder what others do or how they feel about this,
especially as we've agreed to scale up the amount of checks (and hence
alerts) that we have.


Ops mailing list