Thank you for the detailed write-up, Ori.

We have to fix this. The level of maintenance that EventLogging gets is not proportional to its usage across the organization. Analytics, I really need you to step up your involvement.

It was not long ago that EventLogging was running reliably for months at a time. What has changed is not system load, but the owner seat becoming vacant, leading to a gradual deterioration of the quality of monitoring and auditing practices.

Indeed, the owner seat is vacant. According to a recent discussion on the analytics list, we did not yet consider ourselves the proper owners of EventLogging. Our sprint planning is today; I'll bring it up there and note its importance in light of this downtime.

Sean proposed moving the EventLogging database to m2, so that it runs on separate hardware from the research databases. I think he's right. I filed <https://rt.wikimedia.org/Ticket/Display.html?id=7081> to request the migration.

Thank you, I support isolation.

Finally, I think EventLogging Icinga alerts should have a higher profile, and possibly page someone. Issues can usually be debugged using the eventloggingctl tool on vanadium and by inspecting the log files on vanadium:/var/log/upstart/eventlogging-*.

I think this is the key reason the failure was ignored, so I agree here. We should at the very least forward these alerts as an email to analytics devs. I have no idea how to do that; if anyone would like to help, that'd be great.
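
To make the log inspection step above concrete, here is a rough sketch (not an Icinga configuration, and not how alerts are forwarded today) that gathers the last lines of each log file under the path quoted above and packs them into an email message. The recipient address, the function names, and the 20-line default are all assumptions for illustration; the script only builds the message and does not send it.

```python
import glob
from collections import deque
from email.message import EmailMessage

# Path pattern quoted earlier in this thread.
LOG_GLOB = "/var/log/upstart/eventlogging-*"
# Hypothetical recipient; replace with the real analytics address.
RECIPIENT = "analytics-devs@example.org"


def tail(path, n=20):
    """Return the last n lines of a text file."""
    with open(path, errors="replace") as f:
        return list(deque(f, maxlen=n))


def build_report(pattern=LOG_GLOB, n=20, recipient=RECIPIENT):
    """Build (but do not send) an email with the tail of each matching log."""
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = "EventLogging log tails"
    parts = []
    for path in sorted(glob.glob(pattern)):
        parts.append("== %s ==\n%s" % (path, "".join(tail(path, n))))
    msg.set_content("\n".join(parts) or "No log files matched.\n")
    return msg


if __name__ == "__main__":
    # Print the message; handing it to smtplib or a local MTA is left out.
    print(build_report())
```

Run on vanadium, this would print a message ready to hand to a mailer. Wiring the alerts themselves into email would still need someone who knows the Icinga setup.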