At about 2014-03-18 00:04 UTC, db1047 stopped accepting incoming connections. At some point during the subsequent hour, MariaDB either crashed or was manually restarted. Sean noticed that the database was choking on some queries from the researchers and notified the wmfresearch list.
While the database server was down or rejecting connections, the EventLogging writer that writes to db1047 repeatedly failed to connect to it:
sqlalchemy.exc.OperationalError: (OperationalError) (2003, "Can't connect to MySQL server on 'db1047.eqiad.wmnet' (111)")
The Upstart job for EventLogging is configured to re-spawn the writer, up to a certain threshold of failures. Because the writer repeatedly failed to connect, it hit the threshold, and was not re-spawned.
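To make the failure mode concrete, here is a rough Python sketch of a respawn limit of this kind. It is a simplified model, not Upstart's implementation and not the actual EventLogging job configuration: the class name, the simulate helper, and the limit and interval values are all hypothetical.

```python
from collections import deque


class RespawnLimiter:
    """Allow at most `limit` respawns within any `interval`-second window.

    Hypothetical model of a supervisor's respawn threshold; the real
    values used by the EventLogging Upstart job are not assumed here.
    """

    def __init__(self, limit, interval):
        self.limit = limit
        self.interval = interval
        self.recent = deque()  # timestamps of recent respawns

    def allow_respawn(self, now):
        # Forget respawns that have fallen out of the time window.
        while self.recent and now - self.recent[0] > self.interval:
            self.recent.popleft()
        if len(self.recent) >= self.limit:
            return False  # threshold hit: the job stays stopped
        self.recent.append(now)
        return True


def simulate(connect, limiter, clock_step=0.1):
    """Start the writer until it connects or the limiter gives up.

    `connect` is any callable returning True on success.
    Returns (final_state, number_of_start_attempts).
    """
    now = 0.0
    attempts = 0
    while True:
        attempts += 1
        if connect():
            return "running", attempts
        if not limiter.allow_respawn(now):
            return "stopped", attempts
        now += clock_step


if __name__ == "__main__":
    # A database that never comes back within the window.
    print(simulate(lambda: False, RespawnLimiter(limit=3, interval=5)))
```

With a database that stays unreachable for longer than the window, every start attempt fails quickly, the respawn budget is used up, and the job ends in the stopped state. That is where the db1047 writer ended up.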
This triggered an Icinga alert:
[00:04:24] <icinga-wm> PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-db1047
This alert was not responded to. I finally got pinged by Tillman, who noticed the blog visitor stats report was blank, and by Gilles, who noticed image loading performance data was missing.
We have to fix this. The level of maintenance that EventLogging gets is not proportional to its usage across the organization. Analytics, I really need you to step up your involvement.
It was not long ago that EventLogging was running reliably for months at a time. What has changed is not system load, but the owner seat becoming vacant, leading to a gradual deterioration of the quality of monitoring and auditing practices.
Sean proposed moving the EventLogging database to m2, so that it runs on separate hardware from the research databases. I think he's right. I filed <https://rt.wikimedia.org/Ticket/Display.html?id=7081> to request the migration.
There is some code rot around the Ganglia and Graphite monitoring code for EventLogging. I don't think it would take much to fix. Could the Analytics team take this on?
The Puppet code is well-documented. <https://wikitech.wikimedia.org/wiki/EventLogging> could use some updating, but it is mostly current.
Finally, I think EventLogging Icinga alerts should have a higher profile, and possibly page someone. Issues can usually be debugged using the eventloggingctl tool on vanadium and by inspecting the log files on vanadium:/var/log/upstart/eventlogging-*.
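As a starting point for the log inspection, something like the following Python sketch could count connection failures per log file. It assumes the logs are plain text and that failures look like the OperationalError (2003) line quoted above; the function names are hypothetical, and it does not use or replace eventloggingctl.

```python
import glob
import os

# Substrings taken from the error quoted earlier in this report.
ERROR_MARKER = "OperationalError"
CONNECT_ERRNO = "2003"


def count_connection_errors(lines):
    """Count lines that look like a failed MySQL connection attempt."""
    return sum(
        1 for line in lines
        if ERROR_MARKER in line and CONNECT_ERRNO in line
    )


def scan_logs(pattern="/var/log/upstart/eventlogging-*"):
    """Return {path: error_count} for every regular file matching pattern."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue
        # errors="replace" keeps the scan going on undecodable bytes.
        with open(path, errors="replace") as handle:
            counts[path] = count_connection_errors(handle)
    return counts


if __name__ == "__main__":
    for path, count in scan_logs().items():
        print(f"{count:6d}  {path}")
```

A file with a large count right before the job stopped would point at the database being unreachable rather than at a bug in the writer.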
--- Ori Livneh ori@wikimedia.org