At about 2014-03-18 00:04 UTC, db1047 stopped accepting incoming
connections. At some point during the following hour, MariaDB either
crashed or was manually restarted. Sean noticed that the database was
choking on some queries from the researchers and notified the wmfresearch
list.
During the time that the database server was down or rejecting connections,
the EventLogging writer that writes to db1047 was repeatedly failing to
connect to it:
sqlalchemy.exc.OperationalError: (OperationalError) (2003, "Can't connect
to MySQL server on 'db1047.eqiad.wmnet' (111)")
The Upstart job for EventLogging is configured to re-spawn the writer, up
to a certain threshold of failures. Because the writer repeatedly failed to
connect, it hit the threshold, and was not re-spawned.
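
To make the respawn threshold concrete, here is a minimal Python sketch of
a policy of the form "at most `count` respawns within `interval` seconds",
which is roughly what Upstart's `respawn limit` stanza expresses. The class
and function names are hypothetical, the numbers are illustrative rather
than the actual EventLogging job settings, and Upstart's exact windowing
may differ from this model.

```python
from collections import deque


class RespawnLimiter:
    """Toy model of an 'at most COUNT respawns per INTERVAL seconds' policy.

    Hypothetical helper for illustration; not Upstart's implementation.
    """

    def __init__(self, count, interval):
        self.count = count
        self.interval = interval
        self.respawns = deque()  # timestamps of recent respawns

    def allow_respawn(self, now):
        # Forget respawns that fell out of the sliding window.
        while self.respawns and now - self.respawns[0] > self.interval:
            self.respawns.popleft()
        if len(self.respawns) >= self.count:
            return False  # threshold hit: the job stays stopped
        self.respawns.append(now)
        return True


def simulate(failure_times, count=10, interval=5.0):
    """Return the index of the failure after which the job is not
    re-spawned, or None if the limit is never reached.

    The defaults are assumed values, chosen only for the example.
    """
    limiter = RespawnLimiter(count, interval)
    for i, t in enumerate(failure_times):
        if not limiter.allow_respawn(t):
            return i
    return None
```

With a writer that fails every 100 ms, `simulate([0.1 * i for i in
range(20)])` gives up at the eleventh failure, whereas failures spaced a
second apart never trip the limit. Once the limit is hit, the job stays
stopped until someone starts it by hand.
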
The stopped job triggered an Icinga alert:
[00:04:24] <icinga-wm> PROBLEM - Check status of defined EventLogging jobs
on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs:
consumer/mysql-db1047
Nobody responded to this alert. I was eventually pinged by Tillman, who
noticed that the blog visitor stats report was blank, and by Gilles, who
noticed that image loading performance data was missing.
We have to fix this. The level of maintenance that EventLogging gets is not
proportional to its usage across the organization. Analytics, I really need
you to step up your involvement.
It was not long ago that EventLogging was running reliably for months at a
time. What has changed is not system load, but the owner seat becoming
vacant, leading to a gradual deterioration of the quality of monitoring and
auditing practices.
Sean proposed moving the EventLogging database to m2, so that it runs on
separate hardware from the research databases. I think he's right. I filed
<https://rt.wikimedia.org/Ticket/Display.html?id=7081> to request the
migration.
There is some bit rot in the Ganglia and Graphite monitoring code for
EventLogging. I don't think it would take much to fix. Could the Analytics
team take this on?
The Puppet code is well-documented.
<https://wikitech.wikimedia.org/wiki/EventLogging> could use some updating,
but it is mostly current.
Finally, I think EventLogging Icinga alerts should have a higher profile,
and possibly page someone. Issues can usually be debugged using the
eventloggingctl tool on vanadium and by inspecting the log files on
vanadium:/var/log/upstart/eventlogging-*.
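
As a rough illustration of the log inspection step, the following Python
sketch counts connection failures like the one quoted above in the Upstart
logs. It assumes the logs are plain text and contain the SQLAlchemy error
verbatim; the function names are hypothetical, and it is not part of
EventLogging or eventloggingctl.

```python
import glob
import re

# Matches the error quoted earlier in this message; 2003 is the MySQL
# client error code that appears in that traceback.
CONNECT_FAILURE = re.compile(r"OperationalError.*\(2003,")


def count_connect_failures(lines):
    """Count lines that look like a failed connection to the database."""
    return sum(1 for line in lines if CONNECT_FAILURE.search(line))


def scan_logs(pattern="/var/log/upstart/eventlogging-*"):
    """Return {path: failure count} for each uncompressed log file.

    Assumption: rotated logs may be gzip-compressed; those are skipped
    here to keep the sketch short.
    """
    counts = {}
    for path in sorted(glob.glob(pattern)):
        if path.endswith(".gz"):
            continue
        with open(path, errors="replace") as f:
            counts[path] = count_connect_failures(f)
    return counts
```

A sudden run of such lines in one consumer's log, followed by silence,
would be consistent with the writer having hit its respawn threshold.
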
---
Ori Livneh
ori(a)wikimedia.org