Hi!
Well, we only noticed what was up due to this email!
Take a look at
https://phabricator.wikimedia.org/T119915
Yes, we need to look into it. The problem is that the service has two
failure modes:
1. Completely dead, rejecting all queries. This would be caught by
Icinga and alerted on.
2. Crawling slowly, but still partially alive, just performing very
badly. For this one, we do not have an adequate alerting system. This
failure mode is rare, but we've seen it happen, both due to somebody
sending a torrent of heavy queries and due to some bug scenarios. Icinga
does not catch that because it only checks very basic queries, and those
still complete within the timeout.
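
To make the gap concrete, here is a rough sketch of what a check for the
second failure mode might look like. It is only an illustration, not
what we run today: the function name, the thresholds and the idea of
probing with several representative queries are all assumptions. The
OK/WARNING/CRITICAL names just follow the usual monitoring-plugin
convention.

```python
from typing import Optional, Sequence


def classify_probe(
    latencies: Sequence[Optional[float]],
    slow_after: float,
    min_slow_fraction: float = 0.5,
) -> str:
    """Classify one round of probe queries (hypothetical helper).

    latencies holds the response time in seconds for each probe query,
    or None if the query was rejected or timed out.
    """
    if not latencies:
        raise ValueError("need at least one probe result")

    # Failure mode 1: every probe failed, the service is dead.
    failed = sum(1 for t in latencies if t is None)
    if failed == len(latencies):
        return "CRITICAL"

    # Failure mode 2: the service still answers, but too many probes
    # are slow (or failing). A failed probe counts as slow here.
    slow = sum(1 for t in latencies if t is None or t > slow_after)
    if slow / len(latencies) >= min_slow_fraction:
        return "WARNING"

    return "OK"
```

The point is that the probes would have to include something heavier
than the very basic queries, with a threshold well below the hard
timeout, otherwise the slow mode still passes as healthy.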
--
Stas Malyshev
smalyshev(a)wikimedia.org