[Labs-l] Custom nagios checks

Tim Landscheidt tim at tim-landscheidt.de
Mon Feb 3 20:57:38 UTC 2014


Petr Bena <benapetr at gmail.com> wrote:

> I think it's time to finally make it possible for users to create
> their own checks for nagios (icinga).

> I will try to document how the current icinga is set up on
> https://wikitech.wikimedia.org/wiki/Icinga/Labs

> Currently there is a nagiosbuilder, which is a Python script made by
> Damian that queries LDAP and builds nagios cfg files based on that.

> My idea is to create configuration files / templates for this
> nagiosbuilder so that it would apply different options for certain
> hosts based on this configuration. Users would just

>  * create their own check, place it somewhere on the server whose
>    service they want to monitor, and add it to nrpe
>    (/etc/nagios/nrpe.d/yourservice.cfg)
>  * use some interface (to be discussed) to register this check for a
>    specific host

> and nagiosbuilder would

> * query that interface in order to generate configuration files
> * based on this config, set up custom services for these hosts
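
A minimal sketch of what that generation step could look like, just to
make the proposal concrete.  Everything in it is an assumption for
illustration: the function and variable names are made up, it is not
nagiosbuilder's actual code, "generic-service" is the template name from
the stock sample configuration, and "check_nrpe!<command>" presumes a
check_nrpe command definition that takes the NRPE command name as its
first argument.

```python
# Hypothetical sketch, not nagiosbuilder's real code.
#
# On the monitored instance, the user's NRPE snippet
# (/etc/nagios/nrpe.d/yourservice.cfg) would contain something like:
#   command[check_yourservice]=/path/to/check_yourservice
# (the plugin path is a placeholder).

SERVICE_TEMPLATE = """define service {{
    use                     generic-service
    host_name               {host}
    service_description     {description}
    check_command           check_nrpe!{nrpe_command}
}}
"""


def render_services(host, checks):
    """Return Nagios service definitions for one host.

    `checks` is a list of dicts with the keys 'description' and
    'nrpe_command', e.g. as read from whatever interface is chosen.
    """
    return "\n".join(
        SERVICE_TEMPLATE.format(host=host, **check) for check in checks
    )


if __name__ == "__main__":
    # Stand-in for the data the (yet to be discussed) interface returns.
    custom_checks = {
        "myinstance": [
            {"description": "My service",
             "nrpe_command": "check_yourservice"},
        ],
    }
    for host, checks in sorted(custom_checks.items()):
        print(render_services(host, checks))
```

The point of keeping the rendering separate from the data source is that
the same function would work whether the checks come from files in
gerrit or from a property stored per node.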

> For us (developers) it's easiest to use gerrit as this interface, so
> that people would directly update the configuration files used by
> nagiosbuilder; however, that pretty much sucks, so I think it would be
> better to create a new interface in labsconsole, so that people can
> define their nagios checks directly as a property of each node.

> That of course would require more coding and ops assistance, but I
> think it's doable. Any opinions?

a) Great to see some progress on that front :-).

b) I think differences between production and Labs hosts
   should be minimal.  If we have to add (and sync!) checks
   manually, it's gonna be a big mess.  If I configure an
   instance with the class redis, that class's monitoring
   should be used without any further intervention.  (If I
   set up my own puppetmaster and use changes not committed
   to the WMF repository yet, that would be an acceptable
   exception.)
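
To illustrate b), here is a rough sketch in which the checks for an
instance follow purely from the classes applied to it, so nothing has to
be added or synced by hand.  The mapping, the check names and the
function are hypothetical; the table only stands in for whatever
monitoring a class itself declares.

```python
# Hypothetical mapping from Puppet class names to the checks they imply.
CLASS_CHECKS = {
    "redis": [("Redis", "check_redis")],
}


def checks_for_instance(puppet_classes, class_checks=CLASS_CHECKS):
    """Collect (description, nrpe_command) pairs for an instance.

    Classes without known monitoring contribute nothing; duplicates are
    dropped while the order of first appearance is kept.
    """
    seen = set()
    result = []
    for cls in puppet_classes:
        for check in class_checks.get(cls, []):
            if check not in seen:
                seen.add(check)
                result.append(check)
    return result
```

With this, an instance configured with the class redis gets the Redis
check without any further step, and an instance with no monitored
classes simply gets an empty list.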

I assume this is most important for Beta
(cf. https://bugzilla.wikimedia.org/51497), but all other
projects managed by operations/puppet would profit from that
as well.  Comment #2 of that bug refers to security reasons
for why we can't copy the production setup verbatim, but
before we introduce another system, I think we should try to
mitigate those concerns first, i.e. see what users can
fiddle with without review and what impact that has on the
processes run on the puppetmaster and/or Icinga.

What we will need some sort of UI for is the configuration
of alerts.  I assume not every project wants to receive
mails or IRC messages whenever something's broken, and even
within one project there may be different levels of interest.

Tim



