[Labs-l] Custom nagios checks

Petr Bena benapetr at gmail.com
Tue Feb 4 08:14:00 UTC 2014


I think Ryan said something like he would most happily get rid of
puppet or replace it with a better solution :P But if you really want
to keep stuff managed by puppet, I still see an issue with other
projects which aren't using puppet, or which use a different
puppetmaster.

To be honest, from my point of view, puppet as it is now on labs is
almost unusable for non-ops users. Getting any simple change merged,
unless it's a top-priority thing, requires someone from ops and usually
takes at least a few hours, if not days. I can't imagine any sysadmin
who can work like this. Some changes need to be applied immediately;
you can't wait days for them to happen. So I expect that the vast
majority of projects that exist now will not use puppet anyway (you
just can't force people to use it under these circumstances), and so
they wouldn't benefit from this.

That is why I think that even if we are to use this puppet nrpe
management, there should still be a way to make manual adjustments,
not just because of these projects but also to fix other icinga issues.
For example, right now it receives some nonsense (broken) data from ldap
about instances that don't even exist anymore. If it weren't for that
nasty workaround, an instance ignore list that prevents these hosts
from being monitored, icinga would be full of hosts that are down. How
would you apply an i_dont_exist puppet class to a nonexistent
node? :P
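
To make the workaround concrete, here is a minimal sketch of that kind
of filtering. It is not the actual nagiosbuilder code: the function
name, the host names and the shape of the data are all made up for
illustration, and I am assuming the ignore list is just a list of
instance names.

```python
def filter_instances(ldap_hosts, ignore_list):
    """Return the hosts reported by LDAP that are not on the ignore list.

    Names are compared case-insensitively and the original order of
    ldap_hosts is preserved. Both arguments are plain lists of strings
    (an assumption; the real data may look different).
    """
    ignored = {name.strip().lower() for name in ignore_list if name.strip()}
    return [host for host in ldap_hosts if host.strip().lower() not in ignored]


if __name__ == "__main__":
    # Hypothetical instance names, for illustration only.
    hosts = ["bots-1", "bots-2", "deleted-instance"]
    print(filter_instances(hosts, ["deleted-instance"]))
```

The point is that this step has to live outside puppet, because there
is no node left to apply anything to.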

I have nothing against "labs cloning production", besides that IMHO it
should be the other way around (production should actually clone labs,
which is the testing env where changes should happen first before they
get deployed to production). Still, labs != production, so I think we
could have something extra here, existing on labs only and not on
production, that would make it easier for regular, non-ops people to
manage icinga.
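
On the problem Antoine raises in the quoted message below (finding out
which commands exist on an instance), one option might be a small
script on the instance itself that reads the files under
/etc/nagios/nrpe.d and reports the command names. The following is only
a sketch: it assumes the files use the usual nrpe
"command[name]=command line" syntax, and the function names are
hypothetical.

```python
import re
from pathlib import Path

# Matches nrpe command definitions such as:
#   command[check_dpkg]=/path/to/plugin --some-args
COMMAND_RE = re.compile(r"^\s*command\[([^\]]+)\]\s*=\s*(.+?)\s*$")


def parse_nrpe_commands(text):
    """Return {command_name: command_line} found in nrpe config text."""
    commands = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = COMMAND_RE.match(line)
        if match:
            commands[match.group(1)] = match.group(2)
    return commands


def list_nrpe_commands(directory="/etc/nagios/nrpe.d"):
    """Collect command definitions from every *.cfg file in directory."""
    commands = {}
    for path in sorted(Path(directory).glob("*.cfg")):
        commands.update(parse_nrpe_commands(path.read_text()))
    return commands


if __name__ == "__main__":
    for name in sorted(list_nrpe_commands()):
        print(name)
```

How that list would get back to the icinga server is a separate
question which this does not answer.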

On Tue, Feb 4, 2014 at 12:10 AM, Antoine Musso <hashar+wmf at free.fr> wrote:
> On 03/02/2014 19:32, Petr Bena wrote:
>> I think it's time to finally make it possible for users to create
>> their own checks for nagios (icinga).
>>
>> I will try to document how the current icinga is set up on
>> https://wikitech.wikimedia.org/wiki/Icinga/Labs
>>
>> Currently there is nagiosbuilder, a python script made by
>> Damian, which queries ldap and builds nagios cfg files based on that.
> <snip>
>
> Thank you Petan for resurrecting Icinga on labs :-]
>
> In production, the nrpe checks are being transitioned to use a define
> such as:
>
>   nrpe::monitor_service { 'jenkins':
>     description   => 'jenkins_service_running',
>     nrpe_command  => "/usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 --ereg-argument-array '^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war'"
>   }
>
> Whenever that is run on an instance, it will provision the nrpe command
> under /etc/nagios/nrpe.d. An example from the beta bastion:
>
> deployment-bastion$ ls -1 /etc/nagios/nrpe.d
> check_disk_space.cfg
> check_dpkg.cfg
> check_puppet_disabled.cfg
> check_raid.cfg
> $
>
> The challenge is finding out which commands are on the instances:
>
>  - I am pretty sure nrpe does not let you list available commands
>  - from LDAP you only have the role class and have no clue which defines
> have been run on the target instance
>
> :(
>
> --
> Antoine "hashar" Musso
>
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l


