[Labs-l] nagios monitoring via snmp / puppet freshness checks / ec2-metadata

Sat Mar 31 08:46:08 UTC 2012

I will fix it :P

On Sat, Mar 31, 2012 at 10:41 AM, Daniel Zahn <dzahn at wikimedia.org> wrote:
> Hi,
>
> you may have noticed that Labs Nagios reports "Puppet freshness" as
> CRIT for all instances.
>
> First a little background on that; the one problem that is still open
> is described at the end, if you want to skip the technical details.
>
> In production these checks are implemented as "passive checks" (stuff
> gets reported TO Nagios instead of Nagios asking remote hosts).
>
> Passive checks, while more complex to set up, have the general
> advantage that Nagios just needs to passively sit there and receive
> results from hosts instead of opening connections to all hosts all
> the time. This can be implemented e.g. via the NSCA (Nagios Service
> Check Acceptor) daemon or via snmp.
>
> While we currently use both methods in production, the puppet
> freshness check is implemented via snmp-traps.
>
> On the client (instance) side we have this, puppetized in base.pp
>
> exec { "puppet snmp trap":
> .. command => "snmptrap -v 1  ..etc...
>
> This makes every puppet agent execute snmptrap after a puppet run.
> snmptrap takes arguments including the snmp community string "public",
> an snmp OID, and the Nagios hostname, and actively sends the trap to
> Nagios.
>
> One of the reasons for this to fail was the hostname being hardcoded
> to "nagios.wikimedia.org".
>
> So in base.pp I added an 'if $realm == "labs"' and turned the hostname
> into ${nagios_host}, which is set to just "nagios" for labs. After that
> I could see incoming traps on the Nagios host, using "tcpdump port 162"
> (gerrit change 3988).
>
> On the server (Nagios) side there are (snmpd), snmptrapd and snmptt. The
> configs we use for this are in /etc/snmp/ (/files/snmp in puppet).
> snmptrapd is the one listening for incoming traps; it is configured
> to then call "snmptt" as the "traphandle default".
> snmptt then uses "EXEC
> /usr/local/nagios/libexec/eventhandlers/submit_check_result".
>
> submit_check_result is a Nagios command that "fakes" a check result on
> the Nagios host itself; it finally writes to the "nagios.cmd" command
> file, which is a named pipe.
> Once Nagios sees this coming in, you can see "PASSIVE SERVICE CHECK"
> result lines in tail -f nagios.log.
>
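For readers who have not seen the command file before: a passive result is a single line in the Nagios external command format, "[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output". Below is a minimal shell sketch of the line a script like submit_check_result has to produce. The function name and the COMMAND_FILE variable are made up for illustration and are not copied from the production script.

```shell
#!/bin/sh
# Sketch only: format one passive service check result in the Nagios
# external command syntax. format_passive_result is a hypothetical name.
# Args: timestamp host service return_code plugin_output
format_passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Example (not executed here): append a result to the command pipe.
# COMMAND_FILE is an assumed variable pointing at nagios.cmd.
# format_passive_result "$(date +%s)" venus "Puppet freshness" 0 "ok" \
#     >> "$COMMAND_FILE"
```

Return code 0 means OK, 1 WARNING, 2 CRITICAL and 3 UNKNOWN, the same as for active plugin checks.
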
> The next problem was that the path to this Nagios command file differed
> from production. In ./eventhandlers/submit_check_result, I changed the
> CommandFile path (/var/log/nagios in prod vs. /var/lib/nagios3 in
> labs). I would like it if we could use the same paths as in
> production for the Nagios configs to avoid these manual fixes.
>
> But this wasn't it yet, so I compared the running snmp* processes to
> production. Though snmptt was running fine, it turned out snmptrapd
> was either not running or running with different options; I am not
> 100% sure anymore. Anyway, once I started it as seen on spence:
> /usr/sbin/snmptrapd -On -Lsd -p /var/run/snmptrapd.pid
> I could finally see incoming check results in nagios.log.
> (Petan, thanks for setting those up, but maybe you wanna check for
> those options. I just started that _manually_, so we should test how
> it looks after a reboot.)
>
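To check after a reboot whether snmptrapd came back with the options used on spence, something like the following could be run on the labs Nagios host. This is only a sketch: has_trapd_opts is a made-up helper, and it looks for nothing more than the three options quoted above.

```shell
#!/bin/sh
# Sketch: succeed if a snmptrapd command line contains the options
# that were used on spence (-On, -Lsd and -p <pidfile>).
# has_trapd_opts is a hypothetical helper name.
has_trapd_opts() {
    case " $1 " in *" -On "*) ;; *) return 1 ;; esac
    case " $1 " in *" -Lsd "*) ;; *) return 1 ;; esac
    case " $1 " in *" -p /var/run/snmptrapd.pid "*) ;; *) return 1 ;; esac
    return 0
}

# Usage on a live host (ps -C is procps/Linux specific):
# has_trapd_opts "$(ps -o args= -C snmptrapd)" && echo "options look right"
```
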
> Now there is just a tiny problem left :P  The hostnames mismatch. So
> Nagios gets all the results, but in nagios.log you will see these:
>
> Warning:  Passive check result was received for service 'Puppet
> freshness' on host 'i-000000f8', but the host could not be found
>
> This is why: The full command the instances use to send out the traps
> is: command => "snmptrap -v 1 -c public ${nagios_host}
> .1.3.6.1.4.1.33298 `hostname` 6 1004 `uptime | awk '{
> split(\$3,a,\":\"); print (a[1]*60+a[2])*60 }'`",
>
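Broken down, the snmptrap -v 1 arguments after the community string are: destination host, enterprise OID, agent name, generic trap type (6, enterprise specific), specific trap number (1004) and uptime. Here is a small shell sketch that makes the two realm-dependent pieces explicit. trap_command is a made-up function that only prints the command instead of sending anything, and its first argument stands in for puppet's $realm.

```shell
#!/bin/sh
# Sketch: print the snmptrap call for a given realm. Nothing is sent.
# trap_command is a hypothetical helper; args: realm agent_name uptime
trap_command() {
    realm=$1
    agent=$2
    up=$3
    if [ "$realm" = "labs" ]; then
        nagios_host="nagios"
    else
        nagios_host="nagios.wikimedia.org"
    fi
    # -v 1 order: host, enterprise OID, agent, generic, specific, uptime
    echo "snmptrap -v 1 -c public $nagios_host .1.3.6.1.4.1.33298 $agent 6 1004 $up"
}
```

Calling "trap_command labs i-000000f8 8100" shows the problem discussed next: the agent field carries whatever name the instance reports, and that is the name Nagios then looks up.
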
>
> See how `hostname` is being used in there. This simply works in
> production because production hosts return the same string for
> hostname that Nagios uses to define the hosts it knows about. On labs
> though, hostname returns the resource name (e.g. i-000000f8), while
> Nagios uses the "nice" instance names (e.g. "venus", "wikistats-01"
> etc.).
>
> So the options were: find a command that returns the instance name
> on the instance itself (as opposed to asking the controller) OR change
> Nagios to use the resource names as hostnames. Since I don't think we
> really want Nagios to report that "i-000000f8 is DOWN", I tried adding
> the other name as an alias to a Nagios host definition. This didn't
> work either though; Nagios does not appear to match against the host
> aliases here.
>
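One possible workaround, until an instance can find out its own name, would be to translate names on the Nagios side before the result is submitted. The sketch below assumes a hand-maintained map file with one "resource-name instance-name" pair per line. The file, its format and the function name are all hypothetical; nothing in this thread says such a file exists.

```shell
#!/bin/sh
# Sketch: map a resource name (e.g. i-000000f8) to the name Nagios knows.
# Assumes a hypothetical map file with "resource instance" per line.
# Falls back to printing the input unchanged if no mapping is found.
# Args: resource_name map_file
nagios_name() {
    awk -v r="$1" '$1 == r { print $2; found = 1; exit }
                   END { if (!found) print r }' "$2"
}
```

The obvious drawback is that the map has to be kept in sync with the instance list by hand, so this is a stopgap rather than the real fix.
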
> So when trying to find out if it is even possible for an instance to
> know its own instance name with a local command, Andrew Bogott
> pointed me to this (thanks! :)
>
> http://aws.amazon.com/code/1825 (EC2 Instance Metadata Query Tool).
> Quoting Andrew: "labs runs on openstack which is theoretically
> API-compatible with Amazon's EC2.  Hence that being an amazon page."
>
> That looked really promising so I tested it on an instance, and indeed
> it does work and can return all kinds of info.
> Try "./ec2-metadata --all" after just wget'ing it and making it
> executable.
>
> Among these are:
>
> instance-id: i-000000ea
> local-hostname: i-000000ea
> public-hostname: i-000000ea
> public-ipv4: 208.80.153.223
>
> but unfortunately I still don't see the "hostname" we want. :/
>
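For scripting, a single field can be picked out of that "key: value" output with a little awk. metadata_field below is a made-up helper, meant to be fed lines like the sample ones above rather than anything beyond what the tool already prints.

```shell
#!/bin/sh
# Sketch: read "key: value" lines (as printed by ec2-metadata --all)
# on stdin and print the value for one key.
# metadata_field is a hypothetical helper name.
metadata_field() {
    awk -F': ' -v k="$1" '$1 == k { print $2; exit }'
}

# Usage on an instance:
# ./ec2-metadata --all | metadata_field instance-id
```

This only reshapes what the tool returns; it does not solve the missing "nice" name, since none of the listed fields carries it.
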
> So if you have an idea how to get that nice hostname from the
> instance itself, please tell me about it, or feel free to just add the
> final fix in:
>
> base.pp (test branch) lines 93 - 100. It needs to keep using
> `hostname` in production, but replace it with _something_ else if
> $realm is labs.
>
> Regards,
>
> --
> Daniel Zahn <dzahn at wikimedia.org>
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l