[Labs-l] nagios monitoring via snmp / puppet freshness checks / ec2-metadata

Sat Mar 31 08:41:21 UTC 2012

Hi,

you may have noticed that Labs Nagios reports "Puppet freshness" as
CRIT for all instances.

First a little background on that. If you want to skip the technical
details, the one remaining problem is described at the end.

In production these checks are implemented as "passive checks" (stuff
gets reported TO Nagios instead of Nagios asking remote hosts).

Passive checks, while more complex to set up, have the general
advantage that Nagios just needs to passively sit there and receive
results from hosts instead of opening connections to all hosts all
the time. This can be implemented e.g. via the NSCA (Nagios Service
Check Acceptor) daemon or via SNMP.

While we currently use both methods in production, the puppet
freshness check is implemented via SNMP traps.

On the client (instance) side we have this, puppetized in base.pp:

exec { "puppet snmp trap":
.. command => "snmptrap -v 1  ..etc...

This lets all puppet agents execute snmptrap after a puppet run.
snmptrap is called with arguments including the SNMP community string
"public", an SNMP OID, and the Nagios hostname, and actively sends the
trap out to Nagios.

One of the reasons for this to fail was the hostname being hardcoded
to "nagios.wikimedia.org".

So in base.pp I added an 'if $realm == "labs"' and turned the hostname
into ${nagios_host}, which is set to just "nagios" for labs. After that
I could see incoming traps on the Nagios host, using "tcpdump port 162".
(gerrit change 3988)

On the server (Nagios) side there are (snmpd), snmptrapd and snmptt. The
configs we use for this are in /etc/snmp/ (/files/snmp in puppet).
snmptrapd is the one listening for the incoming traps; it is configured
to then call "snmptt" as the "traphandle default".
snmptt then uses "EXEC
/usr/local/nagios/libexec/eventhandlers/submit_check_result".

submit_check_result is a Nagios command that "fakes" a check result on
the Nagios host itself. It finally writes to the "nagios.cmd" command
file, which is a named pipe.
Once Nagios sees this coming in, you can see "PASSIVE SERVICE CHECK"
result lines in tail -f nagios.log.
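
To illustrate what ends up in the command file: Nagios accepts passive
service results as external commands of the form
PROCESS_SERVICE_CHECK_RESULT. The following is only a minimal sketch of
that line format, not a copy of our submit_check_result; the function
name and the example values are made up for illustration.

```shell
#!/bin/sh
# Format one passive service check result in the external-command syntax
# that Nagios reads from its command file.
# Arguments: epoch timestamp, host name, service description,
# return code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN), plugin output.
# (Hypothetical helper; the real script may build the line differently.)
format_passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Example values only; the timestamp would normally come from `date +%s`.
format_passive_result 1333183281 venus "Puppet freshness" 0 "puppet ran"
```

A real submission would append this line to the nagios.cmd pipe rather
than print it.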

The next problem was that the path to this Nagios command file differed
from production. In ./eventhandlers/submit_check_result, I changed the
CommandFile path (/var/log/nagios in prod vs. /var/lib/nagios3 in
labs). I would like it if we could use the same paths as in
production for the Nagios configs to avoid these manual fixes.

But this wasn't it yet, so I compared the running snmp* processes to
production. Though snmptt was running fine, it turned out snmptrapd was
either not running or running with different options; I am not 100%
sure anymore. Anyway, once I started it as seen on spence:
/usr/sbin/snmptrapd -On -Lsd -p /var/run/snmptrapd.pid
I could finally see incoming check results in nagios.log.
(Petan, thanks for setting those up, but maybe you want to check for
those options. I just started that _manually_, but we should test how
it looks after a reboot.)
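
For the after-reboot test, something like the following could compare a
snmptrapd command line against the options seen on spence. This is just
a sketch; the function name is made up, and how you obtain the command
line (the ps call in the comment) is an assumption about the platform.

```shell
#!/bin/sh
# Return 0 if the given command line contains every expected option,
# otherwise print the first missing one and return 1.
# The expected options are the ones quoted above from spence.
has_expected_opts() {
    cmdline=$1
    for opt in "-On" "-Lsd" "-p /var/run/snmptrapd.pid"; do
        case "$cmdline" in
            *"$opt"*) ;;
            *) printf 'missing: %s\n' "$opt"; return 1 ;;
        esac
    done
    return 0
}

# On a live host the command line could be taken from something like
#   ps -o args= -C snmptrapd
# (procps syntax; an assumption, not tested on the labs Nagios host).
has_expected_opts "/usr/sbin/snmptrapd -On -Lsd -p /var/run/snmptrapd.pid" \
    && echo "options match"
```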

Now there is just a tiny problem left :P  The hostnames do not match. So
Nagios gets all the results, but in nagios.log you will see these:

Warning:  Passive check result was received for service 'Puppet
freshness' on host 'i-000000f8', but the host could not be found

This is why: The full command the instances use to send out the traps
is: command => "snmptrap -v 1 -c public ${nagios_host}
.1.3.6.1.4.1.33298 `hostname` 6 1004 `uptime | awk '{
split(\$3,a,\":\"); print (a[1]*60+a[2])*60 }'`",
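
To make the pieces of that command easier to see, here is a small
sketch that prints (but does not send) the trap command with a sample
uptime value. With -v 1, the positional arguments after the target host
are the enterprise OID, the agent name, the generic trap type
(6 = enterprise specific), the specific trap number and the uptime
value. The function names and the sample uptime line are made up for
illustration. Note that the awk expression assumes the "up H:MM,"
layout; with other layouts (e.g. "up 5 days, ...") the third field is
not an H:MM pair.

```shell
#!/bin/sh
# Convert the third field of `uptime` output ("H:MM,") to seconds,
# using the same awk expression as the trap command above
# (without the Puppet-level backslash escaping).
uptime_field_to_seconds() {
    awk '{ split($3, a, ":"); print (a[1]*60 + a[2])*60 }'
}

# Print (not run) the trap command. Arguments: Nagios host, agent name,
# uptime value. Nothing is sent on the network.
show_trap_command() {
    printf 'snmptrap -v 1 -c public %s .1.3.6.1.4.1.33298 %s 6 1004 %s\n' \
        "$1" "$2" "$3"
}

# Sample uptime line, made up for illustration: 3 hours 42 minutes.
sample=" 08:41:21 up  3:42,  2 users,  load average: 0.00, 0.01, 0.05"
secs=$(echo "$sample" | uptime_field_to_seconds)
show_trap_command nagios i-000000f8 "$secs"
```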

See how `hostname` is being used in there. This simply works in
production because production hosts return the same string for
hostname that Nagios uses to define the hosts it knows about. On labs
though, hostname returns the resource name (e.g. i-000000f8), while
Nagios uses the "nice" instance names (e.g. "venus", "wikistats-01"
etc.).
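
The mismatch can be pictured with a toy lookup. This is not how Nagios
is implemented, just an illustration of an exact-name match against the
configured hosts; the host list and function name are examples.

```shell
#!/bin/sh
# Hosts as Nagios would know them in labs (names taken from the text above).
KNOWN_HOSTS="venus wikistats-01"

# Return 0 if the reported name exactly matches a known host.
host_is_known() {
    for h in $KNOWN_HOSTS; do
        [ "$h" = "$1" ] && return 0
    done
    return 1
}

# The instance name is accepted, the resource name is not.
for name in venus i-000000f8; do
    if host_is_known "$name"; then
        echo "$name: accepted"
    else
        echo "$name: host could not be found"
    fi
done
```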

So the options were: Give me a command that returns the instance name
on an instance itself (as opposed to asking the controller) OR change
Nagios to use the resource names as hostnames. Since I don't think we
really want Nagios to report that "i-000000f8 is DOWN", I tried adding
the other name as an alias to a Nagios host definition. This didn't
work either though; Nagios does not appear to match against the host
aliases here.

So when trying to find out if it is even possible for an instance to
know its own instance name with a local command, Andrew Bogott
pointed me to this (thanks! :):

http://aws.amazon.com/code/1825 (EC2 Instance Metadata Query Tool),
quoting Andrew: "labs runs on openstack which is theoretically
API-compatible with Amazon's EC2.  Hence that being an amazon page."

That looked really promising so I tested it on an instance, and indeed
it does work and can return all kinds of info.
Try "./ec2-metadata --all" after just wget'ing it and making it executable.

Among these are:

instance-id: i-000000ea
local-hostname: i-000000ea
public-hostname: i-000000ea
public-ipv4: 208.80.153.223

but unfortunately I still don't see the "hostname" we want. :/
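
If you want to pull a single field out of that output in a script, a
sketch like this works on the "key: value" lines shown above. The
function name is made up, and the sample is just the listing from this
mail, not live output.

```shell
#!/bin/sh
# Extract one value from "key: value" lines such as those printed by
# ec2-metadata. Reads the lines on stdin; the key is the first argument.
metadata_value() {
    awk -F': ' -v key="$1" '$1 == key { print $2 }'
}

# Sample output copied from the listing above.
sample="instance-id: i-000000ea
local-hostname: i-000000ea
public-hostname: i-000000ea
public-ipv4: 208.80.153.223"

# Prints the resource name again, not the "nice" instance name.
printf '%s\n' "$sample" | metadata_value local-hostname
```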

So if you have an idea how to get the right, nice hostname from the
instance itself, please tell me about it, or feel free to just add the
final fix in:

base.pp (test branch) lines 93 - 100. It needs to keep using
`hostname` in production, replaced by _something_ else if $realm is
labs.
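
In shell terms, the shape of the needed fix would be roughly the
following. This is only a sketch of the logic, not the Puppet change
itself: labs_instance_name is a hypothetical placeholder for the piece
that is still missing, and INSTANCE_NAME is an invented variable that
stands in for whatever source we end up finding.

```shell
#!/bin/sh
# Placeholder for the still-missing piece: some local way to obtain the
# "nice" instance name in labs. Hypothetical; here it only echoes a
# variable that the caller must set, and fails loudly if it is unset.
labs_instance_name() {
    echo "${INSTANCE_NAME:?no known local source for the instance name yet}"
}

# Choose the name to report to Nagios, depending on the realm:
# labs uses the placeholder, everything else keeps using `hostname`.
trap_agent_name() {
    if [ "$1" = "labs" ]; then
        labs_instance_name
    else
        hostname
    fi
}
```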

Regards,

-- 
Daniel Zahn <dzahn at wikimedia.org>