Hello,
We currently have ganglia, which gives useful reporting about server
status. What about notifications when something is about to go wrong
(disk usage at 95%, lots of errors in memcached, too many slow queries,
server suddenly swapping...)?
Kate wrote servmon, which gave (gives?) useful information about the
servers, but I am personally too lazy to hack on that.
Recently I finally found a job, and part of my tasks is to set up a
monitoring tool. My choice? Nagios. It's an open source monitoring tool
that I set up on larousse some months ago. I asked avar and mark for
their thoughts about having a monitoring tool; their answer was: sure!
So let's start with Nagios.
Nagios is still on larousse, although it is not running at the moment. I
could easily upgrade it to the latest version (2.0b4) and tweak the
config files to add the new servers (something like 60+ new friends).
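Adding 60+ hosts by hand is tedious, so the host entries could be
generated. Below is a minimal Python sketch, not a tested deployment
script. The host names and addresses are made up for illustration, and
the "generic-host" template is assumed to exist in the Nagios config
(it is the name used in the sample config; ours may differ).

```python
def host_definition(name, address, template="generic-host"):
    """Render one Nagios host object definition as text.

    'template' is an assumption: it must match a host template that is
    already defined in the Nagios configuration.
    """
    return (
        "define host{\n"
        "    use        %s\n"
        "    host_name  %s\n"
        "    alias      %s\n"
        "    address    %s\n"
        "    }\n" % (template, name, name, address)
    )


def render_hosts(hosts):
    """Render all hosts, sorted by name so the output is stable."""
    return "\n".join(
        host_definition(name, address)
        for name, address in sorted(hosts.items())
    )


if __name__ == "__main__":
    # Hypothetical hosts; replace with the real server list.
    example_hosts = {"srv2": "192.0.2.12", "srv1": "192.0.2.11"}
    print(render_hosts(example_hosts))
```

The output can be redirected to a file that the main Nagios config
includes, so adding a server only means adding one entry to the list.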
We will have to choose a server to run nagios on. Larousse seems to be a
good choice as it is mostly idling, serves pages for
http://noc.wikimedia.org/ and was used for servmon. Larousse could
become THE monitoring device (and we could eventually move ganglia from
zwinger to larousse).
The next step is to agree on a way to check services on the various
hosts. There are several solutions for that:
1/ run a daemon on each server (nrpe), listening to queries from the
monitoring host and giving back results.
2/ hack something that grabs data from gmetad and add new metric plugins
to ganglia. The good point is that we will then have that data showing
in ganglia.
3/ make checks through ssh using a passwordless ssh key. I personally
don't like that.
4/ deploy snmp everywhere
The nrpe approach needs a daemon to be set up on each server. The
problem: most of the data is already available through gmetad. The good
point is that it is easy to set up (rpm -i nrpe, same config files and
plugins for every server).
Reusing gmetad data is probably a better idea, since the data in nagios
and ganglia would then be the same. One of the problems is that we will
have to code a nagios plugin that caches the gmetad data to avoid
multiple queries (we probably don't want to query gmetad for cpu, then
for memory, then for nfs calls, then for each disk's space usage).
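To make the caching idea concrete, here is a rough Python sketch of
such a plugin, not a finished one. It assumes gmetad dumps its XML on
its default port 8651 when you connect, that the dump has HOST and
METRIC elements with NAME and VAL attributes, and that one cache file
shared by all checks is acceptable. The function names, the cache path
and the 60 second lifetime are all made up for illustration.

```python
import os
import socket
import time
import xml.etree.ElementTree as ET

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def fetch_gmetad_xml(host="localhost", port=8651, timeout=10):
    """Read the whole XML dump gmetad sends when a client connects."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def cached_xml(cache_path, max_age=60, fetch=fetch_gmetad_xml):
    """Return the cached dump if it is fresh enough, else refetch it."""
    try:
        if time.time() - os.path.getmtime(cache_path) < max_age:
            with open(cache_path, "rb") as f:
                return f.read()
    except OSError:
        pass  # no cache file yet, fall through and fetch
    data = fetch()
    tmp_path = "%s.%d.tmp" % (cache_path, os.getpid())
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, cache_path)  # atomic swap, readers never see half a file
    return data


def metric_value(xml_bytes, hostname, metric):
    """Look up one metric for one host; None if either is missing."""
    root = ET.fromstring(xml_bytes)
    for host in root.iter("HOST"):
        if host.get("NAME") == hostname:
            for m in host.iter("METRIC"):
                if m.get("NAME") == metric:
                    return float(m.get("VAL"))
    return None


def check(value, warn, crit):
    """Map a value to a Nagios status (higher value means worse)."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK
```

With this layout every nagios check (cpu, memory, each disk) calls
cached_xml() with the same path, so gmetad is queried at most once per
max_age seconds no matter how many checks run.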
SNMP is a great tool for grabbing device status. Again, it's probably
redundant with gmetad, but it will let us monitor network equipment such
as the switches, our ISP router and probably the console switch.
cheers,
--
Ashar Voultoiz - WP++++
http://en.wikipedia.org/wiki/User:Hashar
http://www.livejournal.com/community/wikitech/
IM: hashar(a)jabber.org ICQ: 15325080