Hello,
We currently have ganglia, which gives useful reports on server status. What about notifications when something is about to go wrong (disk usage at 95%, lots of errors in memcached, too many slow queries, a server suddenly swapping...)?
Kate wrote servmon, which gave (gives?) useful information about the servers, but I am personally too lazy to hack on it.
Recently I finally found a job, and part of my tasks is to set up a monitoring tool. My choice? Nagios. It's an open source monitoring tool that I set up on larousse some months ago. I asked avar and mark for their thoughts about having a monitoring tool; their answer was: sure!
So let's start with Nagios.
Nagios is still on larousse, although it is not running at the moment. I could easily upgrade it to the latest version (2.0b4) and tweak the config files to add the new servers (something like 60+ new friends).
We will have to choose a server to run Nagios on. Larousse seems to be a good choice as it is mostly idle, serves pages for http://noc.wikimedia.org/ and was used for servmon. Larousse could become THE monitoring device (and ganglia could eventually move from zwinger to larousse).
The next step is to agree on a way to check services on the various hosts. There are several solutions for that:
1/ run a daemon (nrpe) on each server, listening for queries from the monitoring host and sending back the results.
2/ hack something that grabs data from gmetad, and add new metric plugins to ganglia. The good point is that we would then have that data showing in ganglia as well.
3/ run checks through ssh using a passwordless ssh key. I personally don't like that.
4/ deploy snmp everywhere.
The nrpe approach needs a daemon set up on each server. The problem is that most of the data is already available through gmetad. The good point is that it is easy to set up (rpm -i nrpe, same config files and plugins for every server).
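
To give an idea of what such a plugin looks like, here is a minimal sketch in Python of a disk usage check. It only relies on the usual Nagios plugin convention (one line of output, exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The function names and the 90/95% thresholds are my own assumptions for illustration, not an existing plugin.

```python
import shutil

# Exit codes follow the standard Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(percent_used, warn=90.0, crit=95.0):
    """Map a usage percentage to a Nagios status code (thresholds are assumptions)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=90.0, crit=95.0):
    """Return (status, message) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        # Some pseudo filesystems report a size of zero.
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    percent = 100.0 * usage.used / usage.total
    status = classify(percent, warn, crit)
    return status, f"DISK {LABELS[status]} - {path} at {percent:.1f}% used"


def main(argv):
    """Print the one-line status and return the exit code for Nagios."""
    path = argv[1] if len(argv) > 1 else "/"
    status, message = check_disk(path)
    print(message)
    return status
```

Saved as a script, it would end with something like sys.exit(main(sys.argv)) so that nrpe (or any other transport) can read the exit code.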
Reusing gmetad data is probably a better idea, since the data in Nagios and ganglia would be the same. One of the problems is that we will have to code a Nagios plugin that caches the gmetad data to avoid multiple queries (we probably don't want to query gmetad for CPU, then for memory, then for NFS calls, then for each disk's space usage).
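
The caching part could be sketched roughly like this in Python. It assumes gmetad dumps its XML on its default port (8651) as soon as you connect, and that the dump contains HOST elements with METRIC children carrying NAME and VAL attributes. The cache path, the 60 second TTL and all the function names are hypothetical.

```python
import os
import socket
import time
import xml.etree.ElementTree as ET

CACHE_FILE = "/tmp/gmetad-cache.xml"  # hypothetical location
CACHE_TTL = 60                        # seconds; an assumption, tune to the check interval


def fetch_gmetad_xml(host="localhost", port=8651, timeout=10):
    """Read the whole XML dump that gmetad sends when a client connects."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def cached_xml(fetch=fetch_gmetad_xml, cache_file=CACHE_FILE, ttl=CACHE_TTL):
    """Return the XML dump, querying gmetad only when the cache is older than ttl."""
    try:
        if time.time() - os.path.getmtime(cache_file) < ttl:
            with open(cache_file, "rb") as fh:
                return fh.read()
    except OSError:
        pass  # no usable cache yet
    data = fetch()
    tmp = f"{cache_file}.{os.getpid()}.tmp"
    with open(tmp, "wb") as fh:
        fh.write(data)
    # Rename into place so concurrent checks never read a half-written file.
    os.replace(tmp, cache_file)
    return data


def metric_value(xml_bytes, host, metric):
    """Return the VAL string of one metric for one host, or None if absent."""
    root = ET.fromstring(xml_bytes)
    for host_el in root.iter("HOST"):
        if host_el.get("NAME") == host:
            for metric_el in host_el.iter("METRIC"):
                if metric_el.get("NAME") == metric:
                    return metric_el.get("VAL")
    return None
```

Each Nagios check would then call metric_value(cached_xml(), host, metric) and compare the result to its thresholds, so a round of checks against all hosts costs one gmetad query per TTL instead of one per check.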
SNMP is a great tool for grabbing device status. Again, it's probably redundant with gmetad, but it will let us monitor network equipment such as the switches, our ISP router and probably the console switch.
cheers,