Hello,
We currently have ganglia, which gives useful reporting about server status. What about notifications when something is about to go wrong (disk usage at 95%, lots of errors in memcached, too many slow queries, a server suddenly swapping...)?
Kate wrote servmon, which gave (gives?) useful information about servers, but I am personally too lazy to hack on that.
Recently I finally found a job, and part of my tasks is to set up a monitoring tool. My choice? Nagios. It's an open source monitoring tool that I set up on larousse some months ago. I asked avar and mark for their thoughts about having a monitoring tool; their answer was: sure!
So let's start with Nagios.
Nagios is still on larousse, although it is not running at the moment. I could easily upgrade it to the latest version (2.0b4) and tweak the config files to add the new servers (something like 60+ new friends).
We will have to choose a server to run nagios on. Larousse seems to be a good choice as it is mostly idling: it serves pages for http://noc.wikimedia.org/ and was used for servmon. Larousse could become THE monitoring device (and we could eventually move ganglia from zwinger to larousse).
The next step is to agree on a way to check services on the various hosts. There are several solutions for that:
1/ run a daemon on each server (nrpe), listening for queries from the monitoring host and giving back results.
2/ hack something that grabs data from gmetad, and add new metric plugins to ganglia. The good point is that we will then have that data showing in ganglia.
3/ make checks through ssh using a passwordless ssh key. I personally don't like that.
4/ deploy snmp everywhere
The nrpe approach needs a daemon set up on each server. The problem is that most of the data is already available through gmetad. The good point is that it is easy to set up (rpm -i nrpe, same config files and plugins for every server).
Reusing gmetad data is probably a better idea: the data in nagios and ganglia would be the same. One of the problems is that we will have to code a nagios plugin that caches the gmetad data to avoid multiple queries (we probably don't want to query gmetad for cpu, then for memory, then for nfs calls, then for each disk's space usage).
SNMP is a great tool for grabbing device status. Again, it's probably redundant with gmetad, but it will let us monitor network equipment such as the switches, our ISP router and probably the console switch.
cheers,
Ashar Voultoiz wrote:
<snip>
I would be VERY interested in nagios-ganglia integration. I recommend that you contact the nagios and ganglia teams about this. I think a lot of people have wanted this.
Ashar Voultoiz wrote:
Nagios is still on larousse, although it is not running at the moment. I could easily upgrade it to the latest version (2.0b4) and tweak the config files to add the new servers (something like 60+ new friends).
We will have to choose a server to run nagios on. Larousse seems to be a good choice as it is mostly idling: it serves pages for http://noc.wikimedia.org/ and was used for servmon. Larousse could become THE monitoring device (and we could eventually move ganglia from zwinger to larousse).
Yes. Although larousse is getting old, and the install running on it is too. We might want to do a reinstall before that.
Reusing gmetad data is probably a better idea: the data in nagios and ganglia would be the same. One of the problems is that we will have to code a nagios plugin that caches the gmetad data to avoid multiple queries (we probably don't want to query gmetad for cpu, then for memory, then for nfs calls, then for each disk's space usage).
I don't know ganglia too well, but this seems like the best option to investigate. If ganglia is flexible and uncomplicated enough to add new metrics easily, then this could certainly work.
SNMP is a great tool for grabbing device status. Again, it's probably redundant with gmetad, but it will let us monitor network equipment such as the switches, our ISP router and probably the console switch.
Can we use SNMP for devices that support it, and use ganglia for the rest?
In my experience, SNMP is nice and easy for things that the standard net-snmpd supports, but it gets nasty beyond that, i.e. if you want to add things yourself...
Mark Bergsma wrote: <snip>
Yes. Although larousse is getting old, and the install running on it is too. We might want to do a reinstall before that.
Fedora Core 2 actually. Should I ask Solar to upgrade it to FC3, then? :o)
Reusing gmetad data is probably a better idea: the data in nagios and ganglia would be the same. One of the problems is that we will have to code a nagios plugin that caches the gmetad data to avoid multiple queries (we probably don't want to query gmetad for cpu, then for memory, then for nfs calls, then for each disk's space usage).
I don't know ganglia too well, but this seems like the best option to investigate. If ganglia is flexible and uncomplicated enough to add new metrics easily, then this could certainly work.
Tim Starling added a metric for NFS server calls. So we can probably add some more stuff.
Can we use SNMP for devices that support it, and use ganglia for the rest?
In my experience, SNMP is nice and easy for things that the standard net-snmpd supports, but it gets nasty beyond that, i.e. if you want to add things yourself...
In my experience, adding new measures in snmp is easy. You have to add the script that returns the data, then it generates the OID accordingly. Debian has some nice examples in its snmpd.conf.
Can we start installing snmpd on all servers to at least get some basic data ? :o)
cheers,
Ashar Voultoiz wrote:
In my experience, adding new measures in snmp is easy. You have to add the script that returns the data, then it generates the OID accordingly.
It's basically the same thing with ganglia metrics.
Ivan Krstic wrote:
It's basically the same thing with ganglia metrics.
Good! Now I am RTFM :o)
Ashar Voultoiz wrote:
Fedora Core 2 actually. Should I ask Solar to upgrade it to FC3, then? :o)
It's not Solar doing that, and it's already scheduled. Problem is, we need to move stuff off it first.
In my experience, SNMP is nice and easy for things that the standard net-snmpd supports, but it gets nasty beyond that, i.e. if you want to add things yourself...
In my experience, adding new measures in snmp is easy. You have to add the script that returns the data, then it generates the OID accordingly. Debian has some nice examples in its snmpd.conf.
Yes, I have done so too, and I always found it a pain. The interface isn't exactly good/flexible.
Can we start installing snmpd on all servers to at least get some basic data ? :o)
That's exactly the same data ganglia is currently monitoring, so I don't really see the point...
Mark Bergsma wrote: <snip>
Can we start installing snmpd on all servers to at least get some basic data ? :o)
That's exactly the same data ganglia is currently monitoring, so I don't really see the point...
So let's write ganglia scripts :o)
If we want to monitor 15 services every minute, we will have to telnet to gmetad every 2 seconds. We could build a caching system though:
Check gmetad, cache the result for one minute, then have the nagios plugins grep the cache instead of telnetting to gmetad.
I think I have an idea about how to handle that.
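The caching idea above can be sketched roughly as follows. This is a minimal illustration in Python (not the Perl used later in the thread), and everything named here is hypothetical: `cached_xml`, the `fetch` callable that returns the gmetad dump as text, and the cache path are assumptions for the example, not part of any existing plugin.

```python
import os
import tempfile
import time

CACHE_TTL = 60  # seconds: "cache the result for one minute"


def cached_xml(cache_path, fetch, ttl=CACHE_TTL, now=None):
    """Return the gmetad dump from cache, calling fetch() only when it is stale.

    fetch is any callable returning the dump as a string (hypothetical hook).
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(cache_path)
    except OSError:
        age = None  # no cache file yet

    if age is None or age > ttl:
        data = fetch()
        # Write to a temporary file and rename it into place, so that a
        # check running at the same moment never reads a half-written dump.
        directory = os.path.dirname(cache_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as handle:
            handle.write(data)
        os.replace(tmp_path, cache_path)
        return data

    with open(cache_path) as handle:
        return handle.read()
```

With this shape, fifteen checks in the same minute would trigger a single query to gmetad and fourteen reads of the local file.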
Ashar Voultoiz wrote:
Mark Bergsma wrote:
<snip>
Can we start installing snmpd on all servers to at least get some basic data ? :o)
That's exactly the same data ganglia is currently monitoring, so I don't really see the point...
So let's write ganglia scripts :o)
If we want to monitor 15 services every minute, we will have to telnet to gmetad every 2 seconds. We could build a caching system though:
Check gmetad, cache the result for one minute, then have the nagios plugins grep the cache instead of telnetting to gmetad.
I think I have an idea about how to handle that.
I wrote a perl script a while back to poll the gmond XML output from one machine and stop or start a process on another machine based on the value of a metric retrieved. I didn't use telnet (ick), I read from a socket and then used an XPath module to find the metric in the XML. It's probably lying around in my home directory somewhere if you want to look at it.
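For illustration, the socket-plus-XML-lookup approach described above might look like this in Python. It is a sketch under assumptions, not the Perl script mentioned: the port 8649 default, the `HOST`/`METRIC` element names and the `NAME`/`VAL` attributes reflect the Ganglia XML dump as commonly documented and should be checked against the local setup; the function names are made up for the example.

```python
import socket
import xml.etree.ElementTree as ET


def read_xml_dump(host, port=8649, timeout=10.0):
    """Read the whole XML dump from the socket until the peer closes it.

    Port 8649 is assumed to be the gmond default; adjust as needed.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def find_metric(xml_bytes, host_name, metric_name):
    """Return the VAL attribute of one metric on one host, or None."""
    root = ET.fromstring(xml_bytes)
    for host in root.iter("HOST"):
        if host.get("NAME") != host_name:
            continue
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                return metric.get("VAL")
    return None
```

A caller would combine the two, for example `find_metric(read_xml_dump("some-host"), "srv1", "load_one")`, where both names are placeholders.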
If caching is required, then adding metrics to nagios is obviously not the same as adding metrics to ganglia. For ganglia, you run gmetric whenever a metric changes, so you can have a loop that sets 30 metrics in each pass if you like. You don't give it a plugin for it to invoke at its leisure, you make your own daemon.
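The "make your own daemon" model above could be sketched as a loop that pushes every metric through the `gmetric` command on each pass. This is only an illustration: the collector table, the function names and the 30-second interval are assumptions, and the `passes` and `run` parameters exist purely so the loop can be exercised without a real gmetric binary.

```python
import subprocess
import time


def gmetric_command(name, value, metric_type="uint32", units=""):
    """Build the gmetric command line for one metric."""
    cmd = ["gmetric", "--name", name, "--value", str(value), "--type", metric_type]
    if units:
        cmd += ["--units", units]
    return cmd


def run_daemon(collectors, interval=30, passes=None, run=subprocess.call):
    """Publish every metric in each pass, then sleep.

    collectors maps a metric name to (callable, type, units).
    passes=None loops forever; a number stops after that many passes.
    """
    done = 0
    while passes is None or done < passes:
        for name, (collect, metric_type, units) in collectors.items():
            run(gmetric_command(name, collect(), metric_type, units))
        done += 1
        time.sleep(interval)
```

The contrast with nagios is visible in the control flow: here the daemon decides when values are sent, whereas a nagios plugin is invoked by the scheduler and must answer on demand.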
-- Tim Starling
Tim Starling wrote: <snip>
I wrote a perl script a while back to poll the gmond XML output from one machine and stop or start a process on another machine based on the value of a metric retrieved. I didn't use telnet (ick), I read from a socket and then used an XPath module to find the metric in the XML. It's probably lying around in my home directory somewhere if you want to look at it.
Hello,
I have written a little perl plugin for nagios that lets us grab a given metric for a given host. The plugin implements caching of the gmetad data and of the parsed XML, and handles warning / critical thresholds.
I have also given a little configuration example for the nagios checkcommands.cfg.
Files are in /home/hashar/gmeta-nagios/ :
cbg_commands.cfg     The nagios configuration for the plugin
check_by_gmetad.pl   The plugin itself, uses perl
gmetad-cache.stor    XML parse cache
gmetad-cache.xml     XML grabbed from the gmetad host
So now it is just waiting for the FC3 upgrade of larousse and a nagios compile.
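As a rough idea of what warning / critical threshold handling involves, here is a small Python sketch. It does not reproduce check_by_gmetad.pl; the function name, the "higher is worse" comparison and the message format are assumptions. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard nagios plugin return codes.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_metric(name, raw_value, warn, crit):
    """Map a metric value to a nagios exit code and a one-line status message.

    Assumes higher values are worse (e.g. disk usage in percent).
    """
    if raw_value is None:
        return UNKNOWN, f"UNKNOWN - metric {name} not found"
    try:
        value = float(raw_value)
    except ValueError:
        return UNKNOWN, f"UNKNOWN - metric {name} is not numeric: {raw_value!r}"

    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"{LABELS[code]} - {name} = {value} (warn {warn}, crit {crit})"
```

A real plugin would print the message and exit with the code, for instance `code, message = check_metric("disk_used_pct", "96", 90, 95)` followed by `print(message)` and `sys.exit(code)`; the metric name here is a placeholder.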
If caching is required, then adding metrics to nagios is obviously not the same as adding metrics to ganglia. For ganglia, you run gmetric whenever a metric changes, so you can have a loop that sets 30 metrics in each pass if you like. You don't give it a plugin for it to invoke at its leisure, you make your own daemon.
I will set up a basic nagios installation first, then we can work on metrics. Memcached instances / errors might be interesting as well as mysql replication lag for slaves.
cheers,
wikitech-l@lists.wikimedia.org