cluster monitoring


Ashar Voultoiz

10 Nov 2005

8:16 p.m.

Hello,

We currently have ganglia, which gives useful reporting on server status. What about notifications when something is about to go wrong (disk usage at 95%, lots of errors in memcached, too many slow queries, a server suddenly swapping...)?

Kate wrote servmon, which gave (gives?) useful information about the servers, but I am personally too lazy to hack on it.

I recently found a job at last, and part of my work is to set up a monitoring tool. My choice? Nagios. It's an open source monitoring tool that I set up on larousse some months ago. I asked avar and mark for their thoughts about having a monitoring tool; their answer was: sure!

So let's start with Nagios.

Nagios is still on larousse, although it is not running at the moment. I could easily upgrade it to the latest version (2.0b4) and tweak the config files to add the new servers (something like 60+ new friends).

We will have to choose a server to run nagios on. Larousse seems to be a good choice, as it is mostly idle: it serves pages for http://noc.wikimedia.org/ and was used for servmon. Larousse could become THE monitoring device (and we could eventually move ganglia from zwinger to larousse).

The next step is to agree on a way to check services on the various hosts. There are several solutions for that:

1/ run a daemon on each server (nrpe) that listens for queries from the monitoring host and returns the results.

2/ hack something that grabs data from gmetad, and add new metric plugins to ganglia. The good point is that we will then have that data showing in ganglia.

3/ make checks through ssh using a passwordless ssh key. I personally don't like that.

4/ deploy snmp everywhere

The nrpe approach requires setting up a daemon on each server. The problem is that most of the data is already available through gmetad. The good point is that it is easy to set up (rpm -i nrpe, with the same config files and plugins on every server).

Reusing gmetad data is probably a better idea: the data in nagios and ganglia would be the same. One problem is that we will have to code a nagios plugin that caches the gmetad data to avoid multiple queries (we probably don't want to query gmetad for CPU, then for memory, then for NFS calls, then for each disk's space usage).
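To illustrate the "one query, many metrics" idea, here is a minimal Python sketch (the scripts discussed in this thread were Perl; Python is used here only for illustration). It assumes the usual Ganglia XML layout of HOST elements containing METRIC elements with NAME and VAL attributes. The sample document, host names and values are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample in the shape of a Ganglia XML dump (hosts and values are made up).
SAMPLE_XML = """<GANGLIA_XML SOURCE="gmetad">
 <GRID NAME="example">
  <CLUSTER NAME="example-cluster">
   <HOST NAME="srv1" IP="10.0.0.1">
    <METRIC NAME="cpu_idle" VAL="91.5" TYPE="float" UNITS="%"/>
    <METRIC NAME="mem_free" VAL="204800" TYPE="uint32" UNITS="KB"/>
    <METRIC NAME="disk_free" VAL="12.3" TYPE="double" UNITS="GB"/>
   </HOST>
   <HOST NAME="srv2" IP="10.0.0.2">
    <METRIC NAME="cpu_idle" VAL="12.0" TYPE="float" UNITS="%"/>
   </HOST>
  </CLUSTER>
 </GRID>
</GANGLIA_XML>"""


def index_metrics(xml_text):
    """Parse one dump and return {host: {metric: value_string}}."""
    root = ET.fromstring(xml_text)
    table = {}
    for host in root.iter("HOST"):
        metrics = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        table[host.get("NAME")] = metrics
    return table


# One parse answers every per-host, per-metric question afterwards.
table = index_metrics(SAMPLE_XML)
for name in ("cpu_idle", "mem_free", "disk_free"):
    print(name, table["srv1"].get(name))
```

With such a table in hand, each check becomes a dictionary lookup rather than a new query to gmetad.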

SNMP is a great tool for grabbing device status. Again, it is probably redundant with gmetad, but it will let us monitor network equipment such as the switches, our ISP router and probably the console switch.

cheers,

-- Ashar Voultoiz - WP++++ http://en.wikipedia.org/wiki/User:Hashar http://www.livejournal.com/community/wikitech/ IM: hashar@jabber.org ICQ: 15325080


Jason Edgecombe

10 Nov

11:37 p.m.

Ashar Voultoiz wrote:

...


I would be VERY interested in nagios-ganglia integration. I recommend that you contact the nagios and ganglia teams about this. I think a lot of people have wanted this.

Mark Bergsma

11 Nov

7:20 p.m.

Ashar Voultoiz wrote:

...

Nagios is still on larousse, although it is not running at the moment. I could easily upgrade it to the latest version (2.0b4) and tweak the config files to add the new servers (something like 60+ new friends).

We will have to choose a server to run nagios on. Larousse seems to be a good choice, as it is mostly idle: it serves pages for http://noc.wikimedia.org/ and was used for servmon. Larousse could become THE monitoring device (and we could eventually move ganglia from zwinger to larousse).

Yes. Although larousse is getting old, and the install running on it is too. We might want to do a reinstall before that.

...

Reusing gmetad data is probably a better idea: the data in nagios and ganglia would be the same. One problem is that we will have to code a nagios plugin that caches the gmetad data to avoid multiple queries (we probably don't want to query gmetad for CPU, then for memory, then for NFS calls, then for each disk's space usage).

I don't know ganglia too well, but this seems like the best option to investigate. If ganglia is flexible and uncomplicated enough to add new metrics easily, then this could certainly work.

...

SNMP is a great tool for grabbing device status. Again, it is probably redundant with gmetad, but it will let us monitor network equipment such as the switches, our ISP router and probably the console switch.

Can we use SNMP for devices that support it, and use ganglia for the rest?

In my experience, SNMP is nice and easy for things that the standard net-snmpd supports, but it gets nasty beyond that, i.e. if you want to add things yourself...

-- Mark mark@nedworks.org

Ashar Voultoiz

13 Nov

1:56 p.m.

Mark Bergsma wrote: <snip>

...

Yes. Although larousse is getting old, and the install running on it is too. We might want to do a reinstall before that.

Fedora Core 2, actually. Should I ask Solar to upgrade it to FC3, then? :o)

...

...
Reusing gmetad data is probably a better idea: the data in nagios and ganglia would be the same. One problem is that we will have to code a nagios plugin that caches the gmetad data to avoid multiple queries (we probably don't want to query gmetad for CPU, then for memory, then for NFS calls, then for each disk's space usage).

I don't know ganglia too well, but this seems like the best option to investigate. If ganglia is flexible and uncomplicated enough to add new metrics easily, then this could certainly work.

Tim Starling added a metric for NFS server calls. So we can probably add some more stuff.

...

Can we use SNMP for devices that support it, and use ganglia for the rest?

In my experience, SNMP is nice and easy for things that the standard net-snmpd supports, but it gets nasty beyond that, i.e. if you want to add things yourself...

In my experience, adding new measures in snmp is easy. You add a script that returns the data, and the OID is generated accordingly. Debian has some nice examples in its snmpd.conf.

Can we start installing snmpd on all servers to at least get some basic data? :o)

cheers,

-- Ashar Voultoiz - WP++++ http://en.wikipedia.org/wiki/User:Hashar http://www.livejournal.com/community/wikitech/ IM: hashar@jabber.org ICQ: 15325080

Ivan Krstic

2:04 p.m.

Ashar Voultoiz wrote:

...

In my experience, adding new measures in snmp is easy. You add a script that returns the data, and the OID is generated accordingly.

It's basically the same thing with ganglia metrics.

-- Ivan Krstic krstic@fas.harvard.edu | 0x147C722D

Ashar Voultoiz

14 Nov

7:03 p.m.

Ivan Krstic wrote:

...

It's basically the same thing with ganglia metrics.

Good! Now I am off to RTFM :o)

-- Ashar Voultoiz - WP++++ http://en.wikipedia.org/wiki/User:Hashar http://www.livejournal.com/community/wikitech/ IM: hashar@jabber.org ICQ: 15325080

Mark Bergsma

13 Nov

4:39 p.m.

Ashar Voultoiz wrote:

...

Fedora Core 2, actually. Should I ask Solar to upgrade it to FC3, then? :o)

It's not Solar doing that, and it's already scheduled. Problem is, we need to move stuff off it first.

...

...
In my experience, SNMP is nice and easy for things that the standard net-snmpd supports, but it gets nasty beyond that, i.e. if you want to add things yourself...

In my experience, adding new measures in snmp is easy. You add a script that returns the data, and the OID is generated accordingly. Debian has some nice examples in its snmpd.conf.

Yes, I have done so too, and I always found it a pain. The interface isn't exactly good/flexible.

...

Can we start installing snmpd on all servers to at least get some basic data? :o)

That's exactly the same data ganglia is currently monitoring, so I don't really see the point...

-- Mark mark@nedworks.org

Ashar Voultoiz

14 Nov

7:03 p.m.

Mark Bergsma wrote: <snip>

...

...
Can we start installing snmpd on all servers to at least get some basic data? :o)

That's exactly the same data ganglia is currently monitoring, so I don't really see the point...

So let's write ganglia scripts :o)

If we want to monitor 15 services every minute, we will have to telnet to gmetad about every 4 seconds (60 seconds / 15 checks). We could build a caching system, though:

Check gmetad, cache the result for one minute, then have the nagios plugins grep the cache instead of telnetting to gmetad.

I think I have an idea about how to handle that.
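A minimal sketch of that caching idea in Python (not the plugin from this thread, which was written in Perl): the cache file is reused while it is younger than one minute, and gmetad is queried again only once it has expired. The `fake_fetch` function stands in for the real query and is purely hypothetical; the cache file name is borrowed from later in the thread.

```python
import os
import tempfile
import time


def cached_fetch(cache_path, fetch, max_age=60.0, now=None):
    """Return cached text if the file is younger than max_age seconds, else refetch."""
    now = time.time() if now is None else now
    try:
        if now - os.path.getmtime(cache_path) < max_age:
            with open(cache_path, "r", encoding="utf-8") as fh:
                return fh.read()
    except OSError:
        pass  # no cache file yet (or unreadable): fall through and refetch
    text = fetch()
    # Write to a temporary file and rename it, so a concurrent check
    # never reads a half-written cache.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(cache_path) or ".")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(text)
    os.replace(tmp_path, cache_path)
    return text


calls = []


def fake_fetch():
    """Hypothetical stand-in for the real gmetad query; counts how often it runs."""
    calls.append(1)
    return "<GANGLIA_XML/>"


demo_dir = tempfile.mkdtemp()
demo_cache = os.path.join(demo_dir, "gmetad-cache.xml")
for _ in range(15):  # fifteen checks within the same minute
    cached_fetch(demo_cache, fake_fetch)
print("checks: 15, gmetad queries:", len(calls))
```

Passing `now` explicitly is only there to make expiry easy to exercise; a real plugin would rely on the default.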

-- Ashar Voultoiz - WP++++ http://en.wikipedia.org/wiki/User:Hashar http://www.livejournal.com/community/wikitech/ IM: hashar@jabber.org ICQ: 15325080

Tim Starling

7:53 p.m.

Ashar Voultoiz wrote:

...


I wrote a perl script a while back to poll the gmond XML output from one machine and stop or start a process on another machine based on the value of a metric retrieved. I didn't use telnet (ick), I read from a socket and then used an XPath module to find the metric in the XML. It's probably lying around in my home directory somewhere if you want to look at it.
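For readers who want the general shape of such a script, here is a rough Python sketch (the original was Perl and is not reproduced here). It assumes gmond serves its XML dump on TCP port 8649, the usual default, and closes the connection once the dump is sent. The XPath expression uses the subset supported by Python's ElementTree. The demonstration feeds invented XML through a local socket pair in place of a real gmond.

```python
import socket
import xml.etree.ElementTree as ET


def read_all(sock, chunk_size=4096):
    """Read from a connected socket until the peer closes it."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)
        if not data:
            break
        chunks.append(data)
    return b"".join(chunks)


def fetch_gmond_xml(host, port=8649, timeout=10.0):
    """Connect to gmond's XML port (assumed default 8649) and return the dump."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return read_all(sock)


def find_metric(xml_bytes, host, metric):
    """Return the VAL attribute of one metric on one host, or None if absent."""
    root = ET.fromstring(xml_bytes)
    node = root.find(f".//HOST[@NAME='{host}']/METRIC[@NAME='{metric}']")
    return None if node is None else node.get("VAL")


# Demonstration without a real gmond: invented XML sent through a socket pair.
left, right = socket.socketpair()
left.sendall(b"<GANGLIA_XML><CLUSTER NAME='c'><HOST NAME='srv1'>"
             b"<METRIC NAME='load_one' VAL='0.42'/></HOST></CLUSTER></GANGLIA_XML>")
left.close()
sample = read_all(right)
right.close()
print(find_metric(sample, "srv1", "load_one"))
```

Against a live daemon, `fetch_gmond_xml("some-host")` would replace the socket pair; the host name is whatever machine runs gmond.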

If caching is required, then adding metrics to nagios is obviously not the same as adding metrics to ganglia. For ganglia, you run gmetric whenever a metric changes, so you can have a loop that sets 30 metrics in each pass if you like. You don't give it a plugin for it to invoke at its leisure, you make your own daemon.
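The "make your own daemon" pattern can be sketched as follows, in Python for illustration. It assumes the `gmetric` command is on the PATH and accepts its `--name`, `--value`, `--type` and `--units` options. The metric names and the `collect` function are hypothetical placeholders.

```python
import subprocess
import time


def gmetric_argv(name, value, metric_type="float", units=""):
    """Build the command line for one gmetric call."""
    argv = ["gmetric", "--name", name, "--value", str(value), "--type", metric_type]
    if units:
        argv += ["--units", units]
    return argv


def collect():
    """Hypothetical collector: return (name, value, type, units) tuples."""
    return [
        ("example_queue_length", 3, "uint32", "jobs"),
        ("example_error_rate", 0.5, "float", "errors/s"),
    ]


def run_forever(interval=60.0):
    """Publish every collected metric in each pass, then sleep."""
    while True:
        for name, value, metric_type, units in collect():
            # check=False: one failed gmetric call should not kill the daemon.
            subprocess.run(gmetric_argv(name, value, metric_type, units), check=False)
        time.sleep(interval)


# Show the command that would be run, without needing gmetric installed.
print(gmetric_argv("example_queue_length", 3, "uint32", "jobs"))
```

Calling `run_forever()` would start the loop; it is left uncalled here so the sketch runs on a machine without ganglia.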

-- Tim Starling

Ashar Voultoiz

20 Nov

5:17 p.m.

Tim Starling wrote: <snip>

...

I wrote a perl script a while back to poll the gmond XML output from one machine and stop or start a process on another machine based on the value of a metric retrieved. I didn't use telnet (ick), I read from a socket and then used an XPath module to find the metric in the XML. It's probably lying around in my home directory somewhere if you want to look at it.

Hello,

I have written a little perl plugin for nagios that lets us grab a given metric for a given host. The plugin implements caching of the gmetad data, caching of the XML parse, and handling of warning / critical thresholds.

I have also given a little configuration example for the nagios checkcommands.cfg.
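As an illustration of the threshold handling only (this is not the Perl plugin itself, whose source is not shown in this thread), a Python sketch could map a metric value to the standard Nagios plugin exit codes: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. It assumes "higher is worse" metrics; the metric name `disk_used_percent` is hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(metric, raw_value, warn, crit):
    """Map a metric value to an (exit_code, status_line) pair.

    Assumes "higher is worse" (e.g. disk usage in percent), so crit >= warn.
    """
    try:
        value = float(raw_value)
    except (TypeError, ValueError):
        # Metric missing from the cache, or not a number.
        return UNKNOWN, f"UNKNOWN - {metric} has no numeric value"
    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"{LABELS[code]} - {metric} = {value:g}"


code, line = evaluate("disk_used_percent", "95", warn=90, crit=95)
print(line)
# A real plugin would finish with sys.exit(code) so nagios sees the status.
```

Returning the code instead of exiting directly keeps the function easy to test; nagios reads both the exit status and the first line of output.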

Files are in /home/hashar/gmeta-nagios/ :

cbg_commands.cfg      The nagios configuration for the plugin
check_by_gmetad.pl    The plugin itself, uses perl
gmetad-cache.stor     XML parse cache
gmetad-cache.xml      XML grabbed from the gmetad host

So now it is just waiting on the FC3 upgrade of larousse and a nagios compile.

...

If caching is required, then adding metrics to nagios is obviously not the same as adding metrics to ganglia. For ganglia, you run gmetric whenever a metric changes, so you can have a loop that sets 30 metrics in each pass if you like. You don't give it a plugin for it to invoke at its leisure, you make your own daemon.

I will set up a basic nagios installation first, then we can work on metrics. Memcached instances / errors might be interesting as well as mysql replication lag for slaves.

cheers,

-- Ashar Voultoiz - WP++++ http://en.wikipedia.org/wiki/User:Hashar http://www.livejournal.com/community/wikitech/ IM: hashar@jabber.org ICQ: 15325080


wikitech-l@lists.wikimedia.org
