Hello,
This email is intended for the Wikimedia server administrators.
I installed and configured Nagios ( http://www.nagios.org/ ), which is a
web-based tool to monitor hosts and services, as well as to plan downtime
and comment on service outages.
It can be accessed at:
http://zwinger.wikipedia.org/~hashar/nagios/
The login and password are in our zwinger:doc directory.
Nagios is split into 3 parts:
1/ web engine (using cgi)
2/ a daemon
3/ plugins to check services (like check_http, check_ping)
The web engine and daemon are currently installed in my home dir on zwinger.
The plugins should be deployed on all machines so that local services can
be checked through ssh, which is not really done yet. Currently it only
tests ssh, http, ping and squid availability.
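To give an idea of what a plugin looks like: a Nagios plugin is just a
program that prints one status line and exits with 0 (OK), 1 (WARNING),
2 (CRITICAL) or 3 (UNKNOWN). Below is a minimal Python sketch of a local
disk check of the kind that could be run through ssh. It is not one of
the installed plugins; the function names and the 80% / 90% thresholds
are assumptions chosen for illustration.

```python
import shutil

# Exit codes that Nagios plugins conventionally return.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_usage(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage percentage to (exit_code, status_line).

    The warn/crit thresholds are illustrative assumptions.
    """
    if percent_used > crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.1f}% used"
    if percent_used > warn:
        return WARNING, f"DISK WARNING - {percent_used:.1f}% used"
    return OK, f"DISK OK - {percent_used:.1f}% used"


def check_disk(path="/", warn=80.0, crit=90.0):
    """Measure usage of the filesystem holding `path` and classify it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot stat {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    percent_used = 100.0 * usage.used / usage.total
    return classify_usage(percent_used, warn, crit)
```

A real plugin would print the status line and pass the code to
sys.exit(), so that the daemon can read the result from the exit status.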
Servers can be assigned to a hostgroup. Each hostgroup can be assigned a
contact group (e.g. database administrators) with several contacts (shai,
brion, jeronim ...). When a service or host goes down, a notification is
sent to the group; we can even have notifications enabled only during
work hours!
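The hostgroup / contact group / work hours relationship can be sketched
roughly as follows. This is only a Python model of the idea, not Nagios
configuration syntax; the group names, the 09:00-18:00 Monday-to-Friday
window and the function names are assumptions for illustration.

```python
from datetime import datetime, time

# Illustrative data only; the names come from the examples in this email.
CONTACT_GROUPS = {"database-admins": ["shai", "brion", "jeronim"]}
HOSTGROUP_CONTACTS = {"mysql-servers": "database-admins"}
WORK_HOURS = (time(9, 0), time(18, 0))  # assumed window, Monday to Friday


def in_work_hours(moment, window=WORK_HOURS):
    """True if `moment` is on a weekday and inside the work-hours window."""
    start, end = window
    return moment.weekday() < 5 and start <= moment.time() < end


def contacts_to_notify(hostgroup, moment, work_hours_only=True):
    """Return the contacts to alert when something in `hostgroup` goes down."""
    group = HOSTGROUP_CONTACTS.get(hostgroup)
    if group is None:
        return []
    if work_hours_only and not in_work_hours(moment):
        return []
    return list(CONTACT_GROUPS.get(group, []))
```

For example, contacts_to_notify("mysql-servers", datetime(2024, 1, 1, 10, 0))
returns the three contacts, while the same call on a Saturday returns an
empty list unless work_hours_only is set to False.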
Let's take a tour of the interface:
'Tactical Overview' gives you a global report of what is currently down
across the network, be it disk usage > 90%, a host down or ssh not answering.
'Host detail' lets you know which servers are up or down, while 'Host
service' gives you a more detailed view (like http on alrazi is down). In
each view you will see when the host/service was last checked.
'Status overview' aggregates the server views per hostgroup (mysql-servers,
apache-servers ...).
The most interesting feature is being able to add a comment for every
outage and to acknowledge a problem (for example when you are working on
it). You can try it by having a look at the Yang switch, for which I
briefly explain why it is down. We can even stop monitoring this host.
Notifications are sent through plugins. Currently only email is more or
less available, but we can also have messages sent to a pager or to irc :o)
Another feature is the ability to plan a downtime window. During that
window, the host is no longer monitored and no notification is sent.
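The downtime logic amounts to a simple time-range test, sketched below in
Python. This is not how Nagios implements it internally; the function
names and the dictionary layout are assumptions for illustration.

```python
from datetime import datetime


def in_downtime(moment, windows):
    """True if `moment` falls inside any (start, end) window; end is exclusive."""
    return any(start <= moment < end for start, end in windows)


def should_notify(host, moment, downtimes):
    """Suppress notifications for `host` while one of its windows is active.

    `downtimes` maps a host name to a list of (start, end) datetimes.
    """
    return not in_downtime(moment, downtimes.get(host, []))
```

With a window planned for a host from 02:00 to 04:00, should_notify()
returns False for an outage at 03:00 and True again from 04:00 onwards.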
Also have a look at the various reporting forms available. They are a good
way to find which hosts have recurring problems.
Basically, I would like, if possible, to have everyone look at the
interface and play with it a bit, then come back to me with their
impressions. If we are interested in having this tool available, I will
write a doc on how to install it cleanly on the foundation servers :o)
cheers,
--
Ashar Voultoiz - WP++++
http://en.wikipedia.org/wiki/User:Hashar