Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Summary

https://nagios02.slac.stanford.edu/

What is Nagios?

Nagios is an open-source monitoring tool. It is used at SLAC to automatically watch key hosts and services, and to contact appropriate personnel when/if they go down.

The primary SLAC Nagios instance is run by SCS. The web interface is available at https://nagios02.slac.stanford.edu/.

How do I use this service?

Please contact =unix-admin@slac.stanford.edu= to discuss adding your hosts and services to the central SLAC Nagios service.

Nagios for Users

Checks

Each host/service check is performed by invoking a command on the nagios server. These checks are generally extremely light-weight and specific:

  • Is this host pingable?
  • Can I reach the web port on this box?
  • What is the system load on this host, and is it within a certain range?

Each check is run every 5-10 minutes (configurable), and returns one of three values - 'OK', 'WARNING', or 'CRITICAL' - as well as some explanatory text. The return values are used to determine the current state of the service. Anything other than 'OK' will eventually trigger alerts.

If a problem is found, the Nagios server will immediately schedule another check; if the problem continues, then Nagios will mark the service as 'WARNING' or 'CRITICAL', and send out an appropriate alert to the pre-defined parties. Alerts will continue to be sent periodically (about once an hour) until the service recovers; at that point, a 'RECOVERY' alert is sent.

As of the time of writing, the nagios server runs ~6000 of these checks. This is likely to grow.

Acknowledging Alerts and Scheduling Downtime

By acknowledging an alert for a host or service, you can keep nagios from sending out notifications for the problem (although it will continue to probe the host/service). You don't have to do this, but it will keep your email load lighter, as well as tell other interested parties that somebody is looking into problems.

Similarly, there is a method to schedule downtime for hosts or services. If you know that you're about to reboot a system, and don't want to be notified, you can schedule 10 minutes of downtime with nagios; if you know that you're about to tinker with a service, you may want to schedule a few hours of downtime.

Alerts are acknowledged and downtimes are scheduled from the command-line via remctl.   remctl is a Kerberos-based client-server protocol that provides authenticated per-user/principle access to specific backend commands; in this case, the back end is a locally-maintained script that allows users to interact with nagios directly.  This generally works as so:

This generally works as so:

...