Overview

On February 14, 2012, Fermi Nagios notification was migrated to the SCS Nagios system. Access privileges are required to use the commands in this document.

This documentation is a starting guide for someone new to Nagios who needs to work within SCS Nagios. It is not exhaustive and covers only the Nagios features available to Nagios administrators for Fermi. It outlines the general procedure for completing Nagios-related tasks, such as

  • Schedule Downtime For Host or Service
  • Where exactly the Nagios software resides
  • Procedure to modify the Nagios configuration for a host or service
  • View Nagios configuration in SVN

Nagios has extensive documentation available online (we are using Nagios 3.2.3): Nagios Core Version 3.x Documentation, and Nagios Documentation (Library, Guides and Links).


What is Nagios?

Nagios is a network and system monitoring application. Essentially, Nagios watches specified hosts and services, then alerts you when something goes wrong. SLAC Nagios currently runs Nagios Core 3.2.3.


Motivation

The Fermi Gamma-ray Telescope mission relies heavily on computers and network services for tasks such as Science Analysis Software development, ground processing and data handling, and data storage. The Fermi team depends on machines and services at SLAC, Goddard Space Flight Center, and other participating institutions. Team members therefore need to know immediately whenever equipment, services, or hosts have problems. Nagios 3.x is the open source monitoring software that we use to watch hosts and services and to alert team members about specified host and service events.


Prerequisites

  • login access to nagios02.slac.stanford.edu
  • read access to the following directories and executables on nagios02.slac.stanford.edu
    • /etc/nagios and all subdirectories
    • /usr/lib64/nagios/plugins
  • privileges to run remctl
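
The local prerequisites above can be checked with a short script. This is a minimal sketch, not a site-provided tool: the function name check_prereqs is hypothetical, and the script only tests that each directory is readable and that a remctl client is on the PATH. It does not descend into subdirectories, and it cannot confirm that the server will authorize your remctl commands, since that is decided on the Nagios server.

```shell
#!/bin/sh
# Sketch of a prerequisite check; intended to be run on nagios02.slac.stanford.edu.

# check_prereqs DIR...
# Reports whether each directory is readable and whether a remctl client
# is installed. Returns 0 only if every check passes.
check_prereqs() {
    status=0
    for dir in "$@"; do
        # A directory needs read and search (execute) permission to be listed.
        if [ -d "$dir" ] && [ -r "$dir" ] && [ -x "$dir" ]; then
            echo "ok: $dir"
        else
            echo "missing or unreadable: $dir"
            status=1
        fi
    done
    # Only checks that the client exists, not that the server grants access.
    if command -v remctl >/dev/null 2>&1; then
        echo "ok: remctl"
    else
        echo "missing: remctl"
        status=1
    fi
    return "$status"
}

# Usage:
#   check_prereqs /etc/nagios /usr/lib64/nagios/plugins
```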

Basic Commands

As a Nagios administrator for Fermi, you can send three commands to Nagios using remctl (the port is always 46157):

  • schedule - schedule a check on a host/service
  • downtime - a host/service is known to be down, so stop contacting its contacts for a while
  • acknowledge - the alert for a host/service has been acknowledged, so stop sending alert notices

Command: acknowledge service alert

Format:

remctl -p <port> <nagios-server-name> nagios <cmd> <host or service> <hostname> <service name> <comment>

Example:

remctl -p 46157 nagios02 nagios ack service wain007 xroot-wain007 'xrootd daemon off temporarily'

To acknowledge an alert for a service:

remctl -p 46157 nagios02 nagios ack service wain007 xroot-wain007 'xrootd daemon off temporarily'
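
If you acknowledge service alerts often, the command above can be wrapped in a small shell function. This is only a sketch: the function name nagios_ack_service and the DRY_RUN variable are hypothetical conveniences, while the port, server name, and argument order are taken from the example above. With DRY_RUN=1 the function prints the command instead of sending it, which is useful for checking the quoting of the comment.

```shell
#!/bin/sh
# Sketch: wrapper around the acknowledge command documented on this page.
NAGIOS_PORT=46157       # port given in this document
NAGIOS_SERVER=nagios02  # server name used in the example

# nagios_ack_service <hostname> <service name> <comment>
nagios_ack_service() {
    if [ "$#" -ne 3 ]; then
        echo "usage: nagios_ack_service <hostname> <service name> <comment>" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command that would be sent, with the comment quoted.
        printf "remctl -p %s %s nagios ack service %s %s '%s'\n" \
            "$NAGIOS_PORT" "$NAGIOS_SERVER" "$1" "$2" "$3"
    else
        # Quoting "$3" passes the whole comment as a single argument.
        remctl -p "$NAGIOS_PORT" "$NAGIOS_SERVER" nagios ack service "$1" "$2" "$3"
    fi
}

# Usage:
#   DRY_RUN=1 nagios_ack_service wain007 xroot-wain007 'xrootd daemon off temporarily'
```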


Basic Details

All the configuration files for Scientific Computing reside under /etc/nagios/sca/ on nagios02.slac.stanford.edu. Within this directory are a number of subdirectories on a per-monitoring-type basis. These are:

  • fermi
  • cdms
  • exo

Within these directories are at least two files, a host.cfg (list of hosts and host groups) and a service.cfg (listing services and service groups). Inheritance is heavily used in order to avoid writing the same service checks or host directives repeatedly for each individual host. A generic template of basic services (e.g. ping, ssh) is inherited by all host files first and then the specific service directives are added on top of that. 
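
The inheritance described above can be illustrated with a Nagios 3.x object-definition fragment. This is a hedged sketch rather than a copy of the actual SCS files: the template names, the check command, and the directive values are hypothetical, and only the host and service names are borrowed from the example earlier on this page. A template is an object with a name directive and register 0; other objects pull in its directives with use and then add or override their own.

```
# Hypothetical service template: never registered as a real service.
define service {
    name                    example-generic-service  ; template name (assumed)
    max_check_attempts      3                        ; illustrative value
    notification_interval   60                       ; illustrative value
    register                0                        ; template only
}

# A concrete service that inherits the template's directives.
define service {
    use                     example-generic-service  ; inherit from the template
    host_name               wain007
    service_description     xroot-wain007
    check_command           example_check_xrootd     ; hypothetical command name
}
```

The fragment is deliberately incomplete. A working configuration would also need the matching host, contact, and command definitions.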
