Overview

On February 14, 2012, the Fermi Nagios notification system was migrated to the SCS Nagios system. Access privileges are required to use the commands in this document.

This documentation is intended to show someone new to Nagios how to work within SCS Nagios. It is a starting guide only: it is not exhaustive and focuses on the Nagios features available to Nagios administrators for Fermi. It outlines the general procedure for completing Nagios-related tasks, such as scheduling a check, setting downtime, or acknowledging an alert.

Nagios has extensive documentation available online (we are using Nagios 3.2.3): Nagios Core Version 3.x Documentation, and Nagios Documentation (Library, Guides and Links).


What is Nagios?

Nagios is a network and system monitoring application. Essentially, Nagios watches specified hosts and services, then alerts you when something goes wrong. SLAC Nagios is currently running Nagios Core 3.2.3.


Motivation

The Fermi Gamma-ray Telescope mission relies heavily on computers and network services for tasks such as Science Analysis Software development, ground processing and data handling, and data storage. The Fermi team depends on machines and services at SLAC, Goddard Space Flight Center, and other participating institutions. Team members therefore need to know immediately whenever equipment, services, or hosts have problems. Nagios 3.x is the open source monitoring software that we use to watch hosts and services and to alert team members about specified host and service events.


Prerequisites

  • login access to nagios02.slac.stanford.edu
  • read access to the following directories and executables on nagios02.slac.stanford.edu:
    • /etc/nagios and all subdirectories
    • /usr/lib64/nagios/plugins
  • privileges to run remctl
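
A quick way to confirm the read-access prerequisites from a shell on nagios02 is sketched below. The function name check_readable is made up for illustration; the paths in the usage comment are the ones listed above.

```shell
# check_readable: hypothetical helper that reports whether each given path
# is readable by the current user. Returns non-zero if any path is not.
check_readable() {
  missing=0
  for path in "$@"; do
    if [ -r "$path" ]; then
      echo "ok: $path"
    else
      echo "no read access: $path"
      missing=1
    fi
  done
  return "$missing"
}

# Usage on nagios02 (paths from the list above):
# check_readable /etc/nagios /usr/lib64/nagios/plugins
```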

What is remctl?

remctl (the client) and remctld (the server) implement a client/server protocol for running single commands on a remote host using Kerberos v5 authentication and returning the output. They use a very simple GSS-API-authenticated network protocol, combined with server-side ACL support and a server configuration file that maps remctl commands to programs that should be run when that command is called by an authorized user.
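
Because remctl authenticates with Kerberos, a command will fail if you do not hold a valid ticket. The sketch below checks for one before calling remctl. It assumes the standard klist and kinit tools are installed; the function names have_ticket and require_ticket, and the KLIST override variable, are illustrative only.

```shell
# have_ticket: hypothetical helper; succeeds if a valid Kerberos ticket cache
# exists. "klist -s" runs silently and reports validity through its exit status.
# KLIST may be set to substitute another command (for example in tests).
have_ticket() {
  "${KLIST:-klist}" -s
}

# require_ticket: print a hint and fail when no valid ticket is found.
require_ticket() {
  if have_ticket; then
    echo "Kerberos ticket found"
  else
    echo "No valid Kerberos ticket; run kinit first" >&2
    return 1
  fi
}

# Usage:
# require_ticket && remctl -p 4373 nagios02 nagios help
```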


Remctl Commands

As a Nagios administrator for Fermi, you can send three commands to Nagios using remctl (the port is 4373 for now).

Commands are:

  • schedule - schedule a check on a host/service
  • downtime - a host/service is known to be down, so do not notify the contacts for a while
  • acknowledge - the alert for a host/service has been acknowledged, so stop sending alert notices (written as ack in the examples below)

The basic formats for a host or a service are shown below. Fill in the uppercase words with the desired command, host or service, and comment. You may shorten the nagios02.slac.stanford.edu address to nagios02.

remctl -p 4373 nagios02.slac.stanford.edu nagios COMMAND host HOSTNAME COMMENT

remctl -p 4373 nagios02.slac.stanford.edu nagios COMMAND host HOSTNAME HOURS COMMENT

remctl -p 4373 nagios02.slac.stanford.edu nagios COMMAND host HOSTNAME MINUTES COMMENT

remctl -p 4373 nagios02.slac.stanford.edu nagios COMMAND service HOSTNAME SERVICENAME COMMENT

remctl -p 4373 nagios02.slac.stanford.edu nagios COMMAND service HOSTNAME SERVICENAME HOURS COMMENT

remctl -p 4373 nagios02.slac.stanford.edu nagios COMMAND service HOSTNAME SERVICENAME MINUTES COMMENT

Note: You can also issue the remctl command without the port number.
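
The formats above all share one prefix, so a small wrapper can cut down on typing. This is a sketch only: the function name fermi_nagios and the NAGIOS_DRY_RUN variable are invented for illustration, while the host, port, and nagios command name come from the formats above.

```shell
# Server and port as given in the formats above.
NAGIOS_HOST="nagios02.slac.stanford.edu"
NAGIOS_PORT=4373

# fermi_nagios: hypothetical wrapper that forwards its arguments to remctl.
# Set NAGIOS_DRY_RUN to any non-empty value to print the command instead of
# running it (the printed line does not preserve quoting around the comment).
fermi_nagios() {
  if [ "$#" -lt 1 ]; then
    echo "usage: fermi_nagios COMMAND [host|service] ARGS..." >&2
    return 2
  fi
  if [ -n "${NAGIOS_DRY_RUN:-}" ]; then
    echo remctl -p "$NAGIOS_PORT" "$NAGIOS_HOST" nagios "$@"
  else
    remctl -p "$NAGIOS_PORT" "$NAGIOS_HOST" nagios "$@"
  fi
}

# Usage:
# fermi_nagios downtime host HOSTNAME HOURS 'COMMENT'
```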


Examples

remctl -p 4373 nagios02 nagios ack host wain007 'there is a problem, we are working on it'

remctl -p 4373 nagios02 nagios ack service wain007 xroot-wain007 'ack - RT 9999'

remctl -p 4373 nagios02.slac.stanford.edu nagios downtime host wain007 5 'down for 5 hours'

remctl -p 4373 nagios02.slac.stanford.edu nagios downtime service sulky46 u02-diskspace 5 'service down for 5 hrs'

remctl -p 4373 nagios02.slac.stanford.edu nagios schedule host sulky46 5 'run checks in 5 mins'

remctl -p 4373 nagios02.slac.stanford.edu nagios schedule service sulky46 u02-diskspace 5 'check disk u02 in 5 mins'

remctl -p 4373 nagios02 nagios help

remctl -p 4373 nagios02 nagios man
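
When several hosts share the same maintenance window, the host downtime example can be repeated in a loop. The helper below is a minimal sketch, assuming the downtime form shown above; the name downtime_hosts and its argument order are hypothetical.

```shell
# downtime_hosts: hypothetical helper that requests downtime for several hosts.
# Arguments: HOURS COMMENT HOST [HOST...]
downtime_hosts() {
  if [ "$#" -lt 3 ]; then
    echo "usage: downtime_hosts HOURS COMMENT HOST [HOST...]" >&2
    return 2
  fi
  hours=$1
  comment=$2
  shift 2
  failed=0
  for host in "$@"; do
    # Same form as the single-host downtime example above.
    if ! remctl -p 4373 nagios02.slac.stanford.edu nagios downtime host "$host" "$hours" "$comment"; then
      echo "downtime request failed for $host" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage (host names taken from the examples above):
# downtime_hosts 5 'down for 5 hours' wain007 sulky46
```

The loop keeps going after a failed request and reports the failure through its exit status, so one unreachable host does not block the rest.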


Basic Details

All the configuration files for Scientific Computing reside under /etc/nagios/sca/ on nagios02.slac.stanford.edu. This directory contains one subdirectory per monitoring type. These are:

  • fermi
  • cdms
  • exo

Each of these directories contains at least two files: host.cfg (a list of hosts and host groups) and service.cfg (a list of services and service groups). Inheritance is used heavily to avoid repeating the same service checks or host directives for each individual host. A generic template of basic services (e.g. ping, ssh) is inherited by all host files first, and the specific service directives are then added on top. The service definitions can be found under /etc/nagios/conf.d/sca/fermi/www
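
Nagios expresses inheritance with the "use" directive inside an object definition. To see which templates a given file inherits from, something like the following may help. The function name list_templates is hypothetical, and the file is passed as an argument because the exact path depends on the directory layout described above.

```shell
# list_templates: hypothetical helper that prints the unique template names
# referenced by "use" directives in a Nagios object configuration file.
# A directive may name several templates separated by commas, so split those.
list_templates() {
  awk '$1 == "use" { print $2 }' "$1" | tr ',' '\n' | sort -u
}

# Usage (path assembled from the description above; adjust if it differs):
# list_templates /etc/nagios/sca/fermi/host.cfg
```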

Sidebar

To modify the sidebar, you must check out the code from svn. The command is:

svn checkout file://localhost/afs/slac.stanford.edu/g/scs/svn/systems/nagios-fermi-web

The sidebar is a temporary solution so SCS can work on groundworks.
