Overview

This documentation is intended to show someone new to Nagios how to maintain it.  It is only a starting guide; it is therefore not exhaustive and focuses on the primary features of Nagios.  It outlines the general procedure for completing a Nagios-related task, covering topics such as:

  • How/Where Nagios requests are made
  • Where exactly the Nagios software resides
  • How to modify the Nagios configuration
  • How to back up the Nagios configuration (via CVS)

Nagios has extensive documentation available online (we are using Nagios 2.x):  http://support.nagios.com/knowledge-base/official-documentation


What is Nagios?

Nagios is a network and system monitoring application.  Essentially, Nagios watches specified hosts and services, then alerts you when something goes wrong.  We are currently using Nagios 2.x.


Motivation

The Fermi Gamma-ray Telescope mission relies heavily on computers and network services for tasks such as Science Analysis Software development, ground processing and data handling, and data storage.  The Fermi team depends heavily on machines and services at SLAC, Goddard Space Flight Center, and other participating institutions.  Team members therefore need to know immediately whenever equipment, services, or hosts have problems.  Nagios 2.x is the open source host and service monitoring software that we use to monitor hosts and services and to alert team members about specified host and service events.


Prerequisites

  • Access to glastlnx02 machine
  • JIRA account (contact Tony Johnson)
    • Request that JIRA emails you when requests are made to you
  • * "sudo" access to the following directory and executables on glastlnx02.slac.stanford.edu
    • /usr/bin/nagios (executable)
    • /etc/init.d/nagios (executable)
    • /usr/lib64/nagios/plugins/ (directory)
  • Read and Write permissions to /etc/nagios

* To request "sudo" access, first get permission/confirmation from Tony Johnson, then send an email request to the unix admin at SLAC. 


Nagios Maintenance Procedure

In general, completing a Nagios-related task can be summarized in the following three steps:

  1. A request is made to monitor a host or service
  2. A Nagios configuration change is then made
  3. The Nagios configuration changes are committed to the CVS repository

How Will You Be Notified about a Nagios Request?

In general, your requests will come through any of the following:

  • JIRA
  • Email
  • AIM/ICQ
  • Direct-human contact
    • Someone behind you could tap you on the shoulder or ask you in person (stay alert!)

How to Make a Nagios Configuration Change?

In general, this requires modifying a Nagios configuration file, or adding a new configuration file for Nagios.  Instructions on how to do this depend on the task.  Most tasks result in just creating a new object definition, which tells Nagios configuration information such as what to monitor, how often to check the service, and who to notify.

After changes to the configuration files are made, first verify the configuration files, then restart Nagios via command-line.

  1. Verify the configuration

       sudo /usr/bin/nagios -v /etc/nagios/nagios.cfg

     Resolve errors, if any, then proceed to reload Nagios.

  2. Reload Nagios

       sudo /etc/init.d/nagios reload

     or

       sudo /sbin/service nagios restart    (for nagios-test only)
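
The verify-then-reload rule above can be sketched as a small Python helper.  This is only an illustration, not something installed on glastlnx02: the function name verify_then_reload is hypothetical, and the sketch assumes that "nagios -v" exits with a non-zero status when the configuration check finds errors.

```python
import subprocess

# Commands taken from the procedure above.
VERIFY_CMD = ["sudo", "/usr/bin/nagios", "-v", "/etc/nagios/nagios.cfg"]
RELOAD_CMD = ["sudo", "/etc/init.d/nagios", "reload"]


def verify_then_reload(run=subprocess.call):
    """Reload Nagios only if the configuration check exits with status 0.

    `run` takes a command list and returns its exit status.  It is a
    parameter so the logic can be tried out without touching a real Nagios.
    Returns True if the reload was attempted and reported success.
    """
    # Assumption: a non-zero exit status from "nagios -v" means errors.
    if run(VERIFY_CMD) != 0:
        print("Configuration check failed; fix the errors before reloading.")
        return False
    return run(RELOAD_CMD) == 0
```

The point of the design is simply that the reload command is never reached when verification fails, which is the habit the manual procedure is trying to build.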

Nagios Basics

At the very basic level...

  1. Something needs to be monitored
  2. Somebody needs to be notified when something goes wrong

An example scenario...

Someone makes a request (maybe through JIRA, or ICQ, or even email)...

Dear Nagios-admin:

Please monitor the u05 nfs disk space.

Now the above request will likely seem vague to the beginner, but this is natural.  With experience, you will learn to read minds, but please do seek clarification when needed.  Now there are probably many questions that you have from the above request, but the key questions are

  1. What host does the u05 disk reside on?   (Answer:  In this case, I just queried for the u05 disk via the "df" command and got the value from under the "Filesystem" listing)
  2. How do you tell Nagios to perform the check?  (Answer:  Create a service definition and store it in the configuration file named after the host to which the service belongs)
  3. Who should get notified about this service?  (Answer:  Unless otherwise specified, the "core" contact-group)

Since u05 is an nfs disk (as inferred from the request), let's find its host via the command line (note that the path to the nfs disks is /nfs/farm/g/glast/):

user@glastlnx02$ df /nfs/farm/g/glast/u05

Filesystem           1K-blocks      Used Available Use% Mounted on
sulky46:/g.glast.u05 196213184 182682240  11568800  95% /a/sulky46/g.glast.u05

Under "Filesystem," we find the host name, sulky46. 
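
If you find yourself doing this lookup often, the host name can be pulled out of the "df" output with a few lines of Python.  This is a minimal sketch under stated assumptions: the function name nfs_host_from_df is hypothetical, and it assumes the entry for the disk is the last line of the output and is not wrapped across two lines (as "df" sometimes does for long names).

```python
def nfs_host_from_df(df_output):
    """Return the server name from the 'Filesystem' column of `df` output.

    For an NFS mount, that column has the form host:/exported/path.
    Returns None if the last line does not look like an NFS mount.
    """
    lines = [line for line in df_output.splitlines() if line.strip()]
    if len(lines) < 2:
        return None
    # First whitespace-separated field of the data line is "Filesystem".
    filesystem = lines[-1].split()[0]
    host, sep, _path = filesystem.partition(":")
    return host if sep else None


sample = """Filesystem           1K-blocks      Used Available Use% Mounted on
sulky46:/g.glast.u05 196213184 182682240  11568800  95% /a/sulky46/g.glast.u05"""
print(nfs_host_from_df(sample))  # prints: sulky46
```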

Q:  So what do we do with the information about the host name?

A:  Our convention is to include a service definition in a Nagios configuration file named after the host that the service belongs to.  So for example, if the u05 disk resides on sulky46, then we create (if it does not already exist) a file named sulky46.cfg, and place the service definition for the u05 disk space checks in that file.

What's the business about "service definitions"?  I will explain the details later, but the objective here is to give you the big picture.  Here are the steps to complete the task:

  1. Check to see if there exists a Nagios configuration file for the host that u05 resides on (if there is no Nagios configuration file for u05's host, then create one)
  2. Add a service definition for the "u05 disk space" to the configuration file named after the host that u05 resides on
  3. Add the service description to the appropriate servicegroup in servicegroups.cfg
  4. Check that the configuration is syntactically correct
  5. Test that the new service definition is correct
  6. Reload Nagios

Let's go through each step together...

1. Check to see if there exists a Nagios configuration file for the host that u05 resides on

Fortunately, we already have a Nagios configuration file for the host, sulky46:  /etc/nagios/hosts/hosts/nfs_servers/sulky46.cfg

2. Add a service definition to sulky46.cfg

Here is the service definition to check the u05 disk space.  It issues a WARNING alert if less than 5% of the disk space is free, and a CRITICAL alert if less than 2% is free:

   define service{
   use                                 generic-service
   host_name                           sulky46.slac.stanford.edu
   service_description                 u05 disk space
   is_volatile                         0
   check_period                        24x7
   max_check_attempts                  4
   normal_check_interval               5
   retry_check_interval                1
   contact_groups                      core
   notification_options                w,u,c,r
   notification_interval               960
   notification_period                 24x7
   check_command                       check_nfs_disk!5!2!/nfs/farm/g/glast/u05
   notifications_enabled               1
 }

The left column contains the directives, and the right column contains their values.  The values of some directives (host_name, service_description, contact_groups, and check_command) generally vary with the service definition, and depend on the host name and the service you are creating.  The values of the remaining directives generally stay the same from one service definition to the next unless someone requests something different.

For a detailed description of the directives and what they mean, see the section "Template-Based Object Configuration" of the Nagios 2.x documentation:  http://nagios.sourceforge.net/docs/2_0/xodtemplate.html.  I will describe the main directives, i.e., the ones whose values vary.

contact_groups

This directive is used to specify the contact groups to be notified about this service.  The contact_groups value is set to "core."  This means that members of the "core" contact group will receive service alerts for this particular service, "u05 disk space."  All contacts and contact groups are defined in /etc/nagios/contacts.cfg.  The format of contact and contact-group definitions is generally the same as for all other object definitions, but the available directives depend on which object definition you are using (e.g., service group, host or hostgroup, command, etc.).

In the file /etc/nagios/contacts.cfg, we can see the definition of the "core" contact group:

define contactgroup{
contactgroup_name       core
alias                   Core Group
members                 nagios, Navid, Richard, Heather, Karen, Tom, ecephas
}

The above members will all receive notifications about this service, "u05 disk space."

check_command

This directive is the most important directive, and without it, a service definition is useless.  This directive specifies the command to be used to perform the actual service check.  The parameters for the command are separated by the "!" character. 

The value of a check_command is an identifier associated with an actual program. In this case, the identifier is "check_nfs_disk."  A check_command is created by making a "command_definition," in which the association between an identifier and an actual program is established.  For example, the definition for the check_command, "check_nfs_disk," is

define command{
        command_name    check_nfs_disk
        command_line    $USER5$/check_disk_nfs -w $ARG1$ -c $ARG2$ -p $ARG3$
        }

In the above example command definition for check_nfs_disk, the value for $USER5$ is /etc/nagios/homemade_plugins.

Let's look back at the check_command value for the "u05 disk" service: 

    check_nfs_disk!5!2!/nfs/farm/g/glast/u05

At run-time, the command-line translation would be: 

    /etc/nagios/homemade_plugins/check_disk_nfs -w 5 -c 2 -p /nfs/farm/g/glast/u05

$USERn$ macros

  • Are defined in /etc/nagios/resource.cfg and are used in command definitions in your host config files
  • Are useful for storing sensitive information such as usernames, passwords, etc.

$ARGn$ macros

  • Refer to parameter(s) passed to the check-command in a service definition
  • The nth argument passed to the command (notification, event handler, service check, etc.)
    • Nagios supports up to 32 argument macros ($ARG1$ through $ARG32$)
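
To make the macro substitution concrete, here is a minimal Python sketch of how a check_command value turns into a command line.  It only imitates what Nagios does for the two macro kinds discussed here; it is not Nagios code, and the function name expand_check_command is hypothetical.  The macro value and the command definition are copied from the examples above.

```python
import re

# Values copied from the examples above; a real setup would read them from
# /etc/nagios/resource.cfg and from the command definitions.
USER_MACROS = {"USER5": "/etc/nagios/homemade_plugins"}
COMMANDS = {
    "check_nfs_disk": "$USER5$/check_disk_nfs -w $ARG1$ -c $ARG2$ -p $ARG3$",
}


def expand_check_command(check_command, commands=COMMANDS,
                         user_macros=USER_MACROS):
    """Illustrate how a check_command value maps to a command line."""
    # The command name and its arguments are separated by "!".
    name, *args = check_command.split("!")
    macros = dict(user_macros)
    for n, value in enumerate(args, start=1):
        macros["ARG%d" % n] = value
    # Replace each $NAME$ with its value; unknown macros are left untouched.
    return re.sub(
        r"\$(\w+)\$",
        lambda m: macros.get(m.group(1), m.group(0)),
        commands[name],
    )


print(expand_check_command("check_nfs_disk!5!2!/nfs/farm/g/glast/u05"))
# prints: /etc/nagios/homemade_plugins/check_disk_nfs -w 5 -c 2 -p /nfs/farm/g/glast/u05
```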

host_name

This directive is used to specify the short name of the host that the service "runs" on or is associated with.

service_description

This directive is used to define the description of the service, which may contain spaces, dashes, and colons.

3.  Add service description to the appropriate servicegroup in servicegroups.cfg

For the sake of organization we try to arrange the various services into servicegroups, which can be found in /etc/nagios/servicegroups.cfg.  Since the "u05 disk space" service performs a disk space check on an NFS disk, the following member is added to the "Glast-NFS Space" service group: 

                         members                 sulky46.slac.stanford.edu,u05 disk space

For the most part, the context of the service check will give you a sense of what service group it should be associated with.  The file /etc/nagios/servicegroups.cfg contains all the service groups.

There is another way to add a service to a service group.  Instead of modifying the file, /etc/nagios/servicegroups.cfg, you can directly add the service to a service group from within the actual service definition by using the optional "servicegroups" directive.  For example:

   define service{
   use                              generic-service
   host_name                        sulky46.slac.stanford.edu
   service_description              u05 disk space
   servicegroups                    Glast-NFS Space
   is_volatile                      0
   check_period                     24x7
   max_check_attempts               4
   normal_check_interval            5
   retry_check_interval             1
   contact_groups                   core
   notification_options             w,u,c,r
   notification_interval            960
   notification_period              24x7
   check_command                    check_nfs_disk!5!2!/nfs/farm/g/glast/u05
   notifications_enabled            1
   }

4.   Check that the configuration is syntactically correct

On the command-line, enter

sudo /usr/bin/nagios -v /etc/nagios/nagios.cfg

5.   Test that the new service definition is correct

Nagios provides no explicit utility to test that a configuration is logically correct.  To prevent false-alarms, there are two techniques that I use to test a new configuration (feel free to invent better techniques, but these work well):

  1. Run the command on the command-line.
  2. Disable notifications for the service, reload Nagios, and view the results of the service check on the Nagios web interface.  Once the output on the Web Interface seems correct, then enable notifications for the service/host definition.

Note:  Keep in mind that in most cases, you will not know whether the output is actually correct, but in some cases it can be clear that there is a problem.  For example, let's say you wanted to test a service that checks for database connectivity.  You probably assumed that you did not need a password for the check, but after running the test on the command-line, the program reports an error about an invalid password.  As a result, you now know to contact someone to get the correct password for the database.

 

Testing from the command-line

Each check_command is associated with an actual program.  You can find the path to the actual program by examining a check_command's actual command-definition in /etc/nagios/commands.cfg.  Then just run it at the command-line to make sure there are no problems with the output.
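
When you run a plugin by hand, remember that Nagios judges the result by the plugin's exit status, not only by the text it prints.  The following Python sketch runs a plugin command line and reports the state by name.  The helper name run_plugin is hypothetical; the exit-status meanings (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) are the standard plugin convention.

```python
import subprocess

# Standard Nagios plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_plugin(argv, timeout=30):
    """Run a plugin command line and return (state_name, first_output_line)."""
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=timeout,
    )
    lines = result.stdout.splitlines()
    first_line = lines[0] if lines else ""
    # Any exit status outside 0-3 is reported as UNKNOWN in this sketch.
    return STATES.get(result.returncode, "UNKNOWN"), first_line


# Example (not run here), using the command line derived earlier:
# run_plugin(["/etc/nagios/homemade_plugins/check_disk_nfs",
#             "-w", "5", "-c", "2", "-p", "/nfs/farm/g/glast/u05"])
```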

Testing by disabling notifications

After the command-line test, a final check should be to set the service definition directive, "notifications_enabled," to 0, then reload Nagios.  For example:

define service{
   use                              generic-service
   host_name                        sulky46.slac.stanford.edu
   service_description              u05 disk space
   is_volatile                      0
   check_period                     24x7
   max_check_attempts               4
   normal_check_interval            5
   retry_check_interval             1
   contact_groups                   core
   notification_options             w,u,c,r
   notification_interval            960
   notification_period              24x7
   check_command                    check_nfs_disk!5!2!/nfs/farm/g/glast/u05
   notifications_enabled            0
   }

This way, Nagios will report the check results on the Nagios Web Interface (http://glastlnx02.slac.stanford.edu/nagios), but notifications will not be sent out.  Once the service check results appear correct, enable notifications for that service by setting the "notifications_enabled" directive to 1, then reload Nagios.

6.  Reload Nagios

On the command-line, enter

sudo /etc/init.d/nagios reload

Nagios 2.x Configuration Overview

Our Nagios configuration resides on the machine glastlnx02.slac.stanford.edu.

Object data is a generic term used to describe the various data definitions you need in order to monitor anything.  Object definitions are defined in a template format.  For the complete definition, syntax, and available directives for each kind of object definition, see the Nagios 2.x documentation (http://nagios.sourceforge.net/docs/2_0/xodtemplate.html).

The following list gives, for each type of object definition, the files/directories on glastlnx02 in which the object data is defined, along with some notes.

Service
  • Location (on glastlnx02):  /etc/nagios/hosts
  • Notes:  These definitions are normally contained in the host configuration files, named after the host with which the service is associated.  There is one exception:  when one service is associated with multiple hosts, I just put the service definition in a single configuration file with a name indicative of the service.  E.g., if the service "Disk Spaces" is associated with multiple hosts, then the file would be named something like /etc/nagios/hosts/disk-spaces-service.cfg.  The point is, just give the configuration file a meaningful name.

Service Groups
  • Location (on glastlnx02):  /etc/nagios/servicegroups.cfg
  • Notes:  Optionally, a service definition can register itself as a member of a servicegroup via the <servicegroups> directive within the service definition.

Hosts
  • Location (on glastlnx02):  /etc/nagios/hosts/
  • Notes:  A host definition is used to define a physical server, workstation, or device that resides on your network.  The host configuration file is named after the host.  For example, the host configuration file for glastlnx02.slac.stanford.edu is /etc/nagios/hosts/glastlnx02.cfg.

Host Groups
  • Location (on glastlnx02):  /etc/nagios/hostgroups.cfg
  • Notes:  Used to group one or more hosts together for display purposes in the CGIs.

Contacts
  • Location (on glastlnx02):  /etc/nagios/contacts.cfg
  • Notes:  Used to identify someone who should be contacted in the event of a problem on your network.

Contact Groups
  • Location (on glastlnx02):  /etc/nagios/contacts.cfg
  • Notes:  Used to group one or more contacts together for the purpose of sending out alert/recovery notifications.

Commands
  • Location (on glastlnx02):  /etc/nagios/command-plugins.cfg and /etc/nagios/commands.cfg
  • Notes:  These configuration files are well documented, so you can learn a great deal from just opening them and viewing their contents.

Time periods
  • Location (on glastlnx02):  /etc/nagios/time.cfg
  • Notes:  A list of times during various days that are considered to be "valid" times for notifications and service checks.  This configuration file is well documented, so you can learn a great deal from just viewing its contents.

Service Escalations
  • Location (on glastlnx02):  N/A
  • Notes:  Completely optional; used to escalate notifications for a particular service.  Not currently being used, but feel free to try it if you find it useful.

Service Dependencies
  • Location (on glastlnx02):  N/A
  • Notes:  An advanced feature of Nagios that allows you to suppress notifications and active checks of services based on the status of one or more other services.  Not currently being used, but feel free to try it if you find it useful.

Host Escalations
  • Location (on glastlnx02):  N/A
  • Notes:  Optional; used to escalate notifications for a particular host.  Not currently being used, but feel free to try it if you find it useful.

Host Dependencies
  • Location (on glastlnx02):  N/A
  • Notes:  An optional, advanced feature of Nagios that allows you to suppress notifications for hosts based on the status of one or more other hosts.

Extended Host Information
  • Location (on glastlnx02):  /etc/nagios/hosts/
  • Notes:  Used to make the output from the status, statusmap, statuswrl, and extinfo CGIs look neat.  These definitions have no effect on monitoring and are completely optional.  Store Extended Host Information in the corresponding host's configuration file.

Extended Service Information
  • Location (on glastlnx02):  /etc/nagios/hosts/
  • Notes:  Optional; used to make the output from the status and extinfo CGIs look neat.  This is rarely used, and when to use it is at your discretion.  If you decide to use Extended Service Information, add it to the configuration file that contains the corresponding service definition.

How Nagios Finds the Configuration Files

On the SLAC machine, glastlnx02, there is a main configuration file (/etc/nagios/nagios.cfg) where you specify the configuration files/directories that contain object definitions, by using the following two directives

  • cfg_file=<file_name>
  • cfg_dir=<directory_name>

So for example, in /etc/nagios/nagios.cfg, you will find

cfg_file=/etc/nagios/commands.cfg
cfg_file=/etc/nagios/template.cfg
cfg_dir=/etc/nagios/hosts/
cfg_file=/etc/nagios/hostgroups.cfg
cfg_file=/etc/nagios/servicegroups.cfg
cfg_file=/etc/nagios/contacts.cfg
cfg_file=/etc/nagios/time.cfg
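
If you ever need a quick inventory of where Nagios is being told to look, the two directives are easy to pick out of the main configuration file.  The following Python sketch does just that; the function name object_config_sources is hypothetical, and the sketch deliberately ignores every other directive in the file.

```python
def object_config_sources(main_cfg_text):
    """Return (files, dirs) named by cfg_file= and cfg_dir= directives."""
    files, dirs = [], []
    for raw in main_cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "cfg_file":
            files.append(value)
        elif key == "cfg_dir":
            dirs.append(value)
    return files, dirs


example = """# object configuration
cfg_file=/etc/nagios/commands.cfg
cfg_dir=/etc/nagios/hosts/
cfg_file=/etc/nagios/contacts.cfg
"""
files, dirs = object_config_sources(example)
print(files)  # ['/etc/nagios/commands.cfg', '/etc/nagios/contacts.cfg']
print(dirs)   # ['/etc/nagios/hosts/']
```

On glastlnx02 you would pass it the contents of /etc/nagios/nagios.cfg instead of the example string.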

Main Configuration File

The main configuration file is /etc/nagios/nagios.cfg.  Without it, Nagios is clueless.  The main configuration file is responsible for setting global options, such as the Host Check Timeout, which is the maximum number of seconds that Nagios will allow host checks to run, and the Timing Interval Length, which is the number of seconds per "unit interval" used for timing values such as check intervals and notification intervals.

For the most part, the main configuration file is already set up.  On rare occasions, you may want to make changes or enhancements.  The list of options available for the main configuration file can be found in the Nagios 2.x documentation.

Nagios Plugins

Plugins are the programs/scripts (Perl, shell, etc.) that can be run from a command line to check the status of a host or service.  Nagios is useless without plugins.  Nagios uses the results from plugins to determine the current status of hosts and services on your network.

Where are our Nagios plugins?

We have two kinds of plugins:

Standard
  • Location (on glastlnx02):  /usr/lib64/nagios/plugins
  • Description:  These plugins were already developed by external developers, and downloaded from http://sourceforge.net/projects/nagiosplug/

Custom
  • Location (on glastlnx02):  /etc/nagios/homemade_plugins
  • Description:  These are plugins that we/you develop.

To determine what a plugin does, just run the plugin on the command line with the "-h" option.  For example,

foo@glastlnx02 $ /usr/lib64/nagios/plugins/check_disk -h
check_disk v1848 (nagios-plugins 1.4.11)
Copyright (c) 1999 Ethan Galstad <nagios@nagios.org>
Copyright (c) 1999-2006 Nagios Plugin Development Team
        <nagiosplug-devel@lists.sourceforge.net>

This plugin checks the amount of used disk space on a mounted file system
and generates an alert if free space is less than one of the threshold values


Usage: check_disk -w limit -c limit [-W limit] [-K limit] {-p path | -x device}
[-C] [-E] [-e] [-g group ] [-k] [-l] [-M] [-m] [-R path ] [-r path ]
[-t timeout] [-u unit] [-v] [-X type]

Options:
 -h, --help
    Print detailed help screen
 -V, --version
    Print version information
 -w, --warning=INTEGER
    Exit with WARNING status if less than INTEGER units of disk are free
 -w, --warning=PERCENT%
    Exit with WARNING status if less than PERCENT of disk space is free
 -c, --critical=INTEGER
    Exit with CRITICAL status if less than INTEGER units of disk are free
 -c, --critical=PERCENT%
    Exit with CRITCAL status if less than PERCENT of disk space is free
 -W, --iwarning=PERCENT%
    Exit with WARNING status if less than PERCENT of inode space is free
 -K, --icritical=PERCENT%
    Exit with CRITICAL status if less than PERCENT of inode space is free
 -p, --path=PATH, --partition=PARTITION
    Path or partition (may be repeated)
 -x, --exclude_device=PATH <STRING>
    Ignore device (only works if -p unspecified)
 -C, --clear
    Clear thresholds
 -E, --exact-match
    For paths or partitions specified with -p, only check for exact paths
 -e, --errors-only
    Display only devices/mountpoints with errors
 -g, --group=NAME
    Group pathes. Thresholds apply to (free-)space of all partitions together
 -k, --kilobytes
    Same as '--units kB'
 -l, --local
    Only check local filesystems
 -L, --stat-remote-fs
    Only check local filesystems against thresholds. Yet call stat on remote filesystems
    to test if they are accessible (e.g. to detect Stale NFS Handles)
 -M, --mountpoint
    Display the mountpoint instead of the partition
 -m, --megabytes
    Same as '--units MB'
 -A, --all
    Explicitly select all pathes. This is equivalent to -R '.*'
 -R, --eregi-path=PATH, --eregi-partition=PARTITION
    Case insensitive regular expression for path/partition (may be repeated)
 -r, --ereg-path=PATH, --ereg-partition=PARTITION
    Regular expression for path or partition (may be repeated)
 -I, --ignore-eregi-path=PATH, --ignore-eregi-partition=PARTITION
    Regular expression to ignore selected path/partition (case insensitive) (may be repeated)
 -i, --ignore-ereg-path=PATH, --ignore-ereg-partition=PARTITION
    Regular expression to ignore selected path or partition (may be repeated)
 -t, --timeout=INTEGER
    Seconds before connection times out (default: 10)
 -u, --units=STRING
    Choose bytes, kB, MB, GB, TB (default: MB)
 -v, --verbose
    Show details for command-line debugging (Nagios may truncate output)
 -X, --exclude-type=TYPE
    Ignore all filesystems of indicated type (may be repeated)

Examples:
 check_disk -w 10% -c 5% -p /tmp -p /var -C -w 100000 -c 50000 -p /
    Checks /tmp and /var at 10% and 5%, and / at 100MB and 50MB
 check_disk -w 100M -c 50M -C -w 1000M -c 500M -g sidDATA -r '^/oracle/SID/data.*$'
    Checks all filesystems not matching -r at 100M and 50M. The fs matching the -r regex
    are grouped which means the freespace thresholds are applied to all disks together
 check_disk -w 100M -c 50M -C -w 1000M -c 500M -p /foo -C -w 5% -c 3% -p /bar
    Checks /foo for 1000M/500M and /bar for 5/3%. All remaining volumes use 100M/50M

Send email to nagios-users@lists.sourceforge.net if you have questions
regarding use of this software. To submit patches or suggest improvements,
send email to nagiosplug-devel@lists.sourceforge.net

In general, you develop a custom plugin only if you cannot find a pre-existing plugin that accomplishes the kind of check that you desire.

Creating a Custom Plugin

Creating a Nagios plugin is fairly simple.  There are three basic steps:

  1. Create a program, or script, that performs the check and conforms to the following guidelines:  http://nagiosplug.sourceforge.net/developer-guidelines.html
  2. Store the program in /etc/nagios/homemade_plugins
  3. Create a command definition for the custom plugin

Store the path to the program in a $USERx$ (see /etc/nagios/resource.cfg) variable.  I have arbitrarily chosen the $USER5$ variable to point to /etc/nagios/homemade_plugins.
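
To give a feel for step 1, here is a minimal sketch of what a custom plugin could look like in Python.  It is not one of our existing plugins: the script name check_free_space.py, the function names, and the output wording are all hypothetical.  What it does follow is the standard plugin convention of printing one status line and exiting with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN).

```python
import shutil
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(free_pct, warn_pct, crit_pct):
    """Map a free-space percentage to (exit_code, one-line status text)."""
    if free_pct < crit_pct:
        return CRITICAL, "DISK CRITICAL - %.1f%% free" % free_pct
    if free_pct < warn_pct:
        return WARNING, "DISK WARNING - %.1f%% free" % free_pct
    return OK, "DISK OK - %.1f%% free" % free_pct


def check_path(path, warn_pct, crit_pct):
    """Check free space on the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, "DISK UNKNOWN - %s" % exc
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - %s reports zero size" % path
    return evaluate(100.0 * usage.free / usage.total, warn_pct, crit_pct)


# Acts as a plugin only when invoked as: check_free_space.py WARN CRIT PATH
if __name__ == "__main__" and len(sys.argv) == 4:
    code, text = check_path(sys.argv[3], float(sys.argv[1]), float(sys.argv[2]))
    print(text)
    sys.exit(code)
```

A script like this would be stored in /etc/nagios/homemade_plugins and then wired up with a command definition that passes the two thresholds and the path as $ARG1$, $ARG2$, and $ARG3$, in the same way as the check_nfs_disk definition shown earlier.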

How do I obtain plugins?

You can find additional plugins at http://sourceforge.net/projects/nagiosplug/.  I recommend that you first check whether they are already available in /usr/lib64/nagios/plugins or /etc/nagios/homemade_plugins/.

How to make Nagios Use the Plugin?

All you need to do is create a command definition in /etc/nagios/commands.cfg.  You can find documentation for how to do so in the file /etc/nagios/commands.cfg itself.  The command definition associates a name with an actual executable.  Here is an example command definition:

 define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
        }

The macro $HOSTADDRESS$ refers to the address of the host for which the check command is run (for a service check, the host that the service is associated with).  The value of $USER1$ is /usr/lib64/nagios/plugins (see /etc/nagios/resource.cfg).  Here is an example host definition that uses the "check-host-alive" command:

define host{
         use                     generic-host            ; Name of host template to use
         host_name               glastlnx02.slac.stanford.edu
         alias                   glastlnx02
         address                 134.79.200.89
         check_command           check-host-alive
         max_check_attempts      10
         check_period            24x7
         notification_interval   120
         notification_period     24x7
         notification_options    d,r
         contact_groups          core
 }

NRPE

Some Nagios plugins are only intended to check local resources (disk space, CPU load, etc.).  For example, let's say you want to check the disk space usage for the /tmp directory on glastlnx05.  Well there is a Nagios plugin, check_disk, on glastlnx02, but here are the problems

  1. check_disk only checks local resources, so if you set the input parameter to "/tmp," then it will assume you are referring to the /tmp directory of the machine on which the command runs (glastlnx02)
  2. You could put a copy of the check_disk command on glastlnx05, but then the problem is getting the results from glastlnx05 (remote machine) to Nagios on glastlnx02

The NRPE addon is the facility that will allow you to execute Nagios plugins on remote Linux/Unix machines, and communicate the result to Nagios as if the plugin is being run locally. 

How to use NRPE on Machines at SLAC?

You do not need to worry about configuring the NRPE addon.  All you need to do is the following

  1. Email the Unix-admins at SLAC and request that the NRPE daemon be enabled for the "check_disk" command (for this example), on glastlnx05, to check the disk usage for /tmp
  2. Create the service definition, using the following check-command:  check_nrpe!check_disk

Notice that there are no thresholds supplied to the "check_disk" plugin when it is passed to the "check_nrpe" command.   When you email the Unix-admins, they will normally discuss with you the thresholds that will be used for the particular command, and the Unix-admins will set the thresholds on their end.

Tip:  Some Nagios plugins are used to check local resources, but it is difficult to tell by their names.  Command definitions (see /etc/nagios/commands.cfg) have been created to provide more meaningful command names for those Nagios plugins that check local resources (local to the machine that Nagios is running on, i.e., glastlnx02):

  • check_local_disk
  • check_local_load
  • check_local_procs
  • check_local_users

How to use NRPE on Remote Machines?

If the machines to be monitored are behind a firewall (e.g., on the Stanford campus) and the host running Nagios is outside of that firewall, this will be a challenge.  One solution is to poke a tiny hole in the firewall filter that allows Nagios to make calls to the NRPE daemon running on the machine behind the firewall.  We don't have any actual experience with this case yet, so the information is limited.  For more details, refer to "indirect host checking."

Notifications

The service check, and host check, logic makes the decision to send notifications out for a host, or service, check.  Nagios allows you to control when, and how, notifications get sent. 

Note:  One directive in the main configuration file, /etc/nagios/nagios.cfg, overrides the decision to send notifications for ALL hosts/services:  enable_notifications.  This option determines whether or not Nagios will send out notifications when Nagios (re)starts.  If enable_notifications=0, then Nagios will NOT send out any notifications, regardless of the results from the host/service checks.

Several filters must be passed before notifications are sent:  Program wide -> Service/Host -> Contact.  The details can be found in the section "Notifications" of the Nagios 2.x documentation.

Program wide filter

The <enable_notifications> option can be set in the main configuration file (/etc/nagios/nagios.cfg) to enable/disable notifications on a program-wide basis.  If notifications are disabled here, then no notifications are sent at all.  If notifications are enabled, then they must next pass the Service/Host filters.

Service and Host Filters

Several filters must be passed within the scope of Service and Host notifications

  1. scheduled downtime
    1. If service or host is in a scheduled downtime, no one gets notified
    2. Notifications for services are suppressed if the host they're associated with is in a period of scheduled downtime
  2. flapping (if you enabled flap detection)
    1. If the service, or host, is flapping, no notifications will be sent
  3. notification options
    1. the <notification_options> directive in the Host, and Service, definition specifies what kinds of notifications can be sent (e.g., CRITICAL, WARNING, etc.)
      1. If you exclude any notification type, then Nagios will send no alerts for that type
  4. Time period test
    1. Each host, and service, definition has a <notification_period> directive that specifies which time period contains valid notification times for the host or service
    2. If the time at which a notification is about to be sent falls outside the <notification_period>, NO notifications are sent
    3. If the time at which a notification is about to be sent falls within the <notification_period>, notifications are sent

Contact Filter

After the first set of filters (program wide, service & host) are passed, each contact has its own filter that a notification must pass through.   Within a contact definition, you can separately specify the kinds of notifications that you can receive from hosts, or services, via the following directives

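  • <service_notification_options>
    • possible values:  w (WARNING), u (UNKNOWN), c (CRITICAL), r (recovery), or n (none)
  • <host_notification_options>
    • possible values:  d (DOWN), u (UNREACHABLE), r (recovery), or n (none)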

Notifications for a host, or service, are sent if the notification type is enabled by the corresponding directives listed above. 
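The three filter stages described above can be sketched in a few lines of Python.  This is only an illustration of the order in which the filters apply, not Nagios's actual implementation; the Check class, its field names, and passes_filters are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    """Hypothetical snapshot of a host or service at notification time."""
    state: str                         # one-letter type, e.g. w, u, c, r, d
    in_downtime: bool = False          # host/service in scheduled downtime
    host_in_downtime: bool = False     # for services: the associated host
    flapping: bool = False
    notification_options: set = field(default_factory=lambda: {"w", "u", "c", "r"})
    in_notification_period: bool = True


def passes_filters(enable_notifications, check, contact_options):
    """Return True only if every filter stage lets the notification through."""
    # 1. Program-wide filter (enable_notifications in nagios.cfg)
    if not enable_notifications:
        return False
    # 2. Service/host filters
    if check.in_downtime or check.host_in_downtime:
        return False
    if check.flapping:
        return False
    if check.state not in check.notification_options:
        return False
    if not check.in_notification_period:
        return False
    # 3. Contact filter (service_/host_notification_options of the contact)
    return check.state in contact_options


# A contact with options w,u,r receives WARNING but not CRITICAL alerts.
print(passes_filters(True, Check("w"), {"w", "u", "r"}))
print(passes_filters(True, Check("c"), {"w", "u", "r"}))
```

Note that the contact filter is evaluated last, so a contact never sees a notification that was already stopped by the program-wide or service/host filters.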

What kind of Notifications are available?

In general, anything you can do from a command-line can be tailored to send notifications.  Here is a list of possible notification methods

  • Email
  • ICQ
  • Phone (SMS)

(plus) Email is our primary notification method.   A notification method is implemented by a command definition.    Inside a contact definition, you specify the kind of host/service notification method via the following two directives

  1. service_notification_commands
  2. host_notification_commands

For example, the contact definition for Emmanuel Cephas is ...

 define contact{
        contact_name                    ecephas
        alias                           Emmanuel Cephas
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,r
        host_notification_options       n
        service_notification_commands   notify-by-email
        host_notification_commands      host-notify-by-email
        email                           ecephas@slac.stanford.edu
}

I will explain the other directives in the next section (Contact Definitions).  You can find the command definitions for various notifications (epager, qpage, an SMS client, etc.) in /etc/nagios/command-plugins.cfg. 

When do Notifications get sent?

Host and service notifications get sent out in the following situations

  • A Hard State change occurs
  • The host or service remains in a Hard Non-OK State and the <notification_interval> (a service/host definition directive) time has elapsed since the last notification was sent

    (lightbulb) Tip:  If you want to prevent notifications from getting sent out more than once for any given problem, then set the <notification_interval> value to 0.

The host, and service, definition directive, notification_interval, is used to define the number of "time units" to wait before re-notifying a contact that the host or service is still in a non-OK state.  The default value of interval_length (a directive in the main configuration file) is 60 seconds, so the integer value of notification_interval is interpreted as a multiple of 60 seconds (that is, as minutes).  So in the following example, the "core" contact group will be notified every 16 hours (960 * 60 seconds) in the event that the "u05 disk space" service is still in a HARD Non-OK state:

 define service{
   use                              generic-service
   host_name                        sulky46.slac.stanford.edu
   service_description              u05 disk space
   is_volatile                      0
   check_period                     24x7
   max_check_attempts               4
   normal_check_interval            5
   retry_check_interval             1
   contact_groups                   core
   notification_options             w,u,c,r
   notification_interval            960
   notification_period              24x7
   check_command                    check_nfs_disk!5!2!/nfs/farm/g/glast/u05
   notifications_enabled            1    ; 1 = notifications enabled (0 would suppress them for this service)
   }
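The arithmetic behind notification_interval can be checked with a short Python sketch.  The function name renotify_seconds is hypothetical, and the sketch assumes interval_length is left at its default of 60.

```python
def renotify_seconds(notification_interval, interval_length=60):
    """Seconds between repeat notifications, or None when re-notification is off."""
    if notification_interval == 0:
        # A value of 0 means the problem is only notified once.
        return None
    return notification_interval * interval_length


seconds = renotify_seconds(960)
print(seconds, "seconds =", seconds / 3600, "hours")  # 57600 seconds = 16.0 hours
```

The same calculation shows that a notification_interval of 120 corresponds to 2 hours.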

State Types

It is important to understand the state types (hard and soft) because Nagios uses the state type to determine if, and how, to respond to a check result.   I am going to provide a brief overview.  For more details, see section "State Types," in the Nagios 2.x documentation http://nagios.sourceforge.net/docs/2_0/statetypes.html.  The current state of a host or service is determined by two components

  1. Status of the service or host check (e.g., UP, DOWN, CRITICAL, etc.)
  2. Type of state:  hard or soft

(tick) The <max_check_attempts> directive in a service, or host, definition is the crucial factor that determines a state's type (hard or soft).  Once the service, or host, has been checked <max_check_attempts> times and is still in a non-OK state (e.g., CRITICAL, UNKNOWN, DOWN, WARNING), the state becomes Hard.
(info) See the Nagios 2.x documentation, section "Template Based Object Configuration" http://nagios.sourceforge.net/docs/2_0/xodtemplate.html#service for a description of <max_check_attempts>.

For example, whether notifications are sent, or resent, depends on the state type.  Occasionally the documentation will tell you that Nagios only performs some particular operation if the service, or host, enters a specific state type (HARD or SOFT).  The different state types, and related terms, are ...

  • HARD state
  • HARD state change
  • HARD state event
  • SOFT state
  • SOFT state event

Soft States

Soft states occur when a service or host check

  • Results in a non-OK state and it has not yet been (re)checked the number of times specified by the <max_check_attempts> option in the service or host definition
  • Recovers from a soft state (called a soft recovery)

    Soft state events (What you need to know about Soft states)

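    • Logging
      •  The soft state is logged only if you enabled the log_service_retries or log_host_retries option in the main configuration file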
    • Event handlers
      •  Executed (if you defined any) to handle the soft error or recovery for the service or host
    • Notifications are NOT sent

Hard states

Hard states occur in the following situations

  • When a host check
    • Results in a non-OK state and it has been (re)checked the number of times specified by the <max_check_attempts> option in the host definition (this is a HARD error)
    • Recovers from a hard error state (a Hard Recovery)
  • When a service check
    • Results in a non-OK state and it has been (re)checked the number of times specified by the <max_check_attempts> option in the service definition (this is a HARD error)
    • Recovers from a hard error state (a Hard Recovery)
    • Results in a non-OK state and its corresponding host is DOWN or UNREACHABLE

Hard state change

Hard state changes occur when a service or host changes from a

  • hard OK state to a hard non-OK state
  • hard non-OK state to a hard OK state
  • hard non-OK state of some kind to a hard non-OK state of another kind (e.g., from a hard WARNING state to a hard UNKNOWN state)

    Hard state events (What you need to know about Hard states)

    • Logging
      •  The hard state is logged
    • Event handlers
      •  Executed (if you defined any) to handle the hard error or recovery for the service or host
    • Notifications are sent (if permitted by the notification logic)
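The relationship between <max_check_attempts> and the state type can be illustrated with a small Python sketch.  It is a simplified model, not Nagios's own logic: the function name classify is hypothetical, and the sketch ignores changes between different non-OK states and the case where a service's host is DOWN or UNREACHABLE.

```python
def classify(results, max_check_attempts):
    """Label each consecutive check result as SOFT or HARD (simplified model)."""
    attempt = 0               # consecutive non-OK results seen so far
    problem_is_hard = False   # True once the problem has reached a hard state
    labelled = []
    for status in results:
        if status != "OK":
            attempt += 1
            if attempt >= max_check_attempts:
                problem_is_hard = True
            state_type = "HARD" if problem_is_hard else "SOFT"
        else:
            # Recovering before the problem became hard is a soft recovery.
            if attempt > 0 and not problem_is_hard:
                state_type = "SOFT"
            else:
                state_type = "HARD"
            attempt = 0
            problem_is_hard = False
        labelled.append((status, state_type))
    return labelled


# With max_check_attempts = 4, the fourth consecutive CRITICAL result is HARD.
for status, state_type in classify(["OK", "CRITICAL", "CRITICAL", "CRITICAL", "CRITICAL", "OK"], 4):
    print(status, state_type)
```

In this model only the results labelled HARD would be eligible for notifications; the SOFT ones would at most trigger logging and event handlers.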

Contacts

In order for a person to receive any kind of notification, you need to create a contact definition for that person.  Nagios 2.x allows you to specify a contact group to receive notifications for a particular host, or service.  Each contact group is made up of one or more contacts.

Documentation about how to create a contact definition can be found in the Nagios 2.x documentation, section "Template-Based Object Configuration" (see http://nagios.sourceforge.net/docs/2_0/xodtemplate.html).

What you need to know about our setup

  • We place our contact, and contact group, definitions in the file /etc/nagios/contacts.cfg
  • You can view the current contacts, and contact groups, in two places
    1. The file /etc/nagios/contacts.cfg
    2. The web interface.
      1. In the left pane under the "Configuration" section, click "View Config"
      2. Now select "Contacts" in the drop-down menu in the right pane

(lightbulb) The Nagios mailing list (nagios@glast.stanford.edu)

  • What is it?
    • A mailing list that a person can sign up for to receive specific notifications (down, critical, recovery) about specific hosts and services
  • Which hosts and services?
    • The "Ping" service on the host Goddard (GSFC web server)
    • All other non-ISOC hosts/services
    • To see the list of hosts and services, see Web Interface, "Hostgroup Overview," or "Service Overview"

Contact Groups

A contact group is used to group together multiple contacts into a single identifier for the purpose of sending notifications.  For more details, see http://nagios.sourceforge.net/docs/2_0/xodtemplate.html#contactgroup.  To create a contact group, you need to create a contact group definition.

Where are our contact group definitions located?

Both contact definitions and contact group definitions are located on the computer, glastlnx02, in the file /etc/nagios/contacts.cfg. 

How to view our contact groups?

You can view our contact groups in two ways

  1. View the contents of /etc/nagios/contacts.cfg
  2. View from the Web Interface
    1. In the left pane under the "Configuration" section, click "View Config"
    2. Now select "Contact Groups" in the drop-down menu in the right pane

The Web Interface is better suited for viewing the details about contact groups, because the associated configuration information (e.g., Service Notification Period, Service Notification Command, etc.) is collected in one place.

Our Current Setup (as of July 7, 2009)

The following table shows which contact groups get notified about which hosts and services.  You can find out the members of a contact group by following the directions from the previous section ("How to view contact groups").

| contact group(s) | Host(s) / Service(s) Notifications | Notes |
|------------------|------------------------------------|-------|
| core | All non-ISOC hosts and services (see Web Interface, "Hostgroup Overview," or "Service Overview") and the "Ping" service for the "Goddard" host. | |
| flightops | ISOC hosts and services (see Web Interface, "Hostgroup Overview," or "Service Overview" for list of ISOC hosts and services) | |
| systests | u17 disk space on sulky27 | The core group also watches u17 disk space. |
| *core_no_nagios,critical_core | ALL Ping services EXCEPT the Ping service for the host, "Goddard." | *This combination of contact groups satisfies the logic required for the core group (with the exception of the Nagios mailing list) to receive ALL notification types (WARNING, UNKNOWN, CRITICAL, etc.) |
| warren_focke | hosts:  sulky27 and sulky39 | notifications for warning/unknown states for u15, u52. |

How to create a contact group definition?

A contact group definition is just another kind of Nagios object definition.  Documentation for creating contact group definitions can be found here: http://nagios.sourceforge.net/docs/2_0/xodtemplate.html.

(plus) Our convention is to keep the contact, and contact group, definitions in the file /etc/nagios/contacts.cfg on the computer, glastlnx02.

For example, the following contact group definition specifies the "core" contact group

 define contactgroup{
        contactgroup_name       core
        alias                   Core Group
        members                 nagios, Navid, Richard, Heather, Karen, Tom
}

How to use the contact group definition?

Now, if you want to send notifications about a host, or service, to the members of a contact group, set the contact_groups directive to the name of that group (here, "core") in the host or service definition.  For example...

 define service{
         use                             generic-service         ; Name of service template to use
         host_name                       foo.slac.stanford.edu
         service_description             Ping
         is_volatile                     0
         check_period                    24x7
         max_check_attempts              4
         normal_check_interval           5
         retry_check_interval            1
         contact_groups                  core
         notification_options            w,u,c,r
         notification_interval           960
         notification_period             24x7
         check_command                   check_ping!100.0,20%!500.0,60%
 }

Documentation for creating any kind of Nagios object definition can be found in the Nagios 2.x documentation, section "Template-Based Object Configuration" (see http://nagios.sourceforge.net/docs/2_0/xodtemplate.html).

Special contact groups for our Nagios setup

You can only tell Nagios to notify a contact "group" about a particular service, or host, event.  In this version of Nagios (Nagios 2.x), you cannot directly specify a single contact to be notified.  For example, in the "Ping" service definition above, suppose that instead of notifying the entire "core" contact group, we only want to notify a single member (e.g., Tom).  There is no directive available in Nagios 2.x to specify a single contact (though this is possible in Nagios 3.x), so use the following workaround.

(info)   If you want a host or service notification to go to a single contact, just create a contact group whose member is that single contact.  Then just specify that contact group.

So, in this example, since we want to send notifications to only Tom, then here is what we do...  Create a contact group whose only member is Tom...

 define contactgroup{
        contactgroup_name       Tom_group
        alias                   Group for Tom
        members                 Tom
}

Now we change the value for the <contact_groups> directive in the "Ping" service... 

 define service{
         use                             generic-service         ; Name of service template to use
         host_name                       foo.slac.stanford.edu
         service_description             Ping
         is_volatile                     0
         check_period                    24x7
         max_check_attempts              4
         normal_check_interval           5
         retry_check_interval            1
         contact_groups                  Tom_group
         notification_options            w,u,c,r
         notification_interval           960
         notification_period             24x7
         check_command                   check_ping!100.0,20%!500.0,60%
 }

Now the above service definition will send notifications about the "Ping" service to Tom. 

Adding A New Host to the Configuration

What to name the file?

Our convention is to name the file after the name of the host (e.g., for the host glastlnx02.slac.stanford.edu, name the file /etc/nagios/hosts/glastlnx02.cfg).
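As a quick illustration of this naming convention, the following Python sketch derives the configuration file path from a host's domain name.  The function name host_config_path is hypothetical; the directory /etc/nagios/hosts is taken from the example above.

```python
def host_config_path(host_address, config_dir="/etc/nagios/hosts"):
    """Return the configuration file path for a host, named after its short name."""
    short_name = host_address.split(".")[0]   # e.g. "glastlnx02"
    return f"{config_dir}/{short_name}.cfg"


print(host_config_path("glastlnx02.slac.stanford.edu"))  # /etc/nagios/hosts/glastlnx02.cfg
```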

What standard host, and service, definitions should be added to the configuration file?

If you are adding a new host to the Nagios configuration, our convention is to add two standard service (Ping & SSH) definitions, unless glastlnx02 legitimately cannot Ping, or make an SSH connection, to the desired host.

(warning) Before you add the standard Ping, and SSH, service definitions for the host, verify that glastlnx02 can Ping, and make an SSH connection to, the host by testing the check command on the command-line.  If the command returns a non-OK state, then there might be a legitimate problem.  If this is the case, check with the SLAC admins to see if the host intentionally blocks SSH, or Ping, connections.

  • If the host does not allow Ping, or SSH, connections, then do not create the corresponding service definition
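One way to test the check commands by hand is to run the plugins directly and read their exit codes.  The Python sketch below assumes the standard check_ping and check_ssh plugins live in /usr/lib64/nagios/plugins (the plugin directory listed in the prerequisites) and that the two check_ping arguments map to the plugin's -w and -c options; our actual command definitions may differ, so treat this as an illustration only.  The function names are hypothetical.

```python
import subprocess

PLUGIN_DIR = "/usr/lib64/nagios/plugins"   # assumed location of the plugins
# Standard plugin exit codes and the states they represent
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def ping_command(host, warn="100.0,20%", crit="500.0,60%"):
    """Build the check_ping command line (thresholds match check_ping!100.0,20%!500.0,60%)."""
    return [f"{PLUGIN_DIR}/check_ping", "-H", host, "-w", warn, "-c", crit]


def ssh_command(host):
    """Build the check_ssh command line."""
    return [f"{PLUGIN_DIR}/check_ssh", "-H", host]


def run_check(command):
    """Run a plugin and return (state name, first line of plugin output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return STATES.get(result.returncode, "UNKNOWN"), result.stdout.strip()
```

On glastlnx02, calling run_check(ping_command("foo.slac.stanford.edu")) would then return the state name together with the plugin's output line; anything other than "OK" is a reason to check with the SLAC admins before adding the service definition.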

Add the following host definition:

 define host{
          use                     generic-host            ; Name of host template to use
          host_name               <host address>
          alias                   <host name>
          address                 <host IP address>
          check_command           check-host-alive
          max_check_attempts      10
          check_period            24x7
          notification_interval   120
          notification_period     24x7
          notification_options    d,r
          contact_groups          core
  }

<host address> is the domain name of the host (e.g., foo.slac.stanford.edu).
<host name> is just the name of the host (e.g., foo).
<host IP address> is the IP address of the host.

If the host is Pingable, then create a service definition for Ping

  define service{
          use                             generic-service         ; Name of service template to use
          host_name                       <host address>
          service_description             Ping
          is_volatile                     0
          check_period                    24x7
          max_check_attempts              4
          normal_check_interval           5
          retry_check_interval            1
          contact_groups                  core_no_nagios,critical_core
          notification_options            w,u,c,r
          notification_interval           960
          notification_period             24x7
          check_command                   check_ping!100.0,20%!500.0,60%
  }

If the host allows an SSH connection, create a service definition for SSH

  define service{
          use                             generic-service         ; Name of service template to use
          host_name                       <host address>
          service_description             SSH
          is_volatile                     0
          check_period                    24x7
          max_check_attempts              4
          normal_check_interval           5
          retry_check_interval            1
          contact_groups                  core
          notification_options            w,u,c,r
          notification_interval           960
          notification_period             24x7
          check_command                   check_ssh
  }

Adding A New Service to the Configuration

Where to place the service definition?

Our convention is to place the service definition in the host configuration file named after the host with which the service is associated.  So for example, if you want to check the Ping service on the computer named foo, then

  1. Create a host configuration file (if it does not already exist) named after the host (see the previous section, "Adding A New Host to the Configuration")
  2. Then append the service definition to that file

What additional service definition options need to be set?

Most of the service definition options (AKA directives) stay the same from service to service.  Unless a request specifies otherwise, keep the options, and associated values, marked "default" below as they are, and fill in the remaining four...

  define service{
	use                             generic-service    ;default.  Name of service template to use
	is_volatile                     0      	;default
	check_period                    24x7	;default
	max_check_attempts              4	;default
	normal_check_interval           5	;default
	retry_check_interval            1	;default
	notification_options            w,u,c,r	;default
	notification_interval           960 	;default
	notification_period             24x7	;default
	contact_groups                  <contact-group>
	host_name                       <host address>
	service_description             <service name>
	check_command                   <check-command>
}
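Because only four directives change from service to service, a definition can be generated from the defaults.  The following Python sketch shows one way to do this; the names SERVICE_DEFAULTS and render_service are hypothetical, and the output should still be reviewed before it is appended to a host configuration file.

```python
# Default directives from the template above, in the same order.
SERVICE_DEFAULTS = [
    ("use", "generic-service"),
    ("is_volatile", "0"),
    ("check_period", "24x7"),
    ("max_check_attempts", "4"),
    ("normal_check_interval", "5"),
    ("retry_check_interval", "1"),
    ("notification_options", "w,u,c,r"),
    ("notification_interval", "960"),
    ("notification_period", "24x7"),
]


def render_service(host_address, service_name, check_command, contact_group):
    """Return the text of a service definition using the default directives."""
    directives = SERVICE_DEFAULTS + [
        ("contact_groups", contact_group),
        ("host_name", host_address),
        ("service_description", service_name),
        ("check_command", check_command),
    ]
    lines = ["define service{"]
    lines += [f"        {name:<32}{value}" for name, value in directives]
    lines.append("}")
    return "\n".join(lines)


print(render_service("foo.slac.stanford.edu", "SSH", "check_ssh", "core"))
```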

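More information about service definitions can be found in the Nagios 2.x documentation, section "Template-Based Object Configuration" (see http://nagios.sourceforge.net/docs/2_0/xodtemplate.html).  Time-saving tricks for object definitions are also described in the Nagios 2.x documentation (these are very handy).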

How to back up the Configuration?

Use CVS to back up the configuration; just check in the changes that you have made to the configuration.  You may want to do a cvs diff before checking in.

Example:

 cvs diff sulky45.cfg
 cvs ci sulky45.cfg

If you are unfamiliar with CVS, there is plenty of documentation available online. 

(minus) No executables should be backed up.  Also, we do not back up the directory /etc/nagios/plugins (because we can download them again if we need them) or /etc/nagios/disabled-hosts.
