This documentation is intended to show someone new to Nagios how to maintain it. It is only a starting guide: it is not exhaustive and focuses on the primary features of Nagios. It outlines the general procedure for completing Nagios-related tasks.
Nagios has extensive documentation available online (we are using Nagios 2.x): http://support.nagios.com/knowledge-base/official-documentation.
Nagios is a network and system monitoring application. Essentially, Nagios watches specified hosts and services, then alerts you when something goes wrong. We are currently using Nagios 2.x.
The Fermi Gamma-ray Telescope mission relies heavily on computers and network services for tasks such as Science Analysis Software development, ground processing and data handling, and data storage. The Fermi team depends heavily on machines and services at SLAC, Goddard Space Flight Center, and other participating institutions. Therefore, team members need to know immediately whenever equipment, services, or hosts have problems. Nagios 2.x is the open source host and service monitoring software that we use to monitor hosts and services and to alert team members about specified host and service events.
* To request "sudo" access, first get permission/confirmation from Tony Johnson, then send an email request to the unix admin at SLAC.
In general, completing a Nagios related task can be summarized in the following three steps
In general, your requests will come through either of the following
In general, this requires modifying a Nagios configuration file or adding a new one. Instructions on how to do this depend on the task. Most tasks result in just creating a new object definition, which gives Nagios configuration information such as what to monitor, how often to check the service, and who to notify.
After changes to the configuration files are made, first verify the configuration files, then restart Nagios via command-line.
1. Verify configuration
sudo /usr/bin/nagios -v /etc/nagios/nagios.cfg
Resolve errors, if any, then proceed to reload Nagios.
2. Reload Nagios
sudo /etc/init.d/nagios reload or sudo /sbin/service nagios restart
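The verify-then-reload sequence above can be wrapped in a small helper so that a reload never happens after a failed check. This is only a sketch: the function name and the injectable `run` parameter are illustrative, and it assumes that `nagios -v` exits with a non-zero status when it finds configuration errors.

```python
import subprocess

# Commands taken from the two steps above; adjust if your paths differ.
VERIFY_CMD = ["sudo", "/usr/bin/nagios", "-v", "/etc/nagios/nagios.cfg"]
RELOAD_CMD = ["sudo", "/etc/init.d/nagios", "reload"]


def verify_then_reload(run=subprocess.call):
    """Reload Nagios only if the configuration check exits with status 0.

    `run` is any callable that takes an argument list and returns an exit
    status. It is injectable (hypothetical design choice) so the logic can
    be exercised without sudo access.
    """
    if run(VERIFY_CMD) != 0:
        return False  # verification failed: resolve the errors, do not reload
    return run(RELOAD_CMD) == 0
```

Calling `verify_then_reload()` with no arguments runs the real commands, so it needs the same "sudo" access as the manual steps.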
At the very basic level...
An example scenario...
Someone makes a request (maybe through JIRA, or ICQ, or even email)...
Dear Nagios-admin:
Please monitor the u05 nfs disk space.
Now the above request will likely seem vague to the beginner, but this is natural. With experience, you will learn to read minds, but please do seek clarification when needed. Now there are probably many questions that you have from the above request, but the key questions are
Since u05 is an nfs disk (as inferred from the request), let's find its host via the command line (note that the path to the nfs disks is /nfs/farm/g/glast/):
    user@glastlnx02$ df /nfs/farm/g/glast/u05
    Filesystem            1K-blocks      Used Available Use% Mounted on
    sulky46:/g.glast.u05  196213184 182682240  11568800  95% /a/sulky46/g.glast.u05
Under "Filesystem," we find the host name, sulky46.
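If you look up hosts often, the same lookup can be scripted. The following is a minimal sketch, assuming the usual NFS form "host:/export" in the first column of the first data line; the function name is made up for illustration.

```python
def nfs_host_from_df(df_output):
    """Return the server name from the 'Filesystem' column of `df` output.

    Assumes the usual NFS form 'host:/export' on the first data line.
    """
    lines = [line for line in df_output.splitlines() if line.strip()]
    filesystem = lines[1].split()[0]  # first column of the first data row
    host, sep, _export = filesystem.partition(":")
    if not sep:
        raise ValueError("not an NFS mount: %r" % filesystem)
    return host


sample = """Filesystem 1K-blocks Used Available Use% Mounted on
sulky46:/g.glast.u05 196213184 182682240 11568800 95% /a/sulky46/g.glast.u05"""
print(nfs_host_from_df(sample))  # prints "sulky46"
```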
Q: So what do we do with the information about the host name?
A: Our convention is to put a service definition in a Nagios configuration file named after the host that the service belongs to. For example, if the u05 disk resides on sulky46, then we create (if it does not already exist) a file named sulky46.cfg, and place the service definition for the u05 disk space checks in that file.
What's the business about "service definitions?" I will explain the details later, but the objective here is to give you the big picture. Here are the steps to complete the task
Let's go through each step together...
1. Check to see if there exists a Nagios configuration file for the host that u05 resides on
Fortunately, we already have a Nagios configuration file for the host, sulky46: /etc/nagios/hosts/hosts/nfs_servers/sulky46.cfg.
2. Add a service definition to sulky46.cfg
Here is the service definition that checks the u05 disk space, issues a WARNING alert if less than 5% of the disk space is free, and issues a CRITICAL alert if less than 2% is free:
    define service{
        use                     generic-service
        host_name               sulky46.slac.stanford.edu
        service_description     u05 disk space
        is_volatile             0
        check_period            24x7
        max_check_attempts      4
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          core
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_nfs_disk!5!2!/nfs/farm/g/glast/u05
        notifications_enabled   1
        }
The left column contains the directives, and the right column contains their values. The values of the main directives (host_name, service_description, contact_groups, and check_command) generally vary from one service definition to the next, depending on the host and the service you are defining. The values of the remaining directives generally stay the same in every service definition unless someone requests something different.
For a detailed description of the directives and what they mean, see the Nagios 2.x documentation, section "Template-Based Object Configuration," here: http://nagios.sourceforge.net/docs/2_0/xodtemplate.html. I will describe the main directives below.
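To make the split between varying and fixed directives concrete, here is a small sketch that generates the text of a service definition from the four values that usually change. The function name and the `COMMON_DIRECTIVES` list are illustrative and not part of Nagios; the fixed values are simply copied from the "u05 disk space" example.

```python
# Directives that usually stay the same across our service definitions
# (values copied from the "u05 disk space" example).
COMMON_DIRECTIVES = [
    ("is_volatile", "0"),
    ("check_period", "24x7"),
    ("max_check_attempts", "4"),
    ("normal_check_interval", "5"),
    ("retry_check_interval", "1"),
    ("notification_options", "w,u,c,r"),
    ("notification_interval", "960"),
    ("notification_period", "24x7"),
    ("notifications_enabled", "1"),
]


def render_service(host_name, service_description, contact_groups, check_command):
    """Return the text of a service definition; only four values vary."""
    directives = [
        ("use", "generic-service"),
        ("host_name", host_name),
        ("service_description", service_description),
        ("contact_groups", contact_groups),
        ("check_command", check_command),
    ] + COMMON_DIRECTIVES
    lines = ["define service{"]
    lines += ["    %-24s%s" % (name, value) for name, value in directives]
    lines.append("    }")
    return "\n".join(lines)


print(render_service("sulky46.slac.stanford.edu", "u05 disk space", "core",
                     "check_nfs_disk!5!2!/nfs/farm/g/glast/u05"))
```

The printed text can be pasted into the host's configuration file and then verified with the usual `nagios -v` check.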
contact_groups
This directive is used to specify the contact groups to be notified about this service. Here the contact_groups value is set to "core." This means that members of the "core" contact group will receive service alerts for this particular service, "u05 disk space." All contacts and contact groups are defined in /etc/nagios/contacts.cfg. The format of contact and contact group definitions is generally the same as for all other object definitions, but the available directives depend on the kind of object you are defining (e.g., service group, host, host group, command).
In the file /etc/nagios/contacts.cfg, we can see the definition of the "core" contact group:
    define contactgroup{
        contactgroup_name       core
        alias                   Core Group
        members                 nagios, Navid, Richard, Heather, Karen, Tom, ecephas
        }
The above members will all receive notifications about this service, "u05 disk space."
check_command
This directive is the most important directive, and without it, a service definition is useless. This directive specifies the command to be used to perform the actual service check. The parameters for the command are separated by the "!" character.
The value of a check_command is an identifier associated with an actual program. In this case, the identifier is "check_nfs_disk." A check_command is created by making a "command_definition," in which the association between an identifier and an actual program is established. For example, the definition for the check_command, "check_nfs_disk," is
    define command{
        command_name    check_nfs_disk
        command_line    $USER5$/check_disk_nfs -w $ARG1$ -c $ARG2$ -p $ARG3$
        }
In the above example command definition for check_nfs_disk, the value for $USER5$ is /etc/nagios/homemade_plugins.
Let's look back at the check_command value for the "u05 disk space" service:
check_nfs_disk!5!2!/nfs/farm/g/glast/u05
At run-time, the command-line translation would be:
/etc/nagios/homemade_plugins/check_disk_nfs -w 5 -c 2 -p /nfs/farm/g/glast/u05
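The translation above can be reproduced with a few lines of code. This is only an illustration of the substitution, not how Nagios implements it internally; the function name is made up, and only $ARGn$ and $USERn$ macros are handled.

```python
def expand_check_command(check_command, command_line, user_macros):
    """Expand $ARGn$ and $USERn$ macros the way the text above describes.

    check_command: value from the service definition, e.g. "name!a!b"
    command_line:  command_line from the matching command definition
    user_macros:   mapping such as {"$USER5$": "/etc/nagios/homemade_plugins"}
    """
    name, *args = check_command.split("!")  # "!" separates the parameters
    expanded = command_line
    for position, arg in enumerate(args, start=1):
        expanded = expanded.replace("$ARG%d$" % position, arg)
    for macro, value in user_macros.items():
        expanded = expanded.replace(macro, value)
    return name, expanded


name, line = expand_check_command(
    "check_nfs_disk!5!2!/nfs/farm/g/glast/u05",
    "$USER5$/check_disk_nfs -w $ARG1$ -c $ARG2$ -p $ARG3$",
    {"$USER5$": "/etc/nagios/homemade_plugins"},
)
print(line)
# /etc/nagios/homemade_plugins/check_disk_nfs -w 5 -c 2 -p /nfs/farm/g/glast/u05
```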
$USERn$ macros
$ARGn$ macros
host_name
This directive is used to specify the short name of the host that the service "runs" on or is associated with.
service_description
This directive is used to define the description of the service, which may contain spaces, dashes, and colons.
3. Add service description to the appropriate servicegroup in servicegroups.cfg
For the sake of organization we try to arrange the various services into servicegroups, which can be found in /etc/nagios/servicegroups.cfg. Since the "u05 disk space" service performs a disk space check on an NFS disk, the following member is added to the "Glast-NFS Space" service group:
members sulky46.slac.stanford.edu,u05 disk space
For the most part, the context of the service check will give you a sense of which service group it should be associated with. The file /etc/nagios/servicegroups.cfg contains all the service groups.
There is another way to add a service to a service group. Instead of modifying the file, /etc/nagios/servicegroups.cfg, you can directly add the service to a service group from within the actual service definition by using the optional "servicegroups" directive. For example:
    define service{
        use                     generic-service
        host_name               sulky46.slac.stanford.edu
        service_description     u05 disk space
        servicegroups           Glast-NFS Space
        is_volatile             0
        check_period            24x7
        max_check_attempts      4
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          core
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_nfs_disk!5!2!/nfs/farm/g/glast/u05
        notifications_enabled   1
        }
4. Check that the configuration is syntactically correct
On the command-line, enter
sudo /usr/bin/nagios -v /etc/nagios/nagios.cfg
5. Test that the new service definition is correct
Nagios provides no explicit utility to test that a configuration is logically correct. To prevent false-alarms, there are two techniques that I use to test a new configuration (feel free to invent better techniques, but these work well):
Keep in mind that in most cases, you will not know whether the output is actually correct, but in some cases it can be clear that there is a problem. For example, let's say you wanted to test a service that checks for database connectivity. You probably assumed that you did not need a password for the check, but after running the test on the command-line, the program reports some error about an invalid password. As a result, you now know to contact someone to get the correct password for the database.
Testing from the command-line
Each check_command is associated with an actual program. You can find the path to the actual program by examining a check_command's actual command-definition in /etc/nagios/commands.cfg. Then just run it at the command-line to make sure there are no problems with the output.
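Finding the program behind a check_command can also be scripted. The sketch below searches the text of a commands file for a command definition and returns its command_line. It is a simplification and not a full Nagios parser: it assumes each directive sits on its own line and that the command line itself contains no ";" or "}" characters. The function name is illustrative.

```python
import re


def find_command_line(cfg_text, command_name):
    """Return the command_line of the named command definition, or None."""
    for body in re.findall(r"define\s+command\s*\{(.*?)\}", cfg_text, re.S):
        directives = {}
        for raw in body.splitlines():
            line = raw.split(";", 1)[0].strip()  # drop trailing ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)  # directive name, then its value
            if len(parts) == 2:
                directives[parts[0]] = parts[1].strip()
        if directives.get("command_name") == command_name:
            return directives.get("command_line")
    return None
```

For example, `find_command_line(open("/etc/nagios/commands.cfg").read(), "check_nfs_disk")` would return the unexpanded command line, which you can then fill in by hand and run at the command line.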
Testing by disabling notifications
After the command-line test, a final check should be to set the service definition directive, "notifications_enabled," to 0, then reload Nagios. For example:
    define service{
        use                     generic-service
        host_name               sulky46.slac.stanford.edu
        service_description     u05 disk space
        is_volatile             0
        check_period            24x7
        max_check_attempts      4
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          core
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_nfs_disk!5!2!/nfs/farm/g/glast/u05
        notifications_enabled   0
        }
This way, Nagios will report the check results on the Nagios Web Interface (http://glastlnx02.slac.stanford.edu/nagios), but notifications will not be sent out. Once the service check results appear correct, enable notifications for that service by setting the "notifications_enabled" directive to 1, then reload Nagios.
6. Reload Nagios
On the command-line, enter
sudo /etc/init.d/nagios reload
Our Nagios configuration resides on the machine glastlnx02.slac.stanford.edu. Object data is a generic term used to describe the various data definitions you need in order to monitor anything. Object definitions are written in a template format. For the complete definition, syntax, and available directives for each kind of object definition, see the Nagios 2.x documentation here.
The following table lists the files/directories in which object data is defined.
Type of Object Definition | Notes | Location (on glastlnx02) |
---|---|---|
Service | These definitions are normally contained in the host configuration files, each named after the host with which the service is associated. There is an exception: | /etc/nagios/hosts |
Service Groups | Optionally, a service definition can register itself as a member of a servicegroup via the <servicegroups> directive within a service definition. | /etc/nagios/servicegroups.cfg |
Hosts | A host definition is used to define a physical server, workstation, or device that resides on your network. The host configuration file is named after the host. For example, the host configuration file for glastlnx02.slac.stanford.edu is /etc/nagios/hosts/glastlnx02.cfg. | /etc/nagios/hosts/ |
Host Groups | Used to group one or more hosts together for display purposes in the CGIs. | /etc/nagios/hostgroups.cfg |
Contacts | Used to identify someone who should be contacted in the event of a problem on your network | /etc/nagios/contacts.cfg |
Contact Groups | Used to group one or more contacts together for the purpose of sending out alert/recovery notifications. | /etc/nagios/contacts.cfg |
Commands | These configuration files are well documented so you can learn a great deal from just opening them and viewing their contents | /etc/nagios/command-plugins.cfg |
Time periods | A list of times during various days that are considered to be "valid" times for notifications and service checks. This configuration file is well documented so you can learn a great deal from just viewing its contents | /etc/nagios/time.cfg |
Service Escalations | Completely optional and are used to escalate notifications for a particular service. Not currently being used, but feel free to try it if you find it useful. | N/A |
Service Dependencies | An advanced feature of Nagios that allows you to suppress notifications and active checks of services based on the status of one or more other services. Not currently being used, but feel free to try it if you find it useful. | N/A |
Host Escalations | Optional, and used to escalate notifications for a particular host. Not currently being used, but feel free to try it if you find it useful. | N/A |
Host Dependencies | An optional, advanced feature of Nagios that allows you to suppress notifications for hosts based on the status of one or more other hosts. | N/A |
Extended Host Information | Used to make the output from the status, statusmap, statuswrl, and extinfo look neat. They have no effect on monitoring and are completely optional. Store Extended Host Information in the corresponding host's configuration file. | /etc/nagios/hosts/ |
Extended Service Information | Optional, and used to make the output from the status and extinfo CGIs look neat. This is rarely used and up to your discretion about when to use them. In any case, if you decide to use an Extended Service Information, then add it to the configuration file which contains the corresponding service definition. | /etc/nagios/hosts/ |
On the SLAC machine glastlnx02, there is a main configuration file (/etc/nagios/nagios.cfg) in which you specify the configuration files and directories that contain object definitions, by using the following two directives: cfg_file (a single file) and cfg_dir (a directory of configuration files).
So for example, in /etc/nagios/nagios.cfg, you will find
    cfg_file=/etc/nagios/commands.cfg
    cfg_file=/etc/nagios/template.cfg
    cfg_dir=/etc/nagios/hosts/
    cfg_file=/etc/nagios/hostgroups.cfg
    cfg_file=/etc/nagios/servicegroups.cfg
    cfg_file=/etc/nagios/contacts.cfg
    cfg_file=/etc/nagios/time.cfg
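To see at a glance which files and directories Nagios will read object definitions from, you can pull the cfg_file and cfg_dir entries out of the main configuration file. A minimal sketch follows; the function name is made up, and all other main-file options are simply ignored.

```python
def object_config_sources(main_cfg_text):
    """Return (files, dirs) listed by cfg_file= and cfg_dir= directives."""
    files, dirs = [], []
    for raw in main_cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "cfg_file":
            files.append(value)
        elif key == "cfg_dir":
            dirs.append(value)
    return files, dirs


sample = """# object configuration
cfg_file=/etc/nagios/commands.cfg
cfg_dir=/etc/nagios/hosts/
cfg_file=/etc/nagios/contacts.cfg
"""
print(object_config_sources(sample))
```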
The main configuration file is /etc/nagios/nagios.cfg. Without it, Nagios is clueless. The main configuration file is responsible for setting global options, such as the Host Check Timeout, which is the maximum number of seconds that Nagios will allow host checks to run, and the Timing Interval Length (interval_length), which is the number of seconds in one "time unit" used for check intervals and notification intervals.
For the most part, the main configuration file is already set up. On rare occasions, you may want to make changes or enhancements. The list of options available for the main configuration file can be found in the Nagios 2.x documentation here.
Plugins are the programs/scripts (Perl, shell, etc.) that can be run from a command line to check the status of a host or service. Nagios is useless without plugins. Nagios uses the results from plugins to determine the current status of hosts and services on your network.
Where are our Nagios plugins?
We have two kinds of plugins, which are listed in the following table.
Type of Plugin | Description | Location (on glastlnx02) |
---|---|---|
Standard | These plugins were already developed by external developers, and downloaded from http://sourceforge.net/projects/nagiosplug/. | /usr/lib64/nagios/plugins |
Custom | These are plugins that we/you develop. | /etc/nagios/homemade_plugins |
To determine what a plugin does, just run the plugin on the command line with the "-h" option. For example,
    foo@glastlnx02 $ /usr/lib64/nagios/plugins/check_disk -h
    check_disk v1848 (nagios-plugins 1.4.11)
    Copyright (c) 1999 Ethan Galstad <nagios@nagios.org>
    Copyright (c) 1999-2006 Nagios Plugin Development Team <nagiosplug-devel@lists.sourceforge.net>

    This plugin checks the amount of used disk space on a mounted file system
    and generates an alert if free space is less than one of the threshold values

    Usage: check_disk -w limit -c limit [-W limit] [-K limit] {-p path | -x device}
    [-C] [-E] [-e] [-g group ] [-k] [-l] [-M] [-m] [-R path ] [-r path ]
    [-t timeout] [-u unit] [-v] [-X type]

    Options:
     -h, --help
        Print detailed help screen
     -V, --version
        Print version information
     -w, --warning=INTEGER
        Exit with WARNING status if less than INTEGER units of disk are free
     -w, --warning=PERCENT%
        Exit with WARNING status if less than PERCENT of disk space is free
     -c, --critical=INTEGER
        Exit with CRITICAL status if less than INTEGER units of disk are free
     -c, --critical=PERCENT%
        Exit with CRITCAL status if less than PERCENT of disk space is free
     -W, --iwarning=PERCENT%
        Exit with WARNING status if less than PERCENT of inode space is free
     -K, --icritical=PERCENT%
        Exit with CRITICAL status if less than PERCENT of inode space is free
     -p, --path=PATH, --partition=PARTITION
        Path or partition (may be repeated)
     -x, --exclude_device=PATH <STRING>
        Ignore device (only works if -p unspecified)
     -C, --clear
        Clear thresholds
     -E, --exact-match
        For paths or partitions specified with -p, only check for exact paths
     -e, --errors-only
        Display only devices/mountpoints with errors
     -g, --group=NAME
        Group pathes. Thresholds apply to (free-)space of all partitions together
     -k, --kilobytes
        Same as '--units kB'
     -l, --local
        Only check local filesystems
     -L, --stat-remote-fs
        Only check local filesystems against thresholds. Yet call stat on remote filesystems
        to test if they are accessible (e.g. to detect Stale NFS Handles)
     -M, --mountpoint
        Display the mountpoint instead of the partition
     -m, --megabytes
        Same as '--units MB'
     -A, --all
        Explicitly select all pathes. This is equivalent to -R '.*'
     -R, --eregi-path=PATH, --eregi-partition=PARTITION
        Case insensitive regular expression for path/partition (may be repeated)
     -r, --ereg-path=PATH, --ereg-partition=PARTITION
        Regular expression for path or partition (may be repeated)
     -I, --ignore-eregi-path=PATH, --ignore-eregi-partition=PARTITION
        Regular expression to ignore selected path/partition (case insensitive) (may be repeated)
     -i, --ignore-ereg-path=PATH, --ignore-ereg-partition=PARTITION
        Regular expression to ignore selected path or partition (may be repeated)
     -t, --timeout=INTEGER
        Seconds before connection times out (default: 10)
     -u, --units=STRING
        Choose bytes, kB, MB, GB, TB (default: MB)
     -v, --verbose
        Show details for command-line debugging (Nagios may truncate output)
     -X, --exclude-type=TYPE
        Ignore all filesystems of indicated type (may be repeated)

    Examples:
     check_disk -w 10% -c 5% -p /tmp -p /var -C -w 100000 -c 50000 -p /
        Checks /tmp and /var at 10% and 5%, and / at 100MB and 50MB
     check_disk -w 100M -c 50M -C -w 1000M -c 500M -g sidDATA -r '^/oracle/SID/data.*$'
        Checks all filesystems not matching -r at 100M and 50M. The fs matching the -r regex
        are grouped which means the freespace thresholds are applied to all disks together
     check_disk -w 100M -c 50M -C -w 1000M -c 500M -p /foo -C -w 5% -c 3% -p /bar
        Checks /foo for 1000M/500M and /bar for 5/3%. All remaining volumes use 100M/50M

    Send email to nagios-users@lists.sourceforge.net if you have questions
    regarding use of this software. To submit patches or suggest improvements,
    send email to nagiosplug-devel@lists.sourceforge.net
In general, you develop a custom plugin if you cannot find a pre-existing plugin that accomplishes the kind of check that you desire.
Creating a Custom Plugin
Creating a Nagios plugin is fairly simple. There are three basic steps
Store the path to the program in a $USERx$ (see /etc/nagios/resource.cfg) variable. I have arbitrarily chosen the $USER5$ variable to point to /etc/nagios/homemade_plugins.
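As an illustration of what a custom plugin looks like, here is a minimal sketch of a free-space check. It is not the actual check_disk_nfs script in /etc/nagios/homemade_plugins; the function names and the output wording are made up. What it does rely on is the standard plugin convention: print one line of status text and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN).

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(free_percent, warn, crit):
    """Map a free-space percentage onto a plugin status."""
    if free_percent < crit:
        return CRITICAL
    if free_percent < warn:
        return WARNING
    return OK


def check_free_space(path, warn, crit):
    """Return (exit_code, status_line) for the filesystem holding `path`."""
    try:
        stats = os.statvfs(path)
    except OSError as exc:
        return UNKNOWN, "DISK UNKNOWN - %s" % exc
    if stats.f_blocks == 0:
        return UNKNOWN, "DISK UNKNOWN - no size information for %s" % path
    free_percent = 100.0 * stats.f_bavail / stats.f_blocks
    status = classify(free_percent, warn, crit)
    return status, "DISK %s - %.1f%% free on %s" % (LABELS[status], free_percent, path)
```

In a real plugin script you would parse the -w, -c, and -p options, print the status line, and pass the exit code to `sys.exit()`, so that Nagios can read the result.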
How do I obtain plugins?
You can find additional plugins at http://sourceforge.net/projects/nagiosplug/. I recommend that you first check whether they are already available in /usr/lib64/nagios/plugins or /etc/nagios/homemade_plugins/.
How to make Nagios Use the Plugin?
All you need to do is create a command definition in /etc/nagios/commands.cfg. You can find documentation on how to do so in the file /etc/nagios/commands.cfg itself. The command definition associates a name with an actual executable. Here is an example command definition:
    define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
        }
The macro $HOSTADDRESS$ refers to the address of the host for which the check command is run (for a service check, the host that the service belongs to). The value of $USER1$ is /usr/lib64/nagios/plugins (see /etc/nagios/resource.cfg). Here is an example host definition that uses the "check-host-alive" command:
    define host{
        use                     generic-host    ; Name of host template to use
        host_name               glastlnx02.slac.stanford.edu
        alias                   glastlnx02
        address                 134.79.200.89
        check_command           check-host-alive
        max_check_attempts      10
        check_period            24x7
        notification_interval   120
        notification_period     24x7
        notification_options    d,r
        contact_groups          core
        }
Some Nagios plugins are only intended to check local resources (disk space, CPU load, etc.). For example, let's say you want to check the disk space usage for the /tmp directory on glastlnx05. Well there is a Nagios plugin, check_disk, on glastlnx02, but here are the problems
The NRPE addon is the facility that will allow you to execute Nagios plugins on remote Linux/Unix machines, and communicate the result to Nagios as if the plugin is being run locally.
How to use NRPE on Machines at SLAC?
You do not need to worry about configuring the NRPE addon. All you need to do is the following
Notice that there are no thresholds supplied to the "check_disk" plugin when it is passed to the "check_nrpe" command. When you email the Unix-admins, they will normally discuss with you the thresholds that will be used for the particular command, and the Unix-admins will set the thresholds on their end.
Tip: Some Nagios plugins are used to check local resources, but it is difficult to tell by their names. Command definitions (see /etc/nagios/commands.cfg) have been created to provide more meaningful command-names for those Nagios plugins that are for local resources (local to the machine that Nagios is running on, i.e., glastlnx02):
If the machines to be monitored are behind a firewall (e.g. on Stanford campus) and the host running Nagios is outside of that firewall it will be a challenge. One solution is to poke a tiny hole in the firewall filter that allows Nagios to make calls to the nrpe daemon running on the host machine behind the firewall. We don't have any actual experience for this case yet so the information is limited. For more details refer to indirect host checking.
The service check and host check logic decides whether to send out notifications for a host or service. Nagios allows you to control when and how notifications get sent.
One directive in the main configuration file, /etc/nagios/nagios.cfg, overrides the decision to send notifications for ALL hosts/services: enable_notifications. This option determines whether or not Nagios will send out notifications when Nagios (re)starts. If enable_notifications=0, then Nagios will NOT send out any notifications, despite the results from the host/service checks.
Several filters must be passed before notifications are sent: Program wide -> Service/Host -> Contact. The details can be found in the Nagios 2.x documentation, section "Notifications," here
Program wide filter
The <enable_notifications> option can be set in the main configuration file (/etc/nagios/nagios.cfg) to enable or disable notifications program-wide. If notifications are disabled here, then no notifications are sent at all. If notifications are enabled, then they must still pass the Service/Host filters...
Service and Host Filters
Several filters must be passed within the scope of Service and Host notifications
Contact Filter
After the first set of filters (program wide, service & host) are passed, each contact has its own filter that a notification must pass through. Within a contact definition, you can separately specify the kinds of notifications that you can receive from hosts, or services, via the following directives
Notifications for a host, or service, are sent if the notification type is enabled by the corresponding directives listed above.
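The filter chain (program wide, then service/host, then contact) can be pictured as a series of checks that must all succeed. The sketch below is a deliberate simplification covering service notifications only: it looks at enable_notifications, notifications_enabled, notification_options, and the contact's service_notification_options, and it leaves out time periods and the other filters. The function names and the dictionary layout are illustrative only.

```python
def parse_options(value):
    """Turn an option string such as 'w,u,c,r' into a set of letters."""
    return {item.strip() for item in value.split(",") if item.strip()}


def should_notify(enable_notifications, service, contact, state):
    """Return True if a service notification for `state` passes every filter.

    `state` is one of the option letters: w, u, c, or r.
    """
    if not enable_notifications:              # program-wide filter
        return False
    if not service["notifications_enabled"]:  # service filter
        return False
    if state not in service["notification_options"]:
        return False
    return state in contact["service_notification_options"]  # contact filter


service = {"notifications_enabled": True,
           "notification_options": parse_options("w,u,c,r")}
contact = {"service_notification_options": parse_options("w,u,r")}
print(should_notify(True, service, contact, "w"))  # True
print(should_notify(True, service, contact, "c"))  # False: contact omits "c"
```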
What kind of Notifications are available?
In general, anything you can do from a command-line can be tailored to send notifications. Here is a list of possible notification methods
Email is our primary notification method. A notification method is implemented by a command definition. Inside a contact definition, you specify the kind of host/service notification method via the following two directives
For example, the contact definition for Mr. Foo would be ...
    define contact{
        contact_name                    ecephas
        alias                           Emmanuel Cephas
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,r
        host_notification_options       n
        service_notification_commands   notify-by-email
        host_notification_commands      host-notify-by-email
        email                           ecephas@slac.stanford.edu
        }
I will explain the other directives in the next section (Contact Definitions). You can find the command definitions for various notifications (epager, qpage, an SMS client, etc.) in /etc/nagios/command-plugins.cfg.
When do Notifications get sent?
Host and service notifications get sent out in the following situations
The host or service remains in a hard non-OK state and the <notification_interval> time (a service/host definition directive) has elapsed since the last notification was sent
Tip: If you want to prevent notifications from getting sent out more than once for any given problem, then set the <notification_interval> value to 0.
The host, and service, definition directive, notification_interval, is used to define the number of "time units" to wait before re-notifying a contact that this service is still in a non-OK state. The default value of interval_length is 60, therefore the integer value of notification_interval is interpreted as a multiple of 60 seconds. So in the following example, the "core" contact group will be notified every 16 hours (960 * 60 seconds) in the event that the "u05 disk space" is still in a HARD Non-OK state:
    define service{
        use                     generic-service
        host_name               sulky46.slac.stanford.edu
        service_description     u05 disk space
        is_volatile             0
        check_period            24x7
        max_check_attempts      4
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          core
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_nfs_disk!5!2!/nfs/farm/g/glast/u05
        notifications_enabled   1
        }
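The arithmetic behind the 16 hours can be checked with a couple of lines. The function name is illustrative, and the default of 60 seconds for interval_length is the one stated above.

```python
def renotify_seconds(notification_interval, interval_length=60):
    """Seconds between repeat notifications for a lingering problem."""
    return notification_interval * interval_length


seconds = renotify_seconds(960)
print(seconds)           # 57600
print(seconds / 3600.0)  # 16.0 hours
```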
It is important to understand the state types (hard and soft) because Nagios uses the state type to determine if, and how, to respond to a check result. I am going to provide a brief overview. For more details, see section "State Types," in the Nagios 2.x documentation http://nagios.sourceforge.net/docs/2_0/statetypes.html. The current state of a host or service is determined by two components
The <max_check_attempts> directive in a service or host definition is the crucial factor that determines a state's type (hard or soft). Once the service or host has been checked <max_check_attempts> times and is still in a non-OK state (i.e., CRITICAL, UNKNOWN, DOWN, WARNING), the state becomes hard.
See the Nagios 2.x documentation, section "Template Based Object Configuration" http://nagios.sourceforge.net/docs/2_0/xodtemplate.html#service for a description of <max_check_attempts>.
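The role of <max_check_attempts> can be sketched as follows. This is a simplified model that only classifies non-OK results by attempt number; the function name is made up, and the full rules are in the "State Types" section of the Nagios documentation.

```python
def problem_state_type(attempt, max_check_attempts):
    """Classify a non-OK check result as 'SOFT' or 'HARD'.

    `attempt` is the 1-based count of consecutive non-OK results.
    """
    return "HARD" if attempt >= max_check_attempts else "SOFT"


# With max_check_attempts set to 4, as in the "u05 disk space" service:
print([problem_state_type(n, 4) for n in range(1, 6)])
# ['SOFT', 'SOFT', 'SOFT', 'HARD', 'HARD']
```

With normal_check_interval 5 and retry_check_interval 1, this means a problem is rechecked every minute until the fourth attempt, at which point the state becomes hard.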
For example, whether notifications are sent, or resent, depends on the state type. Occasionally, the documentation will tell you that Nagios performs some particular operation only if the service or host enters a specific state type (i.e., HARD or SOFT). The different state types are ...
Soft States
Soft states occur when a service or host check
Recovers from a soft state (called a soft recovery)
Soft state events (What you need to know about Soft states)
Hard states
Hard states occur in the following situations
Hard state change
Hard state changes occur when a service or host changes from a
hard non-OK state of some kind to a hard non-OK state of another kind (i.e. from a hard WARNING state to a hard UNKNOWN state)
Hard state events (What you need to know about Hard states)
In order for a person to receive any kind of notification, you need to create a contact definition for that person. Nagios 2.x allows you to specify a contact group to receive notifications for a particular host or service. Each contact group is made up of one or more contacts.
Documentation about how to create a contact definition can be found here.
What you need to know specifically about our setup
The Nagios mailing list (nagios@glast.stanford.edu) is a mailing list that a person can sign up for to receive specific notifications (down, critical, recovery) about
A contact group is used to group together multiple contacts into a single identifier for the purpose of sending notifications. For more details, see http://nagios.sourceforge.net/docs/2_0/xodtemplate.html#contactgroup. To create a contact group, you need to create a contact group definition.
Where are our contact group definitions located?
Both contact definitions and contact group definitions are located on the computer, glastlnx02, in the file /etc/nagios/contacts.cfg.
How to view our contact groups?
You can view our contact groups in two ways
The Web Interface is better suited for viewing the details of contact groups, because the associated configuration information (i.e., Service Notification Period, Service Notification Command, etc.) is gathered in one place.
Our Current Setup (as of July 7, 2009)
The following table shows which contact groups get notified about which hosts and services. You can find out the members of a contact group by following the directions from the previous section ("How to view contact groups").
contact group(s) | Host(s) / Service(s) Notifications | Notes |
---|---|---|
core | All non-ISOC hosts and services (see Web Interface, "Hostgroup Overview," or "Service Overview") and the "Ping" service for the "Goddard" host. | |
flightops | ISOC hosts and services (see Web Interface, "Hostgroup Overview," or "Service Overview" for list of ISOC hosts and services) | |
systests | u17 disk space on sulky27 | The core group also watches u17 disk space. |
*core_no_nagios,critical_core | ALL Ping services EXCEPT the Ping service for the host, "Goddard." | *This combination of contact groups satisfies the logic required for the core group (with the exception of the Nagios mailing list) to receive ALL notifications types (WARNING, UNKNOWN, CRITICAL, etc.) |
warren_focke | hosts: sulky27 and sulky39 | notifications for warning/unknown states for u15, u52. |
How to create a contact group definition?
A contact group definition is just another kind of Nagios object definition. Documentation for creating contact group definitions can be found here: http://nagios.sourceforge.net/docs/2_0/xodtemplate.html.
Our convention is to keep the contact and contact group definitions in the file /etc/nagios/contacts.cfg on the computer glastlnx02.
For example, the following contact group definition specifies the "core" contact group:

    define contactgroup{
        contactgroup_name       core
        alias                   Core Group
        members                 nagios, Navid, Richard, Heather, Karen, Tom
        }
How to use the contact group definition?
To send notifications about a host or service to the members of a contact group, set the contact_groups directive to the name of that group in the host or service definition. For example, the following service definition notifies the "core" group:

    define service{
        use                     generic-service     ; Name of service template to use
        host_name               foo.slac.stanford.edu
        service_description     Ping
        is_volatile             0
        check_period            24x7
        max_check_attempts      4
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          core
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_ping!100.0,20%!500.0,60%
        }
Documentation for creating any kind of Nagios object definition can be found in the Nagios 2.x documentation, section "Template-Based Object Configuration" (see http://nagios.sourceforge.net/docs/2_0/xodtemplate.html).
Special contact groups for our Nagios setup
Nagios can only be told to notify a contact group about a host or service event. In this version of Nagios (Nagios 2.x), there is no directive for naming a single contact in a host or service definition (though this is possible in Nagios 3.x). For example, suppose that for the "Ping" service definition above we want to notify only one member of the "core" contact group (e.g., Tom) instead of the whole group.
If you want a host or service notification to go to a single contact, create a contact group whose only member is that contact, then specify that contact group.
So, in this example, since we want to send notifications only to Tom, we first create a contact group whose only member is Tom:

    define contactgroup{
        contactgroup_name       Tom_group
        alias                   Group for Tom
        members                 Tom
        }
Now we change the value of the contact_groups directive in the "Ping" service definition:

    define service{
        use                     generic-service     ; Name of service template to use
        host_name               foo.slac.stanford.edu
        service_description     Ping
        is_volatile             0
        check_period            24x7
        max_check_attempts      4
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          Tom_group
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_ping!100.0,20%!500.0,60%
        }
Now the above service definition will send notifications about the "Ping" service to Tom.
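Because every notification target has to be a contact group, a misspelled or missing group name is an easy mistake to make when adding groups such as Tom_group. The sketch below cross-checks the names used in contact_groups directives against the names declared with contactgroup_name. It is only an illustration: the function names and the two sample strings are hypothetical, and it does not replace verifying the configuration with Nagios itself before a restart.

```python
import re


def defined_groups(cfg_text):
    """Names declared with a contactgroup_name directive."""
    return set(re.findall(r"^\s*contactgroup_name\s+(\S+)", cfg_text, re.MULTILINE))


def referenced_groups(cfg_text):
    """Names listed in any contact_groups directive (comma-separated values)."""
    names = set()
    pattern = r"^\s*contact_groups\s+([^;\n]+)"   # value stops at a ';' comment or end of line
    for value in re.findall(pattern, cfg_text, re.MULTILINE):
        names.update(n.strip() for n in value.split(",") if n.strip())
    return names


def undefined_groups(contacts_text, objects_text):
    """Groups that host/service definitions use but contacts.cfg does not declare."""
    return sorted(referenced_groups(objects_text) - defined_groups(contacts_text))


# Hypothetical sample input: contacts.cfg declares only Tom_group here.
CONTACTS = """
define contactgroup{
    contactgroup_name   Tom_group
    alias               Group for Tom
    members             Tom
    }
"""
SERVICES = """
define service{
    host_name           foo.slac.stanford.edu
    service_description Ping
    contact_groups      Tom_group,core   ; core is not declared in CONTACTS
    }
"""

print(undefined_groups(CONTACTS, SERVICES))  # prints ['core']
```

In practice the first argument would be the text of /etc/nagios/contacts.cfg and the second the text of a host configuration file.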
What to name the file?
Our convention is to name the file after the host (e.g., for the host glastlnx02.slac.stanford.edu, name the file /etc/nagios/hosts/glastlnx02.cfg).
What standard host and service definitions should be added to the configuration file?
If you are adding a new host to the Nagios configuration, our convention is to add two standard service definitions (Ping and SSH), unless glastlnx02 legitimately cannot ping, or make an SSH connection to, the desired host.
Before you add the standard Ping and SSH service definitions for the host, verify that glastlnx02 can ping the host and open an SSH connection to it by testing the check commands on the command line. If a command returns a non-OK state, there might be a legitimate problem. In that case, check with the SLAC admins to see whether the host intentionally blocks SSH or Ping connections.
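The command-line test described above might look like the following sketch. Several things here are assumptions rather than facts about our setup: that the check_ping and check_ssh plugins live in /etc/nagios/plugins, that the two check_ping arguments in our service definitions map to the plugin's -w (warning) and -c (critical) thresholds as in the stock command definition, and that Python 3.5 or later is available. The helper names are made up for illustration.

```python
import subprocess

PLUGIN_DIR = "/etc/nagios/plugins"  # assumption: where the plugins are installed

# Standard Nagios plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def ping_command(host, warn="100.0,20%", crit="500.0,60%"):
    """argv for check_ping; thresholds are round-trip time (ms), packet loss."""
    return [PLUGIN_DIR + "/check_ping", "-H", host, "-w", warn, "-c", crit]


def ssh_command(host):
    """argv for check_ssh against the default SSH port."""
    return [PLUGIN_DIR + "/check_ssh", "-H", host]


def run_check(argv):
    """Run one plugin and return (state name, first line of its output)."""
    try:
        proc = subprocess.run(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, universal_newlines=True)
    except OSError as err:  # e.g. the plugin is not at PLUGIN_DIR
        return "UNKNOWN", str(err)
    lines = proc.stdout.splitlines()
    return STATES.get(proc.returncode, "UNKNOWN"), (lines[0] if lines else "")


# Show the equivalent shell commands for the placeholder host used on this page.
for argv in (ping_command("foo.slac.stanford.edu"), ssh_command("foo.slac.stanford.edu")):
    print(" ".join(argv))
```

Calling run_check(ping_command("foo.slac.stanford.edu")) on glastlnx02 would then return the state and the plugin's status line. Anything other than OK is the cue to ask the SLAC admins before adding the service.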
Add the following host definition:

    define host{
        use                     generic-host        ; Name of host template to use
        host_name               <host address>
        alias                   <host name>
        address                 <host IP address>
        check_command           check-host-alive
        max_check_attempts      10
        check_period            24x7
        notification_interval   120
        notification_period     24x7
        notification_options    d,r
        contact_groups          core
        }

<host address> is the domain name of the host (e.g., foo.slac.stanford.edu).
<host name> is just the name of the host (e.g., foo).
<host IP address> is the IP address of the host.
If the host is pingable, create a service definition for Ping:
    define service{
        use                     generic-service     ; Name of service template to use
        host_name               <host address>
        service_description     Ping
        is_volatile             0
        check_period            24x7
        max_check_attempts      4
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          core_no_nagios,critical_core
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_ping!100.0,20%!500.0,60%
        }
If the host allows an SSH connection, create a service definition for SSH:

    define service{
        use                     generic-service     ; Name of service template to use
        host_name               <host address>
        service_description     SSH
        is_volatile             0
        check_period            24x7
        max_check_attempts      4
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          core
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_ssh
        }
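Since the host, Ping, and SSH definitions are the same for every new host apart from the name and address, the text of a new host file can be generated rather than typed. This is a sketch under stated assumptions, not a tool that exists in our setup: render_host_config is a hypothetical helper, the alias is assumed to be the first label of the domain name, and 192.0.2.10 is a documentation-only placeholder address.

```python
from string import Template

HOST = Template("""\
define host{
    use                     generic-host
    host_name               $fqdn
    alias                   $alias
    address                 $ip
    check_command           check-host-alive
    max_check_attempts      10
    check_period            24x7
    notification_interval   120
    notification_period     24x7
    notification_options    d,r
    contact_groups          core
    }
""")

SERVICE = Template("""\
define service{
    use                     generic-service
    host_name               $fqdn
    service_description     $description
    is_volatile             0
    check_period            24x7
    max_check_attempts      4
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          $contact_groups
    notification_options    w,u,c,r
    notification_interval   960
    notification_period     24x7
    check_command           $check_command
    }
""")


def render_host_config(fqdn, ip, ping=True, ssh=True):
    """Return the text of a host file: host definition plus the standard services."""
    alias = fqdn.split(".")[0]  # assumption: alias is the short host name
    blocks = [HOST.substitute(fqdn=fqdn, alias=alias, ip=ip)]
    if ping:  # leave out when the host legitimately blocks Ping
        blocks.append(SERVICE.substitute(
            fqdn=fqdn, description="Ping",
            contact_groups="core_no_nagios,critical_core",
            check_command="check_ping!100.0,20%!500.0,60%"))
    if ssh:   # leave out when the host legitimately blocks SSH
        blocks.append(SERVICE.substitute(
            fqdn=fqdn, description="SSH",
            contact_groups="core", check_command="check_ssh"))
    return "\n".join(blocks)


print(render_host_config("foo.slac.stanford.edu", "192.0.2.10"))
```

The printed text would be saved under the file name our convention gives for the host, then verified with Nagios before restarting.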
Where to place the service definition?
Our convention is to place the service definition in the host configuration file named after the host with which the service is associated. So, for example, if you want to check the Ping service on the computer named foo, then the definition goes in the configuration file for foo, named as described under "What to name the file?".
What additional service definition options need to be set?
Most of the service definition options (also known as directives) stay the same from service to service. Unless a request says otherwise, keep the options and values marked "default" below, and fill in only the last four:

    define service{
        use                     generic-service     ; default. Name of service template to use
        is_volatile             0                   ; default
        check_period            24x7                ; default
        max_check_attempts      4                   ; default
        normal_check_interval   5                   ; default
        retry_check_interval    1                   ; default
        notification_options    w,u,c,r             ; default
        notification_interval   960                 ; default
        notification_period     24x7                ; default
        contact_groups          <contact-group>
        host_name               <host address>
        service_description     <service name>
        check_command           <check-command>
        }
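More information about service definitions can be found in the Nagios 2.x documentation, section "Template-Based Object Configuration" (see http://nagios.sourceforge.net/docs/2_0/xodtemplate.html). The same documentation also has a section on time-saving tricks for object definitions, which are very handy.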
Use CVS to back up the configuration: check in the changes that you have made to the configuration. You may want to run cvs diff before checking in.
Example:

    cvs diff sulky45.cfg
    cvs ci sulky45.cfg
If you are unfamiliar with CVS, there is plenty of documentation available online.
No executables should be backed up. Also, we do not back up the directory /etc/nagios/plugins (because we can download them again if we need them) or /etc/nagios/disabled-hosts.