Monitoring: Nagios

Monitoring Basics

What types of resources should we monitor?

What is Nagios?

History of Nagios

Components of Nagios

Core server process
  • Core logic for monitoring
  • Keeps track of service states
  • Starts service checks
CGI Web interface
  • Simple web interface which reads state from the core server's status file and submits commands through its external command file (a named pipe)

Components of Nagios

Plugins
  • Scripts written to gather monitoring information
  • Typically written in Perl, but can be written in almost any language
  • Has an API that you follow to create your own plugin
Add-ons (NRPE, NSCA)
  • Daemons that handle remote checks
  • NRPE: Active checking daemon
  • NSCA: Passive checking daemon (just listens for data)
  • With NSCA, the client must:
    • Run the check (and schedule it)
    • Send the data to NSCA using send_nsca
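
The plugin API mentioned above is small: a plugin prints one status line, optionally followed by "|" and performance data, and exits 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A minimal sketch, assuming a hypothetical plugin named check_file_count with illustrative thresholds (it is not part of nagios-plugins):

```shell
#!/bin/sh
# Hypothetical plugin sketch: warn when a directory holds too many entries.
# The function name, the "FILES" label and the thresholds are assumptions.
# Plugin convention: one status line, optional "| perfdata", and an exit
# code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).

check_file_count() {
    dir=$1
    warn=$2
    crit=$3

    if [ ! -d "$dir" ]; then
        echo "FILES UNKNOWN - $dir is not a directory"
        return 3
    fi

    # Count directory entries (not recursive, hidden files excluded).
    count=$(ls "$dir" | wc -l | tr -d ' ')

    # Perfdata format: label=value;warn;crit;min;max
    perf="files=$count;$warn;$crit;0;"

    if [ "$count" -ge "$crit" ]; then
        echo "FILES CRITICAL - $count entries in $dir | $perf"
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "FILES WARNING - $count entries in $dir | $perf"
        return 1
    fi

    echo "FILES OK - $count entries in $dir | $perf"
    return 0
}

# As a standalone plugin script, the last lines would be:
#   check_file_count "$1" "$2" "$3"
#   exit $?
```

Nagios only looks at the exit code to decide the state; the text is what shows up in the web interface and in notifications.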

Passive vs. Active


Images from documentation site

Active: NRPE


Problems with Active checks

What kind of problems would we have?

Passive: NSCA
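
With NSCA the client runs the check itself and pushes the result. send_nsca reads one result per line on stdin as tab-separated fields: host name, service description, return code, plugin output. A minimal sketch of building that line; the host, service and server names in the usage comment are assumptions for illustration:

```shell
#!/bin/sh
# Build one service check result in the tab-separated format that
# send_nsca reads on stdin: host, service, return code, plugin output.
nsca_line() {
    # $1=host name  $2=service description  $3=return code  $4=plugin output
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Possible use from cron on the monitored client. The server name and the
# config file path below are assumptions, adjust them to your setup:
#   nsca_line web01 "Nightly Backup" 0 "BACKUP OK - finished" |
#       send_nsca -H nagios.example.com -c /etc/nagios/send_nsca.cfg
```

The host name and service description must match a host and a service that the Nagios server already has defined, otherwise the result is dropped.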


When are Passive checks useful?



Check_MK is an extension to Nagios that allows more flexibility when checking servers.

Check_MK Architecture


# Install EPEL repo first!
$ yum install nrpe 'nagios-plugins*'
$ cd /usr/lib64/nagios/plugins
$ ./check_ssh localhost
SSH OK - OpenSSH_6.6.1 (protocol 2.0) | time=0.188930s;;;0.000000;10.000000

$ ./check_disk -w 15% -c 10%
DISK OK - free space: / 8223 MB (85% inode=92%); /dev 235 MB (100% inode=99%);
/dev/shm 244 MB (100% inode=99%); /run 240 MB (98% inode=99%); /sys/fs/cgroup
244 MB (100% inode=99%); /run/user/1000 48 MB (100% inode=99%);|
/=1376MB;8539;9041;0;10046 /dev=0MB;199;211;0;235 /dev/shm=0MB;207;219;0;244
/run=4MB;207;219;0;244 /sys/fs/cgroup=0MB;207;219;0;244

$ ./check_http -H
HTTP OK: HTTP/1.1 200 OK - 40668 bytes in 0.013 second response time
| time=0.013421s;;;0.000000 size=40668B;;;0
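
In the outputs above, everything before the "|" is the human-readable status and everything after it is performance data in the form label=value;warn;crit;min;max. A minimal sketch of splitting the two, assuming single-line output and perfdata labels without "/" or spaces (so it fits check_ssh and check_http, not the check_disk labels); the function names are made up for this example:

```shell
#!/bin/sh
# Split single-line plugin output into status text and perfdata.

status_text() {
    # Everything before the first "|", trailing blanks trimmed.
    printf '%s\n' "$1" | sed 's/ *|.*$//'
}

perfdata() {
    # Everything after the first "|", leading blanks trimmed.
    # Prints an empty line when the output carries no perfdata.
    case $1 in
        *'|'*) printf '%s\n' "${1#*|}" | sed 's/^ *//' ;;
        *)     printf '\n' ;;
    esac
}

perf_value() {
    # $1=plugin output  $2=perfdata label
    # Prints the value (with its unit) for that label.
    # Assumption: the label contains no "/" and no regex metacharacters.
    perfdata "$1" | tr ' ' '\n' | sed -n "s/^$2=\([^;]*\).*/\1/p"
}

out='SSH OK - OpenSSH_6.6.1 (protocol 2.0) | time=0.188930s;;;0.000000;10.000000'
status_text "$out"       # SSH OK - OpenSSH_6.6.1 (protocol 2.0)
perf_value "$out" time   # 0.188930s
```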

NRPE Configuration

# /etc/nagios/nrpe.cfg on the remote host (one command per line)
command[check_users]=/usr/lib64/nagios/plugins/check_users -w 5 -c 10
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_hda1]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /dev/hda1

# Command run on the Nagios server
check_nrpe -H -c check_load

# Testing it on a local machine
$ systemctl start nrpe
$ /usr/lib64/nagios/plugins/check_nrpe -H -c check_load
OK - load average: 0.04, 0.13, 0.07|load1=0.040;15.000;30.000;0;
load5=0.130;10.000;25.000;0; load15=0.070;5.000;20.000;0;
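
check_nrpe can only request commands that the remote NRPE config defines as command[name]=... entries, so it helps to list what a host actually offers. A minimal sketch; the function name is made up for this example and the config path in the usage comment is an assumption:

```shell
#!/bin/sh
# Print the command names defined in an NRPE config file, one per line.
# Matching lines look like: command[check_load]=/path/to/plugin args
# Commented-out entries are skipped because they start with "#".
nrpe_commands() {
    sed -n 's/^[[:space:]]*command\[\([^]]*\)]=.*/\1/p' "$1"
}

# Possible use on the remote host (path is an assumption):
#   nrpe_commands /etc/nagios/nrpe.cfg
```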

Nagios Configuration Overview


Nagios configuration visualized

Nagios Config components

Main configuration file
  • Configures how the daemon operates
Resource file(s)
  • User-defined macros ($USERn$), e.g. plugin paths and credentials
Object definition files
  • Define hosts, services, hostgroups, contacts, contactgroups, commands
CGI configuration file
  • How the web interface is set up

Main configuration file


Main configuration file options


Resource configuration file(s)


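# Sets $USER1$ to be the path to the plugins
$USER1$=/usr/lib64/nagios/plugins
# Sets $USER2$ to be the path to event handlers (example path)
#$USER2$=/usr/lib64/nagios/plugins/eventhandlers
# Store some usernames and passwords (hidden from the CGIs); example value
#$USER3$=someuser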

Object Configuration Overview


Objects Defined

Hosts
Central objects in the monitoring logic. Hosts are usually physical devices, have an IP address, have one or more services assigned to them, and can have a parent/child relationship with other hosts.
Host Groups
Groups of one or more hosts. Groups can make it easier to view the status of related hosts from the web interface and simplify the configuration.
Services
The other central objects in the monitoring logic, always associated with hosts. They can be attributes of a host (CPU load, disk usage, etc.), services provided by the host (HTTP, SSH, etc.), and other things associated with the host (DNS records, etc.).

Objects Defined

Service Groups
Groups of one or more services. Service groups can make it easier to view the status of related services in the web interface and simplify the configuration.
Contacts
People involved in the notification process. Contacts have one or more notification methods (cell phone, pager, email, etc.) and receive notifications for the hosts and services they are responsible for.
Contact Groups
Groups of one or more contacts. Contact groups can make it easier to define all the people who get notified when certain host or service problems occur.

Objects Defined

Time Periods
Used to control when hosts and services can be monitored and when contacts can receive notifications.
Commands
Used to tell Nagios what programs, scripts, etc. it should execute to perform its tasks. Tasks may include host and service checks, notifications, event handlers and much more.

Object Definition Examples

# Host definition
define host {
  host_name      foo
  use            generic-host
  hostgroups     nrpe-hosts,ping-hosts
  contact_groups admins
}

# Host Group definition
define hostgroup {
  hostgroup_name  example-servers
  alias           Example Servers
  members         foo,bar
}

# Service definition
define service {
  use                 generic-service
  hostgroup_name      nrpe-hosts
  service_description SSH
  check_command       check_ssh
}

# Contact definition
define contact {
  contact_name  nagiosadmin
  use           generic-contact
  alias         Nagios Admin
}

# Contact group definition
define contactgroup {
  contactgroup_name admins
  alias             Nagios Admins
  members           nagiosadmin
}

# 'workhours' Time Period definition
define timeperiod {
  timeperiod_name workhours
  alias           Normal Work Hours
  monday          09:00-17:00
  tuesday         09:00-17:00
  wednesday       09:00-17:00
  thursday        09:00-17:00
  friday          09:00-17:00
}

# Command definition
define command {
  command_name    check_ping
  command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w \
    $ARG1$ -c $ARG2$ -p 5
}
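
The command_line above is a template. A service refers to it as check_command check_ping!ARG1!ARG2, with "!" separating the arguments, and Nagios fills in the macros before running it: $USER1$ from the resource file, $HOSTADDRESS$ from the host, $ARG1$ and $ARG2$ from the check_command. Nagios does this internally; the sketch below only mimics the substitution, and the function name, address and thresholds are assumptions for illustration:

```shell
#!/bin/sh
# Mimic how the macros in a command_line template get filled in.
# Limitation of this sketch: values must not contain "|", "&" or "\",
# because they are pasted into a sed replacement.
expand_command() {
    # $1=template  $2=$USER1$ value  $3=host address  $4=$ARG1$  $5=$ARG2$
    printf '%s\n' "$1" | sed \
        -e "s|\\\$USER1\\\$|$2|g" \
        -e "s|\\\$HOSTADDRESS\\\$|$3|g" \
        -e "s|\\\$ARG1\\\$|$4|g" \
        -e "s|\\\$ARG2\\\$|$5|g"
}

# Equivalent of: check_command  check_ping!100.0,20%!500.0,60%
template='$USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5'
expand_command "$template" /usr/lib64/nagios/plugins 192.0.2.10 \
    '100.0,20%' '500.0,60%'
# prints:
# /usr/lib64/nagios/plugins/check_ping -H 192.0.2.10 -w 100.0,20% -c 500.0,60% -p 5
```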

CGI Configuration File



Installing Nagios

$ yum install epel-release
$ yum install nagios
$ systemctl start nagios httpd