Adding Nagios checks for tape drives / libraries

I use Bacula for my backups. Most of my backups are done to disk, then copied to tape. I have long liked the idea of having multiple backups on different media. On a regular basis, I move the FULL backup tapes to an offsite location. This is standard practice and highly recommended if you can do it. I don’t use the latest and greatest tape drives. I’m using DLT drives. I like them because they’re pretty straightforward and cunningly clever. Such drives are readily available at decent prices. Data centers are upgrading to LTO technology and, thankfully, not all of them are tossing this equipment into the dustbin.

This morning, every one of my copy-to-tape jobs failed.

15-Dec 09:41 bacula-dir JobId 114139: Start Copying JobId 114139, Job=CopyToTape-Inc.2012-12-15_09.32.02_32
15-Dec 09:41 kraken-sd JobId 114139: Error: dev.c:120 Unable to stat device /dev/nsa0: ERR=No such file or directory
15-Dec 09:41 kraken-sd JobId 114139: Warning: 
     Device "DTL01" in changer "DTL01" requested by DIR could not be opened or does not exist.
15-Dec 09:41 kraken-sd JobId 114139: Error: dev.c:120 Unable to stat device /dev/nsa0: ERR=No such file or directory
15-Dec 09:41 kraken-sd JobId 114139: Warning: 
     Device "DTL01" in changer "DTL01" requested by DIR could not be opened or does not exist.
15-Dec 09:41 kraken-sd JobId 114139: Error: dev.c:120 Unable to stat device /dev/nsa0: ERR=No such file or directory
15-Dec 09:41 kraken-sd JobId 114139: Warning: 
     Device "DTL01" in changer "DTL01" requested by DIR could not be opened or does not exist.
15-Dec 09:41 kraken-sd JobId 114139: Fatal error: Device reservation failed for JobId=114139: Jmsg Job=CopyToTape-Inc.2012-12-15_09.32.02_32 type=5 level=1355564516 kraken-sd JobId 114139: Warning: 
     Device "DTL01" in changer "DTL01" requested by DIR could not be opened or does not exist.

15-Dec 09:41 bacula-dir JobId 114139: Fatal error: 
     Storage daemon didn't accept Device "DTL01" because:
     3924 Device "DTL01" not in SD Device resources.

I instantly knew what had happened. Last night I installed new batteries in all of my UPS units. This necessitated powering off all the systems. When I powered them up, I did so in the wrong order. My tape libraries are external units attached by SCSI cables. I should have powered up the tape libraries first, then the server to which they are attached. If I had done that, the system would have detected the tape libraries and the jobs would have run. Doing so in the other order (server first, tape libraries later) means that the server does not see the tape libraries until later. This is not the first time I have done this.

In order to save myself from myself, I will now attempt to create a Nagios check to verify that the proper devices are attached when the system starts up.

The Nagios check

This is the script I wrote. It’s pretty ugly, and not extremely flexible if you have many similar devices. However, it’s working for me. From /usr/local/libexec/nagios/check_camcontrol.sh:

#!/bin/sh
# NRPE check for camcontrol
# Written by: Dan Langille <dan@langille.org>
# version 1.0
#
# Covered by the two clause BSD license (which is not included here...)
#

PATH="/sbin:/bin:/usr/sbin:/usr/bin"

# e.g. DEC TL800
DEVICETEXT=$1

# e.g. (pass0,ada0)
DEVICENAMES=$2

if [ -x "/sbin/camcontrol" ]
then
  # search the output of camcontrol for the device text
  DEVICE=`/sbin/camcontrol devlist | grep "${DEVICETEXT}"`
else
  # without camcontrol we cannot tell either way; 3 is the Nagios UNKNOWN state
  echo "camcontrol binary does not exist on system"
  exit 3
fi

# use = rather than ==, which is not portable across /bin/sh implementations
if [ "${DEVICE}" = "" ]
then
  echo "${DEVICETEXT} not found in camcontrol output"
  exit 2
else
  # we found the device text, but is it at the right device?
  FOUNDDEVICENAMES=`echo "${DEVICE}" | grep "${DEVICENAMES}"`
  if [ "${FOUNDDEVICENAMES}" = "" ]
  then
    echo "${DEVICETEXT} was found, but not at ${DEVICENAMES}"
    exit 2
  else
    echo "${DEVICETEXT} was found at ${DEVICENAMES}"
    exit 0
  fi
fi
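
One limitation of the plain grep in the script above: a device name such as pass1 would also match pass11. The following is a minimal sketch of a stricter match, not what I run in production. The function name device_present is hypothetical, and it reads the devlist output on stdin so it can be tried against a saved capture without any tape hardware attached.

```shell
#!/bin/sh
# Sketch only: device_present is a hypothetical helper, not part of
# check_camcontrol.sh. It reads `camcontrol devlist` output on stdin and
# succeeds when a line containing the model text also lists the given
# peripheral name inside its trailing (name,name) group.
device_present() {
  model=$1
  periph=$2
  # -F treats the model text as a fixed string, not a regular expression.
  # The second grep requires "(" or "," before the name and "," or ")"
  # after it, so pass1 does not match pass11. This assumes the peripheral
  # name is plain alphanumerics (ch0, sa1, pass12, ...).
  grep -F -- "$model" | grep -q -E "[(,]${periph}[,)]"
}
```

For example, camcontrol devlist | device_present TL800 ch0 returns 0 when the TL800 line lists ch0, and non-zero otherwise.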

To use this, I added the following to /usr/local/etc/nrpe.cfg on the system to which the tape libraries are attached:

# check that various devices are attached and running
command[check_ch0]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh TL800 ch0
command[check_pass11]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh TL800 pass11
command[check_sa0]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh TZ89 sa0
command[check_pass13]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh TZ89 pass13

command[check_ch1]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh OVERLAND ch1
command[check_pass14]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh OVERLAND pass14
command[check_sa1]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh DLT8000 sa1
command[check_pass12]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh DLT8000 pass12

Because camcontrol requires superuser privileges, I added the following entries via visudo:

nagios   ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh TL800 ch0
nagios   ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh TL800 pass11
nagios   ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh TZ89 sa0
nagios   ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh TZ89 pass13

nagios   ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh OVERLAND ch1
nagios   ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh OVERLAND pass14
nagios   ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh DLT8000 sa1
nagios   ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh DLT8000 pass12
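
The nrpe.cfg and sudoers entries above are the same eight model/device pairs written out twice, so they are easy to get out of step. As a rough sketch, they could be generated from one list. The function names nrpe_lines and sudoers_lines and the CHECK variable are made up for this example; the paths are the ones used above. Each function reads "model device" pairs on stdin, one pair per line.

```shell
#!/bin/sh
# Sketch only: nrpe_lines, sudoers_lines and CHECK are hypothetical names.
# Input on stdin: one "model peripheral" pair per line, e.g. "TL800 ch0".
CHECK=/usr/local/libexec/nagios/check_camcontrol.sh

# Print one nrpe.cfg command definition per input pair.
nrpe_lines() {
  while read -r model periph; do
    [ -n "$periph" ] || continue   # skip blank or incomplete lines
    printf 'command[check_%s]=/usr/local/bin/sudo %s %s %s\n' \
      "$periph" "$CHECK" "$model" "$periph"
  done
}

# Print the matching sudoers entry for each input pair.
sudoers_lines() {
  while read -r model periph; do
    [ -n "$periph" ] || continue   # skip blank or incomplete lines
    printf 'nagios   ALL=(ALL) NOPASSWD:%s %s %s\n' \
      "$CHECK" "$model" "$periph"
  done
}
```

Feeding the same list of pairs to both functions keeps the two files in agreement. The sudoers output should still be pasted in through visudo rather than written to the file directly.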

The Nagios command

I use Fruity, a now-deprecated web-based Nagios configuration tool. So instead of demonstrating the Fruity commands, I’ll show you what I found in my Nagios configuration files for pass11. You can extend it from there.

From serviceextinfo.cfg:

define serviceextinfo{  
host_name kraken
service_description check_pass11
}

From services.cfg:

define service {
    use remote-service
    service_description check_pass11
    check_command check_nrpe!check_pass11
    host_name kraken
    contact_groups admins
}
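
If you maintain the Nagios files by hand rather than through a tool like Fruity, extending the pass11 definition to the other checks is mostly copy and paste. Here is a small sketch that prints a service definition in the same shape as the one above. The function name service_block is hypothetical, and it assumes the remote-service template and the admins contact group already exist in your configuration, as they do in mine.

```shell
#!/bin/sh
# Sketch only: service_block is a hypothetical helper. It prints one Nagios
# service definition modelled on the check_pass11 entry shown above.
# Arguments: host name, then the NRPE command name (e.g. check_sa0).
service_block() {
  host=$1
  check=$2
  cat <<EOF
define service {
    use remote-service
    service_description $check
    check_command check_nrpe!$check
    host_name $host
    contact_groups admins
}
EOF
}
```

Calling service_block kraken check_sa0, and likewise for each of the other command names from nrpe.cfg, gives one definition per check to append to services.cfg.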

What happened?

After restarting nrpe on the client machine, Nagios started telling me about the missing devices. The following is a heavily butchered copy/paste from my Nagios web page:

Host    Service       Status    Last Check           Duration       Attempt  Status Information
kraken  check_ch1     CRITICAL  12-15-2012 17:25:57  0d 0h 20m 13s  4/4      OVERLAND not found in camcontrol output
        check_pass12  CRITICAL  12-15-2012 17:25:13  0d 0h 20m 57s  4/4      DLT8000 not found in camcontrol output
        check_pass13  CRITICAL  12-15-2012 17:25:19  0d 0h 20m 51s  4/4      TZ89 not found in camcontrol output
        check_pass14  CRITICAL  12-15-2012 17:25:24  0d 0h 20m 46s  4/4      OVERLAND not found in camcontrol output
        check_sa0     CRITICAL  12-15-2012 17:25:30  0d 0h 20m 40s  4/4      TZ89 not found in camcontrol output
        check_sa1     CRITICAL  12-15-2012 17:25:35  0d 0h 20m 35s  4/4      DLT8000 not found in camcontrol output

Now, let’s rescan the devices and see what we have.

[dan@kraken:~] $ sudo camcontrol rescan 12
Password:
Re-scan of bus 12 was successful
[dan@kraken:~] $ sudo camcontrol devlist
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus0 target 0 lun 0 (pass0,ada0)
<Hitachi HDS722020ALA330 JKAOA3MA>  at scbus1 target 0 lun 0 (pass1,ada1)
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus2 target 0 lun 0 (pass2,ada2)
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus3 target 0 lun 0 (pass3,ada3)
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus4 target 0 lun 0 (pass4,ada4)
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus5 target 0 lun 0 (pass5,ada5)
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus6 target 0 lun 0 (pass6,ada6)
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus7 target 0 lun 0 (pass7,ada7)
<ST380815AS 4.AAB>                 at scbus8 target 0 lun 0 (pass8,ada8)
<TSSTcorp CDDVDW SH-S223C SB01>    at scbus9 target 0 lun 0 (pass9,cd0)
<WDC WD1600AAJS-75M0A0 02.03E02>   at scbus10 target 0 lun 0 (pass10,ada9)
<DEC TL800    (C) DEC 0326>        at scbus12 target 0 lun 0 (pass11,ch0)
<QUANTUM DLT8000 0250>             at scbus12 target 4 lun 0 (pass12,sa1)
<DEC TZ89     (C) DEC 2561>        at scbus12 target 5 lun 0 (pass13,sa0)
<OVERLAND LXB 0524>                at scbus12 target 6 lun 0 (ch1,pass14)
[dan@kraken:~] $
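
The device names Nagios cares about are the ones in the trailing parentheses of each devlist line. As a quick sketch for checking by hand, the two hypothetical functions below (list_periphs and missing_periphs are names I made up here) pull those names out and report which expected ones are absent. Both read devlist output on stdin.

```shell
#!/bin/sh
# Sketch only: list_periphs and missing_periphs are hypothetical helpers.

# Print every peripheral name found in `camcontrol devlist` output,
# one per line. Lines without a trailing (...) group are ignored.
list_periphs() {
  # Keep only the last parenthesised group on each line, then split it
  # on commas: "(pass11,ch0)" becomes "pass11" and "ch0".
  sed -n 's/.*(\([^()]*\))[[:space:]]*$/\1/p' | tr ',' '\n'
}

# Print each expected name (given as arguments) that is NOT present
# in the devlist output read from stdin.
missing_periphs() {
  found=$(list_periphs)
  for want in "$@"; do
    # -x matches the whole line, so sa1 is not satisfied by sa10.
    printf '%s\n' "$found" | grep -q -x -F -- "$want" || printf '%s\n' "$want"
  done
}
```

For example, sudo camcontrol devlist | missing_periphs ch0 sa0 ch1 sa1 prints nothing when all four are attached. Anything it does print is a candidate for a camcontrol rescan of the relevant bus, as shown above.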

OK, there are the devices. Now let’s wait and check Nagios…

Well, that didn’t take long. I already have an email titled: ** RECOVERY alert – The ZFS file server/check_ch1 is OK *

I waited another few minutes and everything cleared up. Great. Next time I’m messing around with the tape libraries, this should not happen.

Time will tell

Time will tell if this helps me avoid future failed jobs. In this case, the failure was more an annoyance than anything critical. However, keeping track of important devices and services is what Nagios is all about.