I use Bacula for my backups. Most of my backups are done to disk, then copied to tape. I have long liked the idea of having multiple backups on different media. On a regular basis, I move the FULL backup tapes to an offsite location. This is standard practice and highly recommended if you can do it. I don’t use the latest and greatest tape drives. I’m using DLT drives. I like them because they’re pretty straightforward and cunningly clever. Such drives are readily available at decent prices. Data centers are upgrading to LTO technology and, thankfully, not all of them are tossing their DLT equipment into the dustbin.
This morning, every one of my copy-to-tape jobs failed.
    15-Dec 09:41 bacula-dir JobId 114139: Start Copying JobId 114139, Job=CopyToTape-Inc.2012-12-15_09.32.02_32
    15-Dec 09:41 kraken-sd JobId 114139: Error: dev.c:120 Unable to stat device /dev/nsa0: ERR=No such file or directory
    15-Dec 09:41 kraken-sd JobId 114139: Warning: Device "DTL01" in changer "DTL01" requested by DIR could not be opened or does not exist.
    15-Dec 09:41 kraken-sd JobId 114139: Error: dev.c:120 Unable to stat device /dev/nsa0: ERR=No such file or directory
    15-Dec 09:41 kraken-sd JobId 114139: Warning: Device "DTL01" in changer "DTL01" requested by DIR could not be opened or does not exist.
    15-Dec 09:41 kraken-sd JobId 114139: Error: dev.c:120 Unable to stat device /dev/nsa0: ERR=No such file or directory
    15-Dec 09:41 kraken-sd JobId 114139: Warning: Device "DTL01" in changer "DTL01" requested by DIR could not be opened or does not exist.
    15-Dec 09:41 kraken-sd JobId 114139: Fatal error: Device reservation failed for JobId=114139: Jmsg Job=CopyToTape-Inc.2012-12-15_09.32.02_32 type=5 level=1355564516 kraken-sd JobId 114139: Warning: Device "DTL01" in changer "DTL01" requested by DIR could not be opened or does not exist.
    15-Dec 09:41 bacula-dir JobId 114139: Fatal error: Storage daemon didn't accept Device "DTL01" because: 3924 Device "DTL01" not in SD Device resources.
I instantly knew what had happened. Last night I installed new batteries in all of my UPS units. This necessitated powering off all the systems. When I powered them up, I did so in the wrong order. My tape libraries are external units attached by SCSI cables. I should have powered up the tape libraries first, then the server to which they are attached. If I had done that, the system would have detected the tape libraries and the jobs would have run. Doing so in the other order (server first, tape libraries later) means that the server does not see the tape libraries until later. This is not the first time I have done this.
In order to save myself from myself, I will now attempt to create a Nagios check to verify that the proper devices are attached when the system starts up.
The Nagios check
This is the script I wrote. It’s pretty ugly, and not extremely flexible if you have many similar devices. However, it’s working for me. From /usr/local/libexec/nagios/check_camcontrol.sh:
    #!/bin/sh
    # NRPE check for camcontrol
    # Written by: Dan Langille <dan@langille.org>
    # version 1.0
    #
    # Covered by the two clause BSD license (which is not included here...)
    #

    PATH="/sbin:/bin:/usr/sbin:/usr/bin"

    # e.g. DEC TL800
    DEVICETEXT=$1

    # e.g. (pass0,ada0)
    DEVICENAMES=$2

    if [ -x "/sbin/camcontrol" ]
    then
      # search the output of camcontrol for the device text
      DEVICE=`/sbin/camcontrol devlist | grep "${DEVICETEXT}"`
    else
      # without camcontrol we cannot tell either way: report UNKNOWN (3)
      echo "camcontrol binary does not exist on system"
      exit 3
    fi

    if [ "${DEVICE}" = "" ]
    then
      echo "${DEVICETEXT} not found in camcontrol output"
      exit 2
    else
      # we found the device text, but is it at the right device?
      FOUNDDEVICENAMES=`echo "${DEVICE}" | grep "${DEVICENAMES}"`
      if [ "${FOUNDDEVICENAMES}" = "" ]
      then
        echo "${DEVICETEXT} was found, but not at ${DEVICENAMES}"
        exit 2
      else
        echo "${DEVICETEXT} was found at ${DEVICENAMES}"
        exit 0
      fi
    fi
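One thing to be aware of: the script matches the device name with a plain grep, so a short name such as pass1 would also match pass11. If that ever matters, a stricter match could look something like the sketch below. The helper name devlist_has_device is my own invention for illustration, and it assumes device names contain only letters and digits. It reads devlist output on standard input, so it can be tried against sample text without a tape library attached.

```sh
#!/bin/sh
# Sketch: match a device name only as a whole entry inside the
# parenthesised list that camcontrol devlist prints, e.g. (pass11,ch0).
# devlist_has_device is a hypothetical helper, not part of the script above.

# usage: devlist_has_device "DEVICE TEXT" "devname" < devlist-output
# returns 0 when some line contains the text and lists devname exactly
devlist_has_device() {
    text=$1
    name=$2
    # first grep:  -F treats the device text as a fixed string
    # second grep: the name must sit between "(" or "," and "," or ")"
    grep -F -- "$text" | grep -Eq "[(,]${name}[,)]"
}

# Two sample lines taken from the devlist output shown in this post
SAMPLE='<DEC TL800 (C) DEC 0326> at scbus12 target 0 lun 0 (pass11,ch0)
<OVERLAND LXB 0524> at scbus12 target 6 lun 0 (ch1,pass14)'

if printf '%s\n' "$SAMPLE" | devlist_has_device "TL800" "pass11"; then
    echo "TL800 was found at pass11"
fi

if ! printf '%s\n' "$SAMPLE" | devlist_has_device "TL800" "pass1"; then
    echo "TL800 was found, but not at pass1"
fi
```

On a live system the input would come from /sbin/camcontrol devlist instead of the sample text.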
To use this, I added the following to /usr/local/etc/nrpe.cfg on the system to which the tape libraries are attached:
    # check that various devices are attached and running
    command[check_ch0]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh TL800 ch0
    command[check_pass11]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh TL800 pass11
    command[check_sa0]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh TZ89 sa0
    command[check_pass13]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh TZ89 pass13
    command[check_ch1]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh OVERLAND ch1
    command[check_pass14]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh OVERLAND pass14
    command[check_sa1]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh DLT8000 sa1
    command[check_pass12]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_camcontrol.sh DLT8000 pass12
Because camcontrol requires superuser privileges, I added the following entries via visudo:
    nagios ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh TL800 ch0
    nagios ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh TL800 pass11
    nagios ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh TZ89 sa0
    nagios ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh TZ89 pass13
    nagios ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh OVERLAND ch1
    nagios ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh OVERLAND pass14
    nagios ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh DLT8000 sa1
    nagios ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_camcontrol.sh DLT8000 pass12
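The nrpe.cfg and sudoers entries repeat the same eight device pairs, so it is easy for the two to drift apart. One way to keep them in step might be to generate both from a single list. This is only a sketch: the DEVICES variable and the gen_nrpe and gen_sudoers helpers are names I made up for illustration, while the paths and device pairs are the ones used above.

```sh
#!/bin/sh
# Sketch: generate the nrpe.cfg and sudoers lines from one device list.
# DEVICES, gen_nrpe and gen_sudoers are illustrative names only.

CHECK=/usr/local/libexec/nagios/check_camcontrol.sh

# one "device-text device-name" pair per line, as used in this post
DEVICES='TL800 ch0
TL800 pass11
TZ89 sa0
TZ89 pass13
OVERLAND ch1
OVERLAND pass14
DLT8000 sa1
DLT8000 pass12'

# print one NRPE command definition per device pair
gen_nrpe() {
    printf '%s\n' "$DEVICES" | while read -r text name; do
        printf 'command[check_%s]=/usr/local/bin/sudo %s %s %s\n' \
            "$name" "$CHECK" "$text" "$name"
    done
}

# print one sudoers rule per device pair
gen_sudoers() {
    printf '%s\n' "$DEVICES" | while read -r text name; do
        printf 'nagios ALL=(ALL) NOPASSWD:%s %s %s\n' "$CHECK" "$text" "$name"
    done
}

gen_nrpe
gen_sudoers
```

The script only prints the lines. I would still paste the sudoers part in through visudo rather than writing to the file directly, so that a syntax error cannot lock sudo.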
The Nagios command
I use Fruity, a now-deprecated web-based Nagios configuration tool. So instead of demonstrating the Fruity commands, I’ll show you what I found in my Nagios configuration files for pass11. You can extend it from there.
From serviceextinfo.cfg:
    define serviceextinfo{
        host_name            kraken
        service_description  check_pass11
    }
From services.cfg:
    define service {
        use                  remote-service
        service_description  check_pass11
        check_command        check_nrpe!check_pass11
        host_name            kraken
        contact_groups       admins
    }
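If you maintain the Nagios files by hand rather than through Fruity, extending the pass11 definition to the other seven checks is mostly copy and paste. A small loop could print them instead. This is a sketch under the assumption that every check uses the same template, host, and contact group as pass11; the HOST variable and the gen_service helper are hypothetical names.

```sh
#!/bin/sh
# Sketch: print a service definition for each NRPE check.
# HOST and gen_service are illustrative names; the field values are
# copied from the check_pass11 definition above.

HOST=kraken

# usage: gen_service devname
gen_service() {
    cat <<EOF
define service {
    use                  remote-service
    service_description  check_$1
    check_command        check_nrpe!check_$1
    host_name            $HOST
    contact_groups       admins
}
EOF
}

for name in ch0 pass11 sa0 pass13 ch1 pass14 sa1 pass12; do
    gen_service "$name"
done
```

The output can be redirected to a scratch file and reviewed before anything is merged into services.cfg.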
What happened?
After restarting nrpe on the client machine, Nagios started telling me about the missing devices. The following is a heavily butchered copy/paste from my Nagios webpage:
Host | Service | Status | Last Check | Duration | Attempt | Status Information
---|---|---|---|---|---|---
 | | CRITICAL | 12-15-2012 17:25:57 | 0d 0h 20m 13s | 4/4 | OVERLAND not found in camcontrol output
 | | CRITICAL | 12-15-2012 17:25:13 | 0d 0h 20m 57s | 4/4 | DLT8000 not found in camcontrol output
 | | CRITICAL | 12-15-2012 17:25:19 | 0d 0h 20m 51s | 4/4 | TZ89 not found in camcontrol output
 | | CRITICAL | 12-15-2012 17:25:24 | 0d 0h 20m 46s | 4/4 | OVERLAND not found in camcontrol output
 | | CRITICAL | 12-15-2012 17:25:30 | 0d 0h 20m 40s | 4/4 | TZ89 not found in camcontrol output
 | | CRITICAL | 12-15-2012 17:25:35 | 0d 0h 20m 35s | 4/4 | DLT8000 not found in camcontrol output
Now, let’s rescan the devices and see what we have.
    [dan@kraken:~] $ sudo camcontrol rescan 12
    Password:
    Re-scan of bus 12 was successful
    [dan@kraken:~] $ sudo camcontrol devlist
    <Hitachi HDS722020ALA330 JKAOA28A>  at scbus0 target 0 lun 0 (pass0,ada0)
    <Hitachi HDS722020ALA330 JKAOA3MA>  at scbus1 target 0 lun 0 (pass1,ada1)
    <Hitachi HDS722020ALA330 JKAOA28A>  at scbus2 target 0 lun 0 (pass2,ada2)
    <Hitachi HDS722020ALA330 JKAOA28A>  at scbus3 target 0 lun 0 (pass3,ada3)
    <Hitachi HDS722020ALA330 JKAOA28A>  at scbus4 target 0 lun 0 (pass4,ada4)
    <Hitachi HDS722020ALA330 JKAOA28A>  at scbus5 target 0 lun 0 (pass5,ada5)
    <Hitachi HDS722020ALA330 JKAOA28A>  at scbus6 target 0 lun 0 (pass6,ada6)
    <Hitachi HDS722020ALA330 JKAOA28A>  at scbus7 target 0 lun 0 (pass7,ada7)
    <ST380815AS 4.AAB>                  at scbus8 target 0 lun 0 (pass8,ada8)
    <TSSTcorp CDDVDW SH-S223C SB01>     at scbus9 target 0 lun 0 (pass9,cd0)
    <WDC WD1600AAJS-75M0A0 02.03E02>    at scbus10 target 0 lun 0 (pass10,ada9)
    <DEC TL800 (C) DEC 0326>            at scbus12 target 0 lun 0 (pass11,ch0)
    <QUANTUM DLT8000 0250>              at scbus12 target 4 lun 0 (pass12,sa1)
    <DEC TZ89 (C) DEC 2561>             at scbus12 target 5 lun 0 (pass13,sa0)
    <OVERLAND LXB 0524>                 at scbus12 target 6 lun 0 (ch1,pass14)
    [dan@kraken:~] $
OK, there are the devices. Now let’s wait and check Nagios…
Well, that didn’t take long. I already have an email titled: ** RECOVERY alert – The ZFS file server/check_ch1 is OK **
I waited another few minutes and everything cleared up. Great. Next time I’m messing around with the tape libraries, this should not happen.
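The Bacula error at the top of this post was about a missing device node (/dev/nsa0), so before rerunning the copy jobs it may also be worth confirming that the nodes themselves are back. A minimal sketch of that idea follows. The helper name missing_nodes is hypothetical, and nsa1 is my assumption for the node belonging to the second drive (sa1); only nsa0 appears in the log. The directory is a parameter so the function can be tried against a scratch directory instead of /dev.

```sh
#!/bin/sh
# Sketch: report which expected device nodes are missing.
# missing_nodes is a hypothetical helper, not part of the Nagios check.

# usage: missing_nodes DIR NODE...
# prints the path of each missing node; returns 1 if any are missing
missing_nodes() {
    dir=$1
    shift
    rc=0
    for node in "$@"; do
        if [ ! -e "$dir/$node" ]; then
            echo "$dir/$node"
            rc=1
        fi
    done
    return $rc
}

# nsa0 comes from the Bacula error; nsa1 is assumed for the second drive
if missing=$(missing_nodes /dev nsa0 nsa1); then
    echo "all tape device nodes present"
else
    echo "missing: $missing"
fi
```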
Time will tell
Time will tell if this helps me avoid future failed jobs. In this case, the failure was more of an annoyance than anything critical. However, keeping track of important devices and services is what Nagios is all about.