I’ve been using Nagios for a while. I use it to monitor many things, ranging from disk space to disk temperature. One of the packages I use for this is net-mgmt/nagios-check_smartmon. This code seems to be getting out of date: according to the timestamp at the top of the file, it was last updated on 2006-03-24 10:30:20.
So it’s not surprising that it fails to work properly in a few cases. I have encountered one such case.
I have a system with several hard drives, all of which happen to be TOSHIBA (the brand is not important). What is relevant is how those drives are connected. Several are attached to a SATA card:
mps0: <LSI SAS2008> port 0xc000-0xc0ff mem 0xfe83c000-0xfe83ffff,0xfe840000-0xfe87ffff irq 44 at device 0.0 on pci1
The others are attached to the motherboard. The difference can be seen in /var/run/dmesg.boot:
# grep TOSH /var/run/dmesg.boot
ada0: ATA-8 SATA 3.x device
ada1: ATA-8 SATA 3.x device
ada2: ATA-8 SATA 3.x device
ada3: ATA-8 SATA 3.x device
ada4: ATA-8 SATA 3.x device
da1: Fixed Direct Access SCSI-6 device
da4: Fixed Direct Access SCSI-6 device
da0: Fixed Direct Access SCSI-6 device
da3: Fixed Direct Access SCSI-6 device
da2: Fixed Direct Access SCSI-6 device
Some HDDs are presented to the system as ATA devices (see ada(4)), while others are presented as SCSI devices (see da(4)).
It is the devices attached to the SATA card which are presented as SCSI devices:
da0 at mps0 bus 0 scbus0 target 2 lun 0
da0: <ATA TOSHIBA DT01ACA3 ABB0> Fixed Direct Access SCSI-6 device
da0: 600.000MB/s transfers
da0: Command Queueing enabled
da0: 2861588MB (5860533168 512 byte sectors: 255H 63S/T 364801C)
The problem arises with the da devices, not the ada devices. Here is a working example:
# /usr/local/libexec/nagios/check_smartmon -d /dev/ada0
OK: device is functional and stable (temperature: 33)|TEMP=33;55;60;
And a failure:
# /usr/local/libexec/nagios/check_smartmon -d /dev/da0
Traceback (most recent call last):
  File "/usr/local/libexec/nagios/check_smartmon", line 307, in <module>
    (healthStatus, temperature) = parseOutput(healthStatusOutput, temperatureOutput, devtype)
  File "/usr/local/libexec/nagios/check_smartmon", line 216, in parseOutput
    vprint(3, "Health status: %s" % healthStatus)
UnboundLocalError: local variable 'healthStatus' referenced before assignment
The problem seems to be that the script is unable to correctly determine the device type (i.e. ATA versus SCSI). It contains a special case for FreeBSD SCSI devices, and it attempts to use that here. This is where it fails: these are ATA devices, not SCSI. The extraction of the health status therefore fails, because the script is looking for SCSI-format output within ATA output.
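The traceback is consistent with a common Python pattern: a variable that is assigned only inside a conditional branch of a parsing loop. The sketch below is not the code from check_smartmon. The function name, the sample output line, and the strings being matched are assumptions chosen for illustration, and it is written for Python 3.

```python
def parse_health(output, devtype):
    """Hypothetical parser: binds health_status only when a line matches."""
    for line in output.splitlines():
        if devtype == "scsi" and line.startswith("SMART Health Status:"):
            health_status = line.split(":", 1)[1].strip()
        elif devtype == "ata" and line.startswith("SMART overall-health"):
            health_status = line.split(":", 1)[1].strip()
    # If no line matched, health_status was never bound, and this line
    # raises UnboundLocalError, much like the traceback above.
    return health_status


# Assumed sample of ATA-style health output (illustrative only).
ata_output = "SMART overall-health self-assessment test result: PASSED"

print(parse_health(ata_output, "ata"))   # parsed with the matching type
try:
    parse_health(ata_output, "scsi")     # wrong device type for this output
except UnboundLocalError as exc:
    print("failed:", exc)
```

Initialising the variable before the loop, or raising a clear error when nothing matches, would turn the traceback into a readable plugin status, but that is a separate change from the one described in this post.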
Fortunately, the code allows you to specify the device type on the command line:
# /usr/local/libexec/nagios/check_smartmon -h
Usage: check_smartmon [options] device

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -d DEVICE, --device=DEVICE
                        device to check
  -v LEVEL, --verbosity=LEVEL
                        set verbosity level to LEVEL; defaults to 0 (quiet),
                        possible values go up to 3
  -t DEVTYPE, --type=DEVTYPE
                        type of device (ATA|SCSI)
  -w TEMP, --warning-threshold=TEMP
                        set temperature warning threshold to given temperature
                        (defaults to 55)
  -c TEMP, --critical-threshold=TEMP
                        set temperature critical threshold to given
                        temperature (defaults to 60)
Unfortunately, it appears to fail to use this value appropriately. Look at this code:
# check device type, ATA is default
vprint(2, "Get device type")
devtype = options.devtype
if not devtype:
    devtype = "ATA"

if device_re.search( device ):
    devtype = "scsi"
options.devtype is optionally assigned from the command line argument -t (or --type). But lines 296 and 297 of the script (the final if statement above) will overwrite any value set on the command line for a FreeBSD SCSI device. This effectively ignores the -t argument.
My solution is to not do this assignment if devtype is already specified. Here is my code:
# check device type, ATA is default
vprint(2, "Get device type")
devtype = options.devtype
if not devtype:
    if device_re.search( device ):
        devtype = "scsi"
    else:
        devtype = "ATA"
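To make the difference between the two versions concrete, here is a minimal, self-contained sketch that wraps each one in a function. The function names are mine, and the regular expression is a hypothetical stand-in for the script's device_re, whose real pattern is not shown in this post. Passing None models the case where -t was not given.

```python
import re

# Hypothetical stand-in for the script's device_re (assumption): it matches
# FreeBSD da(4) device nodes such as /dev/da0, but not /dev/ada0.
device_re = re.compile(r"^/dev/da[0-9]+$")


def original_devtype(cli_type, device):
    """Original logic: a regex match overrides whatever -t supplied."""
    devtype = cli_type
    if not devtype:
        devtype = "ATA"
    if device_re.search(device):
        devtype = "scsi"
    return devtype


def patched_devtype(cli_type, device):
    """Patched logic: guess from the device name only when -t is absent."""
    devtype = cli_type
    if not devtype:
        if device_re.search(device):
            devtype = "scsi"
        else:
            devtype = "ATA"
    return devtype


print(original_devtype("ata", "/dev/da0"))  # the -t value is overwritten
print(patched_devtype("ata", "/dev/da0"))   # the -t value is kept
print(patched_devtype(None, "/dev/da0"))    # falls back to the name-based guess
```

With the original logic, a device named /dev/da0 is always treated as SCSI. With the patched logic, the name-based guess is only a fallback.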
Now, when we specify the device type (using my code in root’s home directory), it works:
# /root/check_smartmon -d /dev/da0 -t ata
OK: device is functional and stable (temperature: 33)|TEMP=33;55;60;
Note that I have specified ata, not ATA. This differs from what the help says: type of device (ATA|SCSI)
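If the script compares the type string case-sensitively (my inference from the above, not something I have traced through the code), another way to avoid the mismatch would be to fold the argument to lower case before using it. This is only a sketch using the standard optparse module. The helper name parse_devtype is hypothetical and is not part of check_smartmon.

```python
from optparse import OptionParser


def parse_devtype(argv):
    """Hypothetical helper: read -t/--type and fold it to lower case."""
    parser = OptionParser()
    parser.add_option("-t", "--type", action="store", dest="devtype",
                      metavar="DEVTYPE", help="type of device (ata|scsi)")
    options, _args = parser.parse_args(argv)
    # Normalise so that "ATA", "Ata" and "ata" are all treated alike.
    # With no -t on the command line, optparse leaves devtype as None.
    return options.devtype.lower() if options.devtype else None


print(parse_devtype(["-t", "ATA"]))
print(parse_devtype(["--type=scsi"]))
print(parse_devtype([]))
```

I have not made this change. Correcting the help text is the smaller fix.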
My patch, which fixes both the argument overwrite issue and the help documentation, is:
# diff -u /usr/local/libexec/nagios/check_smartmon /root/check_smartmon
--- /usr/local/libexec/nagios/check_smartmon 2013-07-25 11:40:50.491011205 +0000
+++ /root/check_smartmon 2013-07-25 18:21:39.149894864 +0000
@@ -59,7 +59,7 @@
                       metavar="LEVEL", help="set verbosity level to LEVEL; defaults to 0 (quiet), \
                       possible values go up to 3")
     parser.add_option("-t", "--type", action="store", dest="devtype", default="ata", metavar="DEVTYPE",
-                      help="type of device (ATA|SCSI)")
+                      help="type of device (ata|scsi)")
     parser.add_option("-w", "--warning-threshold", metavar="TEMP", action="store",
                       type="int", dest="warningThreshold", default=55,
                       help="set temperature warning threshold to given temperature (defaults to 55)")
@@ -290,11 +290,12 @@
     # check device type, ATA is default
     vprint(2, "Get device type")
     devtype = options.devtype
+    vprint(2, "command line supplied device type is: %s" % devtype)
     if not devtype:
-        devtype = "ATA"
-
-    if device_re.search( device ):
-        devtype = "scsi"
+        if device_re.search( device ):
+            devtype = "scsi"
+        else:
+            devtype = "ata"
 
     vprint(1, "Device type: %s" % devtype)
 
I have added a new debug statement, shown on line 17 of the patch.
Feel free to use this patch any way you want. No restrictions.
See also http://dan.langille.org/2014/03/02/nagios-smartmon-stops-working-after-python-upgrade/
It seems this problem has arisen again for me
After using this in-house for two years, it’s time to submit a PR. See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=201767
New patch raised after output changes.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235475