Nov 04 2018
 

smartd and smartctl are two utility programs included with the smartmontools package. On FreeBSD, smartmontools is installed via the sysutils/smartmontools port.

Those programs can “control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology System (SMART) built into most modern ATA/SATA, SCSI/SAS and NVMe disks. In many cases, these utilities will provide advanced warning of disk degradation and failure.” See the smartmontools website for more information.

NOTE: “Due to OS-specific issues and also depending on the different state of smartmontools development on the platforms, device support is not the same for all OS platforms.” – use the documentation for your OS.

I first started using smartd in March 2010 (according to that blog post, that’s when I was still writing on both The FreeBSD Diary and this blog). Back then, and until recently, all I did was start smartd. As far as I can tell, all it did was send daily status messages via the FreeBSD periodic tools. I would set my drive devices via daily_status_smart_devices in /etc/periodic.conf and the daily status reports would include drive health information. That output looks like this and arrives in the ‘daily run output’ email, or wherever the <dir>_output specifies, if set via periodic.conf:

SMART status:
Checking health of /dev/ada0: OK
Checking health of /dev/ada1: OK
Checking health of /dev/da0: OK
Checking health of /dev/da1: OK
Checking health of /dev/da2: OK
Checking health of /dev/da3: OK
Checking health of /dev/da4: OK
Checking health of /dev/da5: OK
Checking health of /dev/da6: OK
Checking health of /dev/da7: OK

Looking at /usr/local/etc/periodic/daily/smart, I see that it invokes smartctl using:

${smartctl} ${devflags} ${daily_status_smartctl_flags} ${device} > "${tmpfile}"

daily_status_smartctl_flags defaults to -H which is a basic health check. Perhaps this could be used to invoke a long/short test, but I don’t think that’s the best way to achieve the goal.

Two types of tests

There are two types of tests I have run in the past: short, and long.

The short test might take two minutes. The long test could take 8 hours. You can see some of that output in this post from 2016. When I get a new drive, I run both the short and the long test on it. That is far from exhaustive, but it is a basic check.

Now I see how I can easily configure those tests to run on a regular basis and get emailed about any issues.

My original abandoned attempt

It was nearly a week ago that I tweeted:

I have a strong urge to write a script to run monthly smartctl long tests on my drives. It would stagger them out through the month, one per day.

Or does such a script already exist?

I wrote a small script to enumerate the drives and test one per day and was getting ready to do more work.

Bryce Chidester saved me from all that. Bryce pointed out that smartd already has scheduling built in.

This has saved me so much time and energy.

How do you prove it works?

When I start using a new feature, I like to know that it works. Given this is going to be running tests on a regular basis, I want to see that it does what it says.

Let’s look at this Dell R710 (it is my database server and also runs daily poudriere builds) which has this in /usr/local/etc/smartd.conf:

# DEVICESCAN
/dev/da0 -a -d scsi -m dan@langille.org -s S/../.././15
/dev/da1 -a -d scsi -m dan@langille.org -s S/../.././15
/dev/da2 -a -d scsi -m dan@langille.org -s S/../.././15
/dev/da3 -a -d scsi -m dan@langille.org -s S/../.././15

/dev/da0 -a -d scsi -m dan@langille.org -s L/../28/./23
/dev/da1 -a -d scsi -m dan@langille.org -s L/../01/./01
/dev/da2 -a -d scsi -m dan@langille.org -s L/../02/./01
/dev/da3 -a -d scsi -m dan@langille.org -s L/../03/./01

DEVICESCAN was my old configuration. I now think it’s not very useful.

In my more recent examples, I’ve been combining the long and short tests into a single row. Details appear later in this post.

How did I come up with this configuration? I asked smartctl for information.

[dan@r710-01:~] $ sudo smartctl --scan
/dev/da0 -d scsi # /dev/da0, SCSI device
/dev/da1 -d scsi # /dev/da1, SCSI device
/dev/da2 -d scsi # /dev/da2, SCSI device
/dev/da3 -d scsi # /dev/da3, SCSI device
/dev/da4 -d scsi # /dev/da4, SCSI device
/dev/da5 -d scsi # /dev/da5, SCSI device
/dev/cd0 -d atacam # /dev/cd0, ATA device

I ignored /dev/cd0 because I don’t want to test it. Despite what it says, these drives are not SCSI devices, they are SATA drives. I seem to recall that is one of the features of the Common Access Method.
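Going from the scan output to smartd.conf lines is mechanical, so here is a minimal Python sketch of that step. The function name, the skip_prefixes parameter, and the choice to skip cd and ses devices are assumptions of this sketch, not something smartmontools provides.

```python
def scan_to_smartd_conf(scan_output, email, schedule,
                        skip_prefixes=("/dev/cd", "/dev/ses")):
    """Turn `smartctl --scan` output into smartd.conf lines.

    skip_prefixes is an assumption of this sketch: device names that
    should not be tested (optical drives, enclosure services).
    """
    lines = []
    for raw in scan_output.splitlines():
        entry = raw.split("#", 1)[0].split()  # drop the trailing comment
        if len(entry) != 3 or entry[1] != "-d":
            continue  # not a "<device> -d <type>" line
        device, _, dev_type = entry
        if device.startswith(skip_prefixes):
            continue
        lines.append(f"{device} -a -d {dev_type} -m {email} -s {schedule}")
    return lines


# The scan output shown above, pasted in as sample input.
SCAN = """\
/dev/da0 -d scsi # /dev/da0, SCSI device
/dev/da1 -d scsi # /dev/da1, SCSI device
/dev/da2 -d scsi # /dev/da2, SCSI device
/dev/da3 -d scsi # /dev/da3, SCSI device
/dev/da4 -d scsi # /dev/da4, SCSI device
/dev/da5 -d scsi # /dev/da5, SCSI device
/dev/cd0 -d atacam # /dev/cd0, ATA device
"""

for line in scan_to_smartd_conf(SCAN, "dan@langille.org", "S/../.././15"):
    print(line)
```

This prints one line per da device and leaves out /dev/cd0. Review the result by hand before putting it in smartd.conf.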

We have a list of six drives. Let’s look at what this line does:

/dev/da0 -a -d scsi -m dan@langille.org -s S/../.././15

Reading the FreeBSD smartd.conf man page (remember: the documentation is OS-specific):

  1. -a : the same as turning on a bunch of directives, and is the default setting. Read the docs for more information.
  2. -d : the type of device.
  3. -m : the address to send warning emails. I’ll show you the email I got later in this post.
  4. -s : the self-test schedule follows, in the form of a regex. In my case, I run a Short test (S) every day sometime during the 15th hour.

The most intriguing and powerful feature is that bit after the -s: S/../.././15 which says: run a Short test every day during the 15th hour. smartd does not schedule via cron. Instead, it polls every 30 minutes (by default) and looks for any REGEXP which matches the current local date, time, and day of week. The pattern is matched against a string of the form T/MM/DD/d/HH: test type, month, day of month, day of week, and hour. For more information, search for REGEXP on the man page. My examples later may help.

The S/../.././15 schedule is a simple example. See below for the format I am now using.

Looking at the test results

Here are the test results as of 4 November 2018:

[dan@r710-01:~] $ sudo smartctl -l selftest /dev/da0 
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      8216         -
# 2  Short offline       Completed without error       00%      8192         -
# 3  Short offline       Completed without error       00%      8168         -
# 4  Short offline       Completed without error       00%      8144         -
# 5  Short offline       Completed without error       00%      8120         -
# 6  Short offline       Completed without error       00%      8096         -
# 7  Short offline       Completed without error       00%      8080         -
# 8  Extended offline    Completed without error       00%         7         -
# 9  Short offline       Completed without error       00%         3         -

[dan@r710-01:~] $ 

As you can see, since test number 6 (at 8096 lifetime hours), a short test has been run every 24 hours.
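The 24-hour spacing can also be checked mechanically. This is a minimal Python sketch that reads the LifeTime(hours) column from rows like the ones above. The function name is invented, and it assumes the column layout shown here (the lifetime value is the second field from the right).

```python
def lifetime_hours(selftest_log):
    """Extract the LifeTime(hours) column from `smartctl -l selftest` rows."""
    hours = []
    for line in selftest_log.splitlines():
        if not line.startswith("#"):
            continue  # skip headers and blank lines
        fields = line.split()
        # Counting from the right avoids the variable-width status text:
        # ... Remaining  LifeTime(hours)  LBA_of_first_error
        hours.append(int(fields[-2]))
    return hours


# The seven most recent rows from the da0 log above (newest first).
LOG = """\
# 1  Short offline       Completed without error       00%      8216         -
# 2  Short offline       Completed without error       00%      8192         -
# 3  Short offline       Completed without error       00%      8168         -
# 4  Short offline       Completed without error       00%      8144         -
# 5  Short offline       Completed without error       00%      8120         -
# 6  Short offline       Completed without error       00%      8096         -
# 7  Short offline       Completed without error       00%      8080         -
"""

hours = lifetime_hours(LOG)
gaps = [newer - older for newer, older in zip(hours, hours[1:])]
print(gaps)
```

The gaps come out as five 24-hour intervals followed by one 16-hour interval, which is the point where the daily schedule took over.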

OK, that proves to me that the configuration is working. How do I prove that I will get notified of errors? What does that look like?

Failed drive to the rescue

I just happen to have a failed drive. It is not completely dead, but it does produce errors via SMART tests. I recently replaced that drive with a new drive because of those failures.

Let’s look at the configuration on that system:

$ cat /usr/local/etc/smartd.conf
/dev/da18 -a -d scsi -m dan@langille.org -s S/../.././11
/dev/da18 -a -d scsi -m dan@langille.org -s L/../02/./12

Here are the selftest results:

[dan@knew:~] $ sudo smartctl -l selftest /dev/da18
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     23333         -
# 2  Extended offline    Completed: read failure       00%     23324         94063856
# 3  Short offline       Completed without error       00%     23309         -
# 4  Extended offline    Interrupted (host reset)      90%     23295         -
# 5  Short offline       Completed without error       00%     23287         -
# 6  Selective offline   Completed without error       00%     23246         -
# 7  Selective offline   Completed: read failure       00%     23246         94063896
# 8  Selective offline   Completed: read failure       00%     23245         94063856
# 9  Selective offline   Completed: read failure       00%     23245         94063856
#10  Selective offline   Completed without error       00%     23227         -
#11  Selective offline   Completed: read failure       00%     23217         94063856
#12  Selective offline   Completed: read failure       00%     23217         94063872
#13  Selective offline   Completed: read failure       00%     23217         94063872
#14  Selective offline   Completed without error       00%     23199         -
#15  Selective offline   Completed: read failure       00%     23199         94063856
#16  Selective offline   Completed: read failure       00%     23191         94063856
#17  Selective offline   Completed: read failure       00%     23191         94063856
#18  Extended offline    Completed: read failure       00%     23183         94063856
#19  Short offline       Completed without error       00%     23173         -
#20  Extended offline    Interrupted (host reset)      90%     23104         -
#21  Short offline       Completed without error       00%     23104         -

[dan@knew:~] $ 

Tests 6-17 are my attempt at narrowing down the LBA in question, with the hopes of later using dd to overwrite those blocks and fix the drive. I gave up on that.

Tests #3 and #1 were run according to the configuration as shown above.

Test #2 is the one I am here to talk about. I got an email about that failure:

To: dan@langille.org
Subject: SMART error (SelfTest) detected on host: knew
Message-Id: <20181104025014.C39E250151@knew.int.unixathome.org>
Date: Sun,  4 Nov 2018 02:50:14 +0000 (UTC)
From: Charlie Root <root@knew.int.unixathome.org>
X-Gm-Original-To: dan@langille.org

This message was generated by the smartd daemon running on:

   host name:  knew
   DNS domain: int.unixathome.org

The following warning/error was logged by the smartd daemon:

Device: /dev/da18, Self-Test Log error count increased from 0 to 1

Device info:
[ATA      TOSHIBA MD04ACA5 FP2A], lu id: 0x500003965bd0049b, S/N: 653DK7WAFS9A, 5.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.

OH! SYSLOG, let’s look there.

Nov  4 02:50:15 knew smartd[1052]: Device: /dev/da18, Self-Test Log error count increased from 0 to 1

Umm, that’s not very exciting. There’s nothing there we didn’t already know.

For what it’s worth, I think the above email was generated by this script: /usr/local/etc/smartd_warning.sh which can also be used to invoke a custom plugin (see the next section for more information).

You can override the use of that script via the -M parameter.

smartd.conf I am using

In this section, I’m showing various examples I am using. They might inspire your own ideas.

gelt

This is from gelt, and I don’t recall how I named this server. I think it was someone’s IRC nickname.

[dan@gelt:~] $ sudo smartctl --scan
/dev/ada0 -d atacam # /dev/ada0, ATA device
/dev/ada1 -d atacam # /dev/ada1, ATA device
[dan@gelt:~] $ cat /usr/local/etc/smartd.conf
/dev/ada0 -a -d atacam -m dan@langille.org -s (S/../.././15|L/../../2/01) # /dev/ada0, ATA device
/dev/ada1 -a -d atacam -m dan@langille.org -s (S/../.././15|L/../../4/01) # /dev/ada1, ATA device

#/dev/ada0 -a -d atacam -m dan@langille.org -s L/../01/./01 # /dev/ada0, ATA device
#/dev/ada1 -a -d atacam -m dan@langille.org -s L/../01/./01 # /dev/ada1, ATA device
[dan@gelt:~] $ 

In the above, I run a short test every day between 1500 and 1600, and a long test between 0100 and 0200 on Tuesday (ada0) and Thursday (ada1). You can see the two lines I commented out when I combined the short and long tests for a given device into one line.

I decided not to run all the long tests on the same day. I have no sound reason for this decision. These tests, on my 5TB drives, take about 8-9 hours. I decided I didn’t want 20 drives running tests all at the same time. I then applied this to all servers, despite them having much smaller drives.

I took care to make sure the short and long test did not overlap.
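One way to double-check for overlap is to walk through a year of hourly slots. This Python sketch does that. It assumes the schedule is matched in full against T/MM/DD/d/HH, uses Python's re instead of POSIX regular expressions, and takes an assumed long-test run time (long_hours); the function names are invented for the sketch.

```python
import re
from datetime import datetime, timedelta


def slot_string(test_type, when):
    """T/MM/DD/d/HH for one test type and one hour."""
    return f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"


def overlaps(schedule, year, long_hours=9):
    """Return long-test start times during which a short test would also start.

    long_hours is an assumed run time for the long test, in hours.
    """
    clashes = []
    when = datetime(year, 1, 1)
    while when.year == year:
        if re.fullmatch(schedule, slot_string("L", when)):
            window = (when + timedelta(hours=h) for h in range(long_hours + 1))
            if any(re.fullmatch(schedule, slot_string("S", t)) for t in window):
                clashes.append(when)
        when += timedelta(hours=1)
    return clashes


# gelt, ada0: short at hour 15, long on Tuesdays at hour 01.
print(len(overlaps("(S/../.././15|L/../../2/01)", 2018)))
# Hypothetical bad schedule: the short test at hour 05 lands inside the long test.
print(len(overlaps("(S/../.././05|L/../../2/01)", 2018)))
```

The gelt schedule reports no clashes, while the hypothetical one clashes on every Tuesday of the year. The check only looks at scheduled start times; it says nothing about what the drive or smartd does when two tests collide.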

x8dtu

This is from x8dtu, the FreshPorts server, named after the m/b contained therein:

[dan@x8dtu:~] $ sudo smartctl  --scan
/dev/ada0 -d atacam # /dev/ada0, ATA device
/dev/ada1 -d atacam # /dev/ada1, ATA device
/dev/ada2 -d atacam # /dev/ada2, ATA device
/dev/ada3 -d atacam # /dev/ada3, ATA device
/dev/cd0 -d atacam # /dev/cd0, ATA device
/dev/ses0 -d atacam # /dev/ses0, ATA device
[dan@x8dtu:~] $ 

 $ cat /usr/local/etc/smartd.conf
/dev/ada0 -a -d atacam -m dan@langille.org -s (S/../.././15|L/../../1/01) # /dev/ada0, ATA device
/dev/ada1 -a -d atacam -m dan@langille.org -s (S/../.././15|L/../../3/01) # /dev/ada1, ATA device
/dev/ada2 -a -d atacam -m dan@langille.org -s (S/../.././15|L/../../5/01) # /dev/ada2, ATA device
/dev/ada3 -a -d atacam -m dan@langille.org -s (S/../.././15|L/../../7/01) # /dev/ada3, ATA device

Here, I run a short test daily between 1500 and 1600, and long tests on Monday, Wednesday, Friday, and Sunday.

slocum

This is slocum, named after Joshua Slocum, the first person to sail single-handedly around the world.

[dan@slocum:~] $ sudo smartctl --scan
/dev/da0 -d scsi # /dev/da0, SCSI device
/dev/da1 -d scsi # /dev/da1, SCSI device
/dev/da2 -d scsi # /dev/da2, SCSI device
/dev/da3 -d scsi # /dev/da3, SCSI device
/dev/da4 -d scsi # /dev/da4, SCSI device
/dev/da5 -d scsi # /dev/da5, SCSI device
/dev/da6 -d scsi # /dev/da6, SCSI device
/dev/da7 -d scsi # /dev/da7, SCSI device
/dev/ada0 -d atacam # /dev/ada0, ATA device
/dev/ada1 -d atacam # /dev/ada1, ATA device
/dev/ses0 -d atacam # /dev/ses0, ATA device
[dan@slocum:~] $ cat /usr/local/etc/smartd.conf
/dev/da0 -a -d scsi -m dan@langille.org -s (S/../.././01|L/../(01|09|17)/./10) # /dev/da0, SCSI device
/dev/da1 -a -d scsi -m dan@langille.org -s (S/../.././02|L/../(02|10|18)/./10) # /dev/da1, SCSI device
/dev/da2 -a -d scsi -m dan@langille.org -s (S/../.././03|L/../(03|11|19)/./10) # /dev/da2, SCSI device
/dev/da3 -a -d scsi -m dan@langille.org -s (S/../.././04|L/../(04|12|20)/./10) # /dev/da3, SCSI device
/dev/da4 -a -d scsi -m dan@langille.org -s (S/../.././05|L/../(05|13|21)/./10) # /dev/da4, SCSI device
/dev/da5 -a -d scsi -m dan@langille.org -s (S/../.././06|L/../(06|14|22)/./10) # /dev/da5, SCSI device
/dev/da6 -a -d scsi -m dan@langille.org -s (S/../.././07|L/../(07|15|23)/./10) # /dev/da6, SCSI device
/dev/da7 -a -d scsi -m dan@langille.org -s (S/../.././08|L/../(08|16|24)/./10) # /dev/da7, SCSI device
[dan@slocum:~] $ 

Here, I run a long test three times a month for each drive. For example, da0 gets tested on the 1st, 9th, and 17th of the month.
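Lines in this staggered style follow a simple rule, so they can be generated rather than typed. The Python sketch below reproduces the slocum pattern. The function and its parameters (step, repeats, long_hour) are names invented for this sketch, not anything smartmontools provides.

```python
def staggered_lines(devices, email, dev_type="scsi",
                    long_hour=10, step=8, repeats=3):
    """Generate smartd.conf lines in the style of the slocum example.

    Drive i (counting from 1) gets its short test during hour i and its
    long tests on days i, i + step, i + 2 * step, ... during long_hour.
    """
    lines = []
    for i, device in enumerate(devices, start=1):
        last_day = i + (repeats - 1) * step
        if i > 23 or last_day > 28:
            raise ValueError(f"{device}: hour {i} or day {last_day} out of range")
        days = "|".join(f"{i + k * step:02d}" for k in range(repeats))
        schedule = f"(S/../.././{i:02d}|L/../({days})/./{long_hour:02d})"
        lines.append(f"{device} -a -d {dev_type} -m {email} -s {schedule}")
    return lines


for line in staggered_lines([f"/dev/da{n}" for n in range(8)],
                            "dan@langille.org"):
    print(line)
```

For eight drives this prints the same schedules as the slocum file above, minus the trailing comments. The sketch does not check whether a drive's short-test hour coincides with long_hour, so that still needs a look by eye for larger drive counts.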

supernews

supernews is named after Supernews, who originally hosted this server for me.

/dev/twa0 -a -d 3ware,0 -m dan@langille.org -s (S/../.././15|L/../(01|10|19)/./01)
/dev/twa0 -a -d 3ware,1 -m dan@langille.org -s (S/../.././15|L/../(02|11|20)/./01)
/dev/twa0 -a -d 3ware,2 -m dan@langille.org -s (S/../.././15|L/../(03|12|21)/./01)
/dev/twa0 -a -d 3ware,3 -m dan@langille.org -s (S/../.././15|L/../(04|13|22)/./01)
/dev/twa0 -a -d 3ware,4 -m dan@langille.org -s (S/../.././15|L/../(05|14|23)/./01)
/dev/twa0 -a -d 3ware,5 -m dan@langille.org -s (S/../.././15|L/../(06|15|24)/./01)
/dev/twa0 -a -d 3ware,6 -m dan@langille.org -s (S/../.././15|L/../(07|16|25)/./01)
/dev/twa0 -a -d 3ware,7 -m dan@langille.org -s (S/../.././15|L/../(08|17|26)/./01)

These drives are attached to a 3Ware RAID controller. This server predates my use of ZFS.

This is similar to the previous example, but is more complex because of the 3Ware controller.

knew

knew is the new server, replacing an old one, but I didn’t want to call it new.

/dev/da0  -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(01/./01|15/./13)) # /dev/da0, SCSI device
/dev/da1  -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(02/./01|16/./13)) # /dev/da1, SCSI device
/dev/da2  -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(03/./01|17/./13)) # /dev/da2, SCSI device
/dev/da3  -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(04/./01|18/./13)) # /dev/da3, SCSI device
/dev/da4  -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(05/./01|19/./13)) # /dev/da4, SCSI device
/dev/da5  -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(06/./01|20/./13)) # /dev/da5, SCSI device
/dev/da6  -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(07/./01|21/./13)) # /dev/da6, SCSI device
/dev/da7  -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(08/./01|24/./13)) # /dev/da7, SCSI device
/dev/da8  -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(09/./01|25/./13)) # /dev/da8, SCSI device
/dev/da9  -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(10/./01|26/./13)) # /dev/da9, SCSI device
/dev/da10 -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(11/./01|27/./13)) # /dev/da10, SCSI device
/dev/da11 -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(12/./01|28/./13)) # /dev/da11, SCSI device
/dev/da12 -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(13/./01|01/./13)) # /dev/da12, SCSI device
/dev/da13 -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(14/./01|02/./13)) # /dev/da13, SCSI device
/dev/da14 -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(15/./01|03/./13)) # /dev/da14, SCSI device
/dev/da15 -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(16/./01|04/./13)) # /dev/da15, SCSI device
/dev/da16 -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(17/./01|05/./13)) # /dev/da16, SCSI device
/dev/da17 -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(18/./01|06/./13)) # /dev/da17, SCSI device
/dev/da18 -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(19/./01|07/./13)) # /dev/da18, SCSI device - known errors
/dev/da19 -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(20/./01|08/./13)) # /dev/da19, SCSI device
/dev/da20 -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(21/./01|09/./13)) # /dev/da20, SCSI device

/dev/ada0 -a -d atacam  -m dan@langille.org -s (S/../.././00|L/../(22/./01|10/./13)) # /dev/ada0, ATA device
/dev/ada1 -a -d atacam  -m dan@langille.org -s (S/../.././00|L/../(23/./01|11/./13)) # /dev/ada1, ATA device
/dev/ada2 -a -d atacam  -m dan@langille.org -s (S/../.././00|L/../(24/./01|12/./13)) # /dev/ada2, ATA device
/dev/ada3 -a -d atacam  -m dan@langille.org -s (S/../.././00|L/../(25/./01|13/./13)) # /dev/ada3, ATA device

This is my most complex configuration: 20 hard drives, including the known bad drive at da18, and four SSDs. Each gets a short test daily between 0000 and 0100. A long test is run twice a month, either between 0100 and 0200, or between 1300 and 1400. The days of the two long tests vary for each drive, so only one long test starts at any given time, but most days see two tests.

I avoid running tests during days 29-31 of the month, so as to treat all months equally.
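With this many lines, a quick script can confirm that no two drives share a long-test slot and that nothing lands on days 29-31. This is a minimal Python sketch that assumes the knew-style layout, where each long-test alternative is written as DD/./HH. The function name is invented, and only three of the lines above are pasted in as sample data.

```python
import re
from collections import defaultdict


def long_test_slots(conf_text):
    """Map (day-of-month, hour) -> devices, for knew-style long schedules."""
    slots = defaultdict(list)
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        device = line.split()[0]
        # Only the long-test alternatives have digits before "/./".
        for day, hour in re.findall(r"(\d\d)/\./(\d\d)", line):
            slots[(int(day), int(hour))].append(device)
    return slots


CONF = """\
/dev/da0  -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(01/./01|15/./13)) # /dev/da0, SCSI device
/dev/da12 -a -d scsi    -m dan@langille.org -s (S/../.././00|L/../(13/./01|01/./13)) # /dev/da12, SCSI device
/dev/ada3 -a -d atacam  -m dan@langille.org -s (S/../.././00|L/../(25/./01|13/./13)) # /dev/ada3, ATA device
"""

slots = long_test_slots(CONF)
shared = {slot: devs for slot, devs in slots.items() if len(devs) > 1}
late = sorted(slot for slot in slots if slot[0] > 28)
print("shared slots:", shared)
print("slots after day 28:", late)
```

Both results are empty for this sample. Pasting in the full file is the real test.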

Interesting ideas

As I sit here at 8:22 AM on a sunny Sunday morning, I’m thinking that the -m directive allows multiple addresses in a comma-separated form, with no spaces. I could also send email to other systems, such as Pushover.net, which I use for other services.

I’ve also just read this part of the smartd.conf documentation:

To test that email is being sent correctly, use the ‘-M test’
Directive described below to send one test email message on
smartd startup.
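As a small illustration, this Python sketch builds the mail part of a smartd.conf line: several addresses joined with commas and no spaces for -m, plus an optional -M test. The helper name is invented, and alerts@example.org is a placeholder address, not one I actually use.

```python
def mail_directives(addresses, send_test=False):
    """Build the -m (and optionally -M test) part of a smartd.conf line."""
    cleaned = [a.strip() for a in addresses]
    if not cleaned or any(not a or " " in a or "," in a for a in cleaned):
        raise ValueError("each address must be non-empty, without spaces or commas")
    directive = "-m " + ",".join(cleaned)  # comma separated, no spaces
    if send_test:
        directive += " -M test"  # one test email when smartd starts
    return directive


# alerts@example.org is a hypothetical second recipient.
print("/dev/da0 -a -d scsi "
      + mail_directives(["dan@langille.org", "alerts@example.org"],
                        send_test=True))
```

Remove -M test again once the test message has arrived, otherwise every restart of smartd sends another one.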

A custom script, from the plugin directory, can also be invoked. This directory is at /usr/local/etc/smartd_warning.d. The launching of the plugin is controlled by /usr/local/etc/smartd_warning.sh which you can override via the -M parameter.

Some examples are included at /usr/local/share/doc/smartmontools/examplescripts and I can see how they can be useful for building your own scripts.
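As a starting point for such a script, here is a minimal Python sketch that turns the warning into a one-line summary. It assumes the script sees the SMARTD_DEVICE, SMARTD_FAILTYPE, and SMARTD_MESSAGE environment variables, which is how I read the -M exec section of the smartd.conf man page. Check that documentation and the shipped example scripts before relying on it. The function name is invented.

```python
import os


def alert_summary(env=None):
    """Format a one-line alert from the SMARTD_* environment variables."""
    env = os.environ if env is None else env
    device = env.get("SMARTD_DEVICE", "unknown device")
    failtype = env.get("SMARTD_FAILTYPE", "unknown")
    message = env.get("SMARTD_MESSAGE", "").strip()
    return f"[{failtype}] {device}: {message}"


if __name__ == "__main__":
    # When run as a script, read the real environment and print the summary,
    # which could then be handed to whatever notification tool you use.
    print(alert_summary())
```

Passing a dictionary instead of the real environment makes the function easy to try out without waiting for a drive to fail.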

This sounds like a way to raise a Nagios alert. I have never used the push-type of Nagios notifications. That is something interesting to try.
