Oh oh, a sector problem

I woke up to this today. To complicate matters, this server is scheduled to be moved tomorrow morning.

This is FreeBSD 8.2-STABLE (amd64).

# grep smartd /var/log/messages
Apr 18 22:48:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 18 23:18:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 18 23:48:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 00:18:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 00:48:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 01:18:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 01:48:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 02:18:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 02:48:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 03:18:11 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 03:48:13 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 04:18:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 04:48:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 05:18:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 05:48:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 06:18:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 06:48:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 07:18:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 07:48:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 08:18:13 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 08:48:14 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 09:18:11 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 09:48:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 10:18:14 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 10:48:14 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 11:18:13 kraken smartd[1400]: Device: /dev/ada5, ATA error count increased from 0 to 5
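The log above repeats one message every 30 minutes, which makes the single new line at the end easy to miss. A minimal sketch of a filter that collapses repeated smartd messages into one line each, with a count and the first and last timestamps. The function name summarize_smartd is made up for illustration, and the sketch assumes the standard syslog layout shown above (three timestamp fields, then the host, then smartd[pid]:).

```sh
# Hypothetical helper: collapse repeated smartd messages read from stdin.
# Prints: <count>x  <first timestamp> .. <last timestamp>  <message>
summarize_smartd() {
    awk '/smartd\[/ {
        # Fields 1-3 are the syslog timestamp.
        ts = $1 " " $2 " " $3
        # Strip everything up to and including "smartd[pid]: ".
        msg = $0
        sub(/^.*smartd\[[0-9]+\]: /, "", msg)
        # Remember the order in which distinct messages first appear.
        if (!(msg in count)) { order[++n] = msg; first[msg] = ts }
        count[msg]++
        last[msg] = ts
    }
    END {
        for (i = 1; i <= n; i++) {
            m = order[i]
            printf "%dx  %s .. %s  %s\n", count[m], first[m], last[m], m
        }
    }'
}

# Usage (on the affected host):
#   grep smartd /var/log/messages | summarize_smartd
```

Fed the log above, this should reduce it to two lines: one for the repeated pending-sector message and one for the ATA error count increase.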

I guess the good news is that the pool is still ONLINE and a scrub is already repairing the disk:

# zpool status
  pool: storage
 state: ONLINE
 scan: scrub in progress since Fri Apr 19 03:11:14 2013
    3.00T scanned out of 9.72T at 95.9M/s, 20h25m to go
    103K repaired, 30.85% done
config:

        NAME                 STATE     READ WRITE CKSUM
        storage              ONLINE       0     0     0
          raidz2-0           ONLINE       0     0     0
            gpt/disk01-live  ONLINE       0     0     0
            gpt/disk02-live  ONLINE       0     0     0
            gpt/disk03-live  ONLINE       0     0     0
            gpt/disk04-live  ONLINE       0     0     0  (repairing)
            gpt/disk05-live  ONLINE       0     0     0
            gpt/disk06-live  ONLINE       0     0     0
            gpt/disk07-live  ONLINE       0     0     0

errors: No known data errors
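With seven disks in the vdev, the one line that differs is easy to overlook. A rough sketch of a filter that prints only the config rows that are not plain "ONLINE 0 0 0", that is, rows with a non-ONLINE state, a non-zero READ/WRITE/CKSUM counter, or a trailing note such as (repairing). The name zpool_flagged is hypothetical, and the sketch assumes the column layout shown above.

```sh
# Hypothetical helper: read `zpool status` output on stdin and print only
# the config rows that deserve a second look.
zpool_flagged() {
    awk '
        # The config table starts after the NAME/STATE header row.
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }
        # The "errors:" line ends the table.
        /^errors:/ { in_cfg = 0 }
        in_cfg && NF >= 5 {
            # Compare as strings so counters such as "1.2K" also count as non-zero.
            if ($2 != "ONLINE" || $3 != "0" || $4 != "0" || $5 != "0" || NF > 5)
                print
        }
    '
}

# Usage (on the affected host):
#   zpool status storage | zpool_flagged
```

Against the status output above, only the gpt/disk04-live row should come through.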

But wait, is that gpt/disk04-live really /dev/ada5? Yes. Yes it is:

$ gpart list ada5
Geom name: ada5
modified: false
state: OK
fwheads: 16
fwsectors: 63
last: 3907029134
first: 34
entries: 128
scheme: GPT
Providers:
1. Name: ada5p1
   Mediasize: 2000188135936 (1.8T)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 1048576
   Mode: r1w1e2
   rawuuid: 4ed25145-9dd7-11df-83c1-001b2151ab2d
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: disk04-live
   length: 2000188135936
   offset: 1048576
   type: freebsd-zfs
   index: 1
   end: 3906619500
   start: 2048
Consumers:
1. Name: ada5
   Mediasize: 2000398934016 (1.8T)
   Sectorsize: 512
   Mode: r1w1e3
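Reading the whole gpart listing just to confirm one label is tedious when there are several disks to check. A small sketch that reduces gpart list output to "label provider" pairs. The function name gpt_labels is invented for this example, and it assumes the output format shown above, where each provider's Name line precedes its label line. On FreeBSD, glabel status or gpart show -l should give a similar mapping without any parsing.

```sh
# Hypothetical helper: read `gpart list` output on stdin and print
# "<GPT label> <provider name>" for every partition.
gpt_labels() {
    awk '
        # "1. Name: ada5p1" -> remember the provider name.
        /^[0-9]+\. Name: / { name = $3 }
        # "   label: disk04-live" -> emit the pair.
        $1 == "label:" && name != "" { print $2, name }
    '
}

# Usage (on the affected host):
#   gpart list | gpt_labels | grep disk04-live
```

For the listing above this would print disk04-live ada5p1, which confirms the mapping from the pool member to the drive smartd is complaining about.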

After a question from Peter Wemm on Twitter, I added:

# smartctl -P show /dev/ada5
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.2-STABLE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Drive found in smartmontools Database.  Drive identity strings:
MODEL:              Hitachi HDS722020ALA330
FIRMWARE:           JKAOA28A
match smartmontools Drive Database entry:
MODEL REGEXP:       Hitachi HDS722020ALA330
FIRMWARE REGEXP:    .*
MODEL FAMILY:       Hitachi Deskstar 7K2000
ATTRIBUTE OPTIONS:  None preset; no -v options are required.

Full smartctl output follows:

# smartctl -a /dev/ada5
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.2-STABLE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K2000
Device Model:     Hitachi HDS722020ALA330
Serial Number:    JK1130YAH324ST
Firmware Version: JKAOA28A
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Apr 19 12:14:15 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (22771) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       102
  3 Spin_Up_Time            0x0007   124   124   024    Pre-fail  Always       -       609 (Average 552)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       62
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       7
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   112   112   020    Pre-fail  Offline      -       39
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       27078
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       62
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       175
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       175
194 Temperature_Celsius     0x0002   171   171   000    Old_age   Always       -       35 (Lifetime Min/Max 19/47)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       8
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 5
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5 occurred at disk power-on lifetime: 27077 hours (1128 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e 76 e2 f9 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 ce a8 3f e7 f9 40 00  26d+03:50:46.553  READ FPDMA QUEUED
  60 ce b0 71 e6 f9 40 00  26d+03:50:46.553  READ FPDMA QUEUED
  60 cd b8 a4 e5 f9 40 00  26d+03:50:46.553  READ FPDMA QUEUED
  60 ce c0 d6 e4 f9 40 00  26d+03:50:46.553  READ FPDMA QUEUED
  60 33 c8 a3 e4 f9 40 00  26d+03:50:46.552  READ FPDMA QUEUED

Error 4 occurred at disk power-on lifetime: 27077 hours (1128 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e 76 e2 f9 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 ce a8 3f e7 f9 40 00  26d+03:50:27.938  READ FPDMA QUEUED
  60 ce b0 71 e6 f9 40 00  26d+03:50:27.938  READ FPDMA QUEUED
  60 cd b8 a4 e5 f9 40 00  26d+03:50:27.937  READ FPDMA QUEUED
  60 ce c0 d6 e4 f9 40 00  26d+03:50:27.937  READ FPDMA QUEUED
  60 33 c8 a3 e4 f9 40 00  26d+03:50:27.937  READ FPDMA QUEUED

Error 3 occurred at disk power-on lifetime: 27077 hours (1128 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e 76 e2 f9 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 ce a8 3f e7 f9 40 00  26d+03:50:10.086  READ FPDMA QUEUED
  60 ce b0 71 e6 f9 40 00  26d+03:50:10.085  READ FPDMA QUEUED
  60 cd b8 a4 e5 f9 40 00  26d+03:50:10.085  READ FPDMA QUEUED
  60 ce c0 d6 e4 f9 40 00  26d+03:50:10.085  READ FPDMA QUEUED
  60 33 c8 a3 e4 f9 40 00  26d+03:50:10.085  READ FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 27077 hours (1128 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e 76 e2 f9 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 ce b0 3f e7 f9 40 00  26d+03:49:52.579  READ FPDMA QUEUED
  60 ce a8 71 e6 f9 40 00  26d+03:49:52.574  READ FPDMA QUEUED
  60 cd f0 a4 e5 f9 40 00  26d+03:49:52.574  READ FPDMA QUEUED
  61 04 b0 37 c5 71 40 00  26d+03:49:52.573  WRITE FPDMA QUEUED
  61 02 f0 0c 51 67 40 00  26d+03:49:52.572  WRITE FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 27077 hours (1128 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e 76 e2 f9 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 ce f0 d6 e4 f9 40 00  26d+03:49:34.795  READ FPDMA QUEUED
  60 33 e8 a3 e4 f9 40 00  26d+03:49:34.735  READ FPDMA QUEUED
  60 34 b0 6f e4 f9 40 00  26d+03:49:34.671  READ FPDMA QUEUED
  60 67 d0 08 e4 f9 40 00  26d+03:49:34.671  READ FPDMA QUEUED
  60 67 d8 a1 e3 f9 40 00  26d+03:49:34.668  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     24382         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
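All five entries in the error log above end with the same registers: ER 40, ST 51, SN 76, CL e2, CH f9. A hedged sketch of how those bytes can be decoded by hand. In the ATA error register, bit 6 (0x40) is UNC, an uncorrectable data error. SN, CL and CH hold the low 24 bits of the LBA. For READ FPDMA QUEUED, bits 7:3 of the Sector Count register carry the NCQ tag. The function names are made up for illustration. Note that this summary log only records 8-bit registers, so for 48-bit commands the upper LBA bytes are missing and the value below is not necessarily the full sector address. The extended error log (smartctl -l xerror, if the installed smartmontools supports it) is the place to look for that.

```sh
# Hypothetical helpers for decoding the hex register bytes printed by smartctl.

# Low 24 bits of the LBA from the SN, CL and CH registers (hex, no 0x prefix).
lba_low24() {
    echo $(( (0x$3 << 16) | (0x$2 << 8) | 0x$1 ))
}

# Succeeds if the UNC bit (0x40) is set in the error register.
is_unc() {
    [ $(( 0x$1 & 0x40 )) -ne 0 ]
}

# NCQ tag from the Sector Count register of a READ/WRITE FPDMA QUEUED command.
ncq_tag() {
    echo $(( 0x$1 >> 3 ))
}

# Usage, with the values from the log above:
#   lba_low24 76 e2 f9     # prints 16376438
#   is_unc 40 && echo UNC  # prints UNC
#   ncq_tag a8             # prints 21
```

Because every error reports the same SN/CL/CH bytes, the five errors look like repeated attempts to read the same area rather than five separate bad spots, subject to the caveat about the missing upper LBA bytes.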