Oh oh, a sector problem

I woke up to this today. Complicating matters, this server is due to be transported tomorrow morning.

This is FreeBSD 8.2.

# grep smartd /var/log/messages
Apr 18 22:48:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 18 23:18:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 18 23:48:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 00:18:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 00:48:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 01:18:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 01:48:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 02:18:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 02:48:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 03:18:11 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 03:48:13 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 04:18:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 04:48:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 05:18:10 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 05:48:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 06:18:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 06:48:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 07:18:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 07:48:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 08:18:13 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 08:48:14 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 09:18:11 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 09:48:12 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 10:18:14 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 10:48:14 kraken smartd[1400]: Device: /dev/ada5, 2 Currently unreadable (pending) sectors
Apr 19 11:18:13 kraken smartd[1400]: Device: /dev/ada5, ATA error count increased from 0 to 5
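
With one warning every half hour, the log gets long quickly. Here is a minimal sketch for condensing it, assuming the syslog line layout shown above (month, day, time, host, `smartd[pid]:`, `Device:`, device name, count). The function name `smartd_pending_summary` is my own invention, not part of smartmontools.

```shell
#!/bin/sh
# Summarize smartd "pending sectors" warnings per device.
# Reads syslog-style lines on stdin and prints one line per device.
# Assumes the field layout seen in the log above; other smartd
# messages (such as the ATA error count line) are ignored.
smartd_pending_summary() {
    awk '
        $5 ~ /^smartd\[/ && $6 == "Device:" && $9 == "Currently" {
            dev = $7
            sub(/,$/, "", dev)              # strip the trailing comma
            stamp = $1 " " $2 " " $3
            if (!(dev in first)) first[dev] = stamp
            last[dev] = stamp
            n[dev]++
            count[dev] = $8                 # most recent pending count
        }
        END {
            for (dev in n)
                printf "%s: %d warnings, first %s, last %s, latest pending count %s\n",
                    dev, n[dev], first[dev], last[dev], count[dev]
        }
    '
}
```

Usage would be something like `grep smartd /var/log/messages | smartd_pending_summary`, which prints a single line per device rather than one line per polling interval.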

I guess the good news is that the pool is still ONLINE with no known data errors, and the scrub is repairing as it goes:

# zpool status
  pool: storage
 state: ONLINE
 scan: scrub in progress since Fri Apr 19 03:11:14 2013
    3.00T scanned out of 9.72T at 95.9M/s, 20h25m to go
    103K repaired, 30.85% done
config:
 
        NAME                 STATE     READ WRITE CKSUM
        storage              ONLINE       0     0     0
          raidz2-0           ONLINE       0     0     0
            gpt/disk01-live  ONLINE       0     0     0
            gpt/disk02-live  ONLINE       0     0     0
            gpt/disk03-live  ONLINE       0     0     0
            gpt/disk04-live  ONLINE       0     0     0  (repairing)
            gpt/disk05-live  ONLINE       0     0     0
            gpt/disk06-live  ONLINE       0     0     0
            gpt/disk07-live  ONLINE       0     0     0
 
errors: No known data errors

But wait, is that gpt/disk04-live really /dev/ada5? Yes. Yes it is:

$ gpart list ada5
Geom name: ada5
modified: false
state: OK
fwheads: 16
fwsectors: 63
last: 3907029134
first: 34
entries: 128
scheme: GPT
Providers:
1. Name: ada5p1
   Mediasize: 2000188135936 (1.8T)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 1048576
   Mode: r1w1e2
   rawuuid: 4ed25145-9dd7-11df-83c1-001b2151ab2d
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: disk04-live
   length: 2000188135936
   offset: 1048576
   type: freebsd-zfs
   index: 1
   end: 3906619500
   start: 2048
Consumers:
1. Name: ada5
   Mediasize: 2000398934016 (1.8T)
   Sectorsize: 512
   Mode: r1w1e3
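
Checking the label by eye works for one disk. For several, a small filter helps. This is a rough sketch that assumes the `gpart list` layout shown above (a `Providers:` section with `Name:` and `label:` lines, followed by `Consumers:`). The function name `gpart_labels` is hypothetical.

```shell
#!/bin/sh
# Print "provider label" pairs from `gpart list` output on stdin.
# Only the Providers: section is examined, so the consumer's
# Name: line (the whole disk) is skipped.
gpart_labels() {
    awk '
        /^Providers:/ { in_prov = 1; next }
        /^Consumers:/ { in_prov = 0; next }
        in_prov && $2 == "Name:"  { name = $3 }      # e.g. "1. Name: ada5p1"
        in_prov && $1 == "label:" { print name, $2 } # e.g. "label: disk04-live"
    '
}
```

Running `gpart list ada5 | gpart_labels` against the output above should print `ada5p1 disk04-live`, confirming that gpt/disk04-live lives on ada5. I believe `glabel status` reports the same mapping in a shorter form.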

After a question from Peter Wemm on Twitter, I added the smartctl drive database lookup:

# smartctl -P show /dev/ada5
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.2-STABLE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
 
Drive found in smartmontools Database.  Drive identity strings:
MODEL:              Hitachi HDS722020ALA330
FIRMWARE:           JKAOA28A
match smartmontools Drive Database entry:
MODEL REGEXP:       Hitachi HDS722020ALA330
FIRMWARE REGEXP:    .*
MODEL FAMILY:       Hitachi Deskstar 7K2000
ATTRIBUTE OPTIONS:  None preset; no -v options are required.

Full smartctl output follows:

# smartctl -a /dev/ada5
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.2-STABLE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
 
=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K2000
Device Model:     Hitachi HDS722020ALA330
Serial Number:    JK1130YAH324ST
Firmware Version: JKAOA28A
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Apr 19 12:14:15 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (22771) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       102
  3 Spin_Up_Time            0x0007   124   124   024    Pre-fail  Always       -       609 (Average 552)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       62
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       7
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   112   112   020    Pre-fail  Offline      -       39
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       27078
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       62
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       175
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       175
194 Temperature_Celsius     0x0002   171   171   000    Old_age   Always       -       35 (Lifetime Min/Max 19/47)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       8
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
 
SMART Error Log Version: 1
ATA Error Count: 5
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
 
Error 5 occurred at disk power-on lifetime: 27077 hours (1128 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e 76 e2 f9 00
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 ce a8 3f e7 f9 40 00  26d+03:50:46.553  READ FPDMA QUEUED
  60 ce b0 71 e6 f9 40 00  26d+03:50:46.553  READ FPDMA QUEUED
  60 cd b8 a4 e5 f9 40 00  26d+03:50:46.553  READ FPDMA QUEUED
  60 ce c0 d6 e4 f9 40 00  26d+03:50:46.553  READ FPDMA QUEUED
  60 33 c8 a3 e4 f9 40 00  26d+03:50:46.552  READ FPDMA QUEUED
 
Error 4 occurred at disk power-on lifetime: 27077 hours (1128 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e 76 e2 f9 00
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 ce a8 3f e7 f9 40 00  26d+03:50:27.938  READ FPDMA QUEUED
  60 ce b0 71 e6 f9 40 00  26d+03:50:27.938  READ FPDMA QUEUED
  60 cd b8 a4 e5 f9 40 00  26d+03:50:27.937  READ FPDMA QUEUED
  60 ce c0 d6 e4 f9 40 00  26d+03:50:27.937  READ FPDMA QUEUED
  60 33 c8 a3 e4 f9 40 00  26d+03:50:27.937  READ FPDMA QUEUED
 
Error 3 occurred at disk power-on lifetime: 27077 hours (1128 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e 76 e2 f9 00
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 ce a8 3f e7 f9 40 00  26d+03:50:10.086  READ FPDMA QUEUED
  60 ce b0 71 e6 f9 40 00  26d+03:50:10.085  READ FPDMA QUEUED
  60 cd b8 a4 e5 f9 40 00  26d+03:50:10.085  READ FPDMA QUEUED
  60 ce c0 d6 e4 f9 40 00  26d+03:50:10.085  READ FPDMA QUEUED
  60 33 c8 a3 e4 f9 40 00  26d+03:50:10.085  READ FPDMA QUEUED
 
Error 2 occurred at disk power-on lifetime: 27077 hours (1128 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e 76 e2 f9 00
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 ce b0 3f e7 f9 40 00  26d+03:49:52.579  READ FPDMA QUEUED
  60 ce a8 71 e6 f9 40 00  26d+03:49:52.574  READ FPDMA QUEUED
  60 cd f0 a4 e5 f9 40 00  26d+03:49:52.574  READ FPDMA QUEUED
  61 04 b0 37 c5 71 40 00  26d+03:49:52.573  WRITE FPDMA QUEUED
  61 02 f0 0c 51 67 40 00  26d+03:49:52.572  WRITE FPDMA QUEUED
 
Error 1 occurred at disk power-on lifetime: 27077 hours (1128 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e 76 e2 f9 00
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 ce f0 d6 e4 f9 40 00  26d+03:49:34.795  READ FPDMA QUEUED
  60 33 e8 a3 e4 f9 40 00  26d+03:49:34.735  READ FPDMA QUEUED
  60 34 b0 6f e4 f9 40 00  26d+03:49:34.671  READ FPDMA QUEUED
  60 67 d0 08 e4 f9 40 00  26d+03:49:34.671  READ FPDMA QUEUED
  60 67 d8 a1 e3 f9 40 00  26d+03:49:34.668  READ FPDMA QUEUED
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     24382         -
 
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.