Recently I’ve been playing with NVMe to find out more about monitoring for wear.
Tried nvme-cli:
[17:49 r730-01 dvl ~] % nvme list
Failed to scan topology: No such file or directory
Seems it is a known problem.
Went with this instead:
[17:52 r730-01 dvl ~] % sudo nvmecontrol devlist
 nvme0: Samsung SSD 980 PRO with Heatsink 1TB
    nvme0ns1 (953869MB)
 nvme1: Samsung SSD 980 PRO with Heatsink 1TB
    nvme1ns1 (953869MB)
With more information:
[17:53 r730-01 dvl ~] % sudo nvmecontrol identify nvme0ns1
Size: 1953525168 blocks
Capacity: 1953525168 blocks
Utilization: 1797757632 blocks
Thin Provisioning: Not Supported
Number of LBA Formats: 1
Current LBA Format: LBA Format #00
Metadata Capabilities
  Extended: Not Supported
  Separate: Not Supported
Data Protection Caps: Not Supported
Data Protection Settings: Not Enabled
Multi-Path I/O Capabilities: Not Supported
Reservation Capabilities: Not Supported
Format Progress Indicator: 0% remains
Deallocate Logical Block: Read 00h
Optimal I/O Boundary: 0 blocks
NVM Capacity: 1000204886016 bytes
Globally Unique Identifier: 00000000000000000000000000000000
IEEE EUI64: 002538b22140998d
LBA Format #00: Data Size: 512 Metadata Size: 0 Performance: Best
[17:53 r730-01 dvl ~] % sudo nvmecontrol identify nvme1ns1
Size: 1953525168 blocks
Capacity: 1953525168 blocks
Utilization: 1797629704 blocks
Thin Provisioning: Not Supported
Number of LBA Formats: 1
Current LBA Format: LBA Format #00
Metadata Capabilities
  Extended: Not Supported
  Separate: Not Supported
Data Protection Caps: Not Supported
Data Protection Settings: Not Enabled
Multi-Path I/O Capabilities: Not Supported
Reservation Capabilities: Not Supported
Format Progress Indicator: 0% remains
Deallocate Logical Block: Read 00h
Optimal I/O Boundary: 0 blocks
NVM Capacity: 1000204886016 bytes
Globally Unique Identifier: 00000000000000000000000000000000
IEEE EUI64: 002538b221409d56
LBA Format #00: Data Size: 512 Metadata Size: 0 Performance: Best
Next, I got this from https://bsd.network/web/@normis@g.dodies.lv/115266285819611598:
[11:26 r730-01 dvl ~] % sudo nvmecontrol logpage -p 2 nvme0
SMART/Health Information Log
============================
Critical Warning State: 0x00
 Available spare: 0
 Temperature: 0
 Device reliability: 0
 Read only: 0
 Volatile memory backup: 0
Temperature: 314 K, 40.85 C, 105.53 F
Available spare: 100
Available spare threshold: 10
Percentage used: 0
Data units (512,000 byte) read: 10156699
Data units written: 7954064
Host read commands: 65865260
Host write commands: 154530656
Controller busy time (minutes): 162
Power cycles: 37
Power on hours: 19976
Unsafe shutdowns: 14
Media errors: 0
No. error info log entries: 0
Warning Temp Composite Time: 0
Error Temp Composite Time: 0
Temperature Sensor 1: 314 K, 40.85 C, 105.53 F
Temperature Sensor 2: 318 K, 44.85 C, 112.73 F
Temperature 1 Transition Count: 0
Temperature 2 Transition Count: 0
Total Time For Temperature 1: 0
Total Time For Temperature 2: 0
And based on https://bsd.network/web/@feld@friedcheese.us/115266300728028173 we have SMART:
[11:26 r730-01 dvl ~] % sudo smartctl -a /dev/nvme0
smartctl 7.5 2025-04-30 r5714 [FreeBSD 14.3-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 980 PRO with Heatsink 1TB
Serial Number: S6DVLJ0T207774T
Firmware Version: 4B2QGXA7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 6
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization: 920,451,907,584 [920 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 b22140998d
Local Time is: Fri Sep 26 11:29:27 2025 UTC
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057): Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f): S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.49W       -        -    0  0  0  0        0       0
 1 +     4.48W       -        -    1  1  1  1        0     200
 2 +     3.18W       -        -    2  2  2  2        0    1000
 3 -   0.0400W       -        -    3  3  3  3     2000    1200
 4 -   0.0050W       -        -    4  4  4  4      500    9500

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 41 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 10,156,699 [5.20 TB]
Data Units Written: 7,954,064 [4.07 TB]
Host Read Commands: 65,865,260
Host Write Commands: 154,530,656
Controller Busy Time: 162
Power Cycles: 37
Power On Hours: 19,976
Unsafe Shutdowns: 14
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 41 Celsius
Temperature Sensor 2: 45 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06, NSID 0xffffffff)
Self-test status: No self-test in progress
No Self-tests Logged
We have numbers here, courtesy of https://bsd.network/web/@TomAoki@bsd.cafe/115266335070733088
[11:33 r730-01 dvl ~] % sudo nvmecontrol logpage -p 2 nvme0 | fgrep written
Data units written: 7954064
[11:33 r730-01 dvl ~] % sudo nvmecontrol logpage -p 2 nvme0 | fgrep unit
Data units (512,000 byte) read: 10156699
Data units written: 7954064
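Those counts are in data units of 512,000 bytes each (per the logpage label), so they can be turned into the same TB figures smartctl prints. Here is a minimal sketch of that arithmetic; the function name data_units_to_tb is my own invention, and it assumes decimal terabytes (10^12 bytes), which is what the smartctl numbers above appear to use.

```shell
# data_units_to_tb: convert an NVMe "data units" count to decimal terabytes.
# Assumes one data unit is 512,000 bytes, as the nvmecontrol label states.
# (Hypothetical helper name, not part of nvmecontrol or smartmontools.)
data_units_to_tb() {
    awk -v units="$1" 'BEGIN { printf "%.2f\n", units * 512000 / 1e12 }'
}
```

For example, `data_units_to_tb 7954064` prints 4.07, which lines up with the "Data Units Written: 7,954,064 [4.07 TB]" line from smartctl.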
From https://bsd.network/web/@wollman@mastodon.social/115267385196485390 we have:
[11:44 r730-01 dvl ~] % sudo nvmecontrol logpage -p 2 nvme0 | grep 'Available spare'
 Available spare: 0
Available spare: 100
Available spare threshold: 10
Following on from that:
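The second one (Available spare: 100) is the one to monitor. The first is only a flag listed under Critical Warning State. When the second gets below the third value (Available spare threshold: 10), replace the drive.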
re: https://bsd.network/web/@wollman@mastodon.social/115271384378052392
That makes me think it’s an easy Nagios check to write.
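A rough sketch of what such a check might look like, under a few assumptions: it reads the nvmecontrol logpage -p 2 output on stdin, it treats the second "Available spare" line as the percentage and the "Available spare threshold" line as the limit, and it goes CRITICAL only when the spare drops below the threshold. The function name check_nvme_spare and the message wording are made up for illustration; the exit codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```shell
# check_nvme_spare: Nagios-style check of NVMe available spare.
# Reads `nvmecontrol logpage -p 2 <dev>` output on stdin.
# (Hypothetical function name; a sketch, not a finished plugin.)
check_nvme_spare() {
    awk '
        # Handle the threshold line first so it is not counted as a spare line.
        /Available spare threshold:/ { threshold = $NF; next }
        # The first "Available spare:" is the Critical Warning flag;
        # the second one is the percentage we care about.
        /Available spare:/ { if (++seen == 2) spare = $NF }
        END {
            if (spare == "" || threshold == "") {
                print "UNKNOWN - could not parse available spare"
                exit 3
            }
            if (spare + 0 < threshold + 0) {
                printf "CRITICAL - available spare %d%% is below threshold %d%%\n", spare, threshold
                exit 2
            }
            printf "OK - available spare %d%% (threshold %d%%)\n", spare, threshold
            exit 0
        }
    '
}
```

It would be fed with something like `sudo nvmecontrol logpage -p 2 nvme0 | check_nvme_spare`. How the Nagios user gets permission to run nvmecontrol (sudo or otherwise) is left open here.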