Overview
Automated monitoring alerted me that the `local` zpool in nyarlathotep was in the `DEGRADED` state, with a significant number of write errors. The `data` zpool also had a single error (a read error), but was still in the `ONLINE` state, with the specific device marked `(repairing)`.
As there were no SMART errors, and no further errors occurred (during four hours of monitoring) after cleaning dust out of nyarlathotep and checking the SATA cable connections, I judged that this was a physical connectivity issue, and not an indication of impending drive failure.
But in case it is an indication of drive failure, there are some actions (below) to lessen the potential impact.
I haven’t discovered any data loss. It’s likely that no important data was lost, as the only things which frequently write to the `local` zpool are Prometheus (for metrics) and rTorrent (for session files), both of which would be inconvenient but not catastrophic to lose.
Timeline
2021-09-01 12:00:00 BST - Alert fires
I receive an alert that the zpool health check has failed:
```
  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Wed Sep  1 00:00:01 2021
        12.2T scanned at 295M/s, 11.5T issued at 278M/s, 16.3T total
        4K repaired, 70.43% done, 05:02:21 to go
config:

        NAME                                         STATE     READ WRITE CKSUM
        data                                         ONLINE       0     0     0
          mirror-0                                   ONLINE       0     0     0
            ata-ST10000VN0004-1ZD101_ZA206882-part2  ONLINE       0     0     0
            ata-ST10000VN0004-1ZD101_ZA27G6C6-part2  ONLINE       0     0     0
          mirror-1                                   ONLINE       0     0     0
            ata-ST10000VN0004-1ZD101_ZA22461Y        ONLINE       1     0     0  (repairing)
            ata-ST10000VN0004-1ZD101_ZA27BW6R        ONLINE       0     0     0
          mirror-2                                   ONLINE       0     0     0
            ata-ST10000VN0008-2PJ103_ZLW0398A        ONLINE       0     0     0
            ata-ST10000VN0008-2PJ103_ZLW032KE        ONLINE       0     0     0

errors: No known data errors

  pool: local
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:03:02 with 0 errors on Wed Sep  1 00:03:05 2021
config:

        NAME                            STATE     READ WRITE CKSUM
        local                           DEGRADED     0     0     0
          wwn-0x5002538e3053a94f-part1  DEGRADED     0   490     0  too many errors

errors: No known data errors
```
(The running scrub is the automatic monthly scrub run at midnight on the 1st.)
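The health check itself just asks ZFS whether every pool is healthy. A minimal sketch of such a check, not necessarily exactly what my monitoring runs:

```sh
#!/bin/sh
# Minimal zpool health check: "zpool status -x" prints
# "all pools are healthy" when there is nothing to report,
# and a full status report otherwise.
if [ "$(zpool status -x)" != "all pools are healthy" ]; then
    echo "zpool health check failed:" >&2
    zpool status -x >&2
    exit 1
fi
```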
The devices are of different makes and ages, and are connected to different SATA controllers:

- `ata-ST10000VN0004-1ZD101_ZA22461Y` is a 10TB Seagate IronWolf (an HDD), is about 4 years old, and is connected to the motherboard SATA controller.
- `wwn-0x5002538e3053a94f` is a 250GB Samsung 860 EVO (an SSD), is about 1 year old, and is connected to a PCI-e SATA controller.

Additionally, they’re in different zpools holding different sorts of data. So a correlated failure looks unlikely.
2021-09-01 12:09 BST - SMART data
The SMART report for each device is fine, suggesting that the drives themselves are healthy.
For the HDD:
```
$ sudo smartctl -a /dev/disk/by-id/ata-ST10000VN0004-1ZD101_ZA22461Y
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.60] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST10000VN0004-1ZD101
Serial Number:    ZA22461Y
LU WWN Device Id: 5 000c50 0a3e77dd9
Firmware Version: SC60
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep  1 12:11:08 2021 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 874) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       188088864
  3 Spin_Up_Time            0x0003   089   086   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       172
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   095   060   045    Pre-fail  Always       -       3289159452
  9 Power_On_Hours          0x0032   062   062   000    Old_age   Always       -       33789 (68 199 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       181
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       20
189 High_Fly_Writes         0x003a   075   075   000    Old_age   Always       -       25
190 Airflow_Temperature_Cel 0x0022   059   044   040    Old_age   Always       -       41 (Min/Max 31/51)
191 G-Sense_Error_Rate      0x0032   098   098   000    Old_age   Always       -       5573
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       115
193 Load_Cycle_Count        0x0032   091   091   000    Old_age   Always       -       19545
194 Temperature_Celsius     0x0022   041   056   000    Old_age   Always       -       41 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   009   001   000    Old_age   Always       -       188088864
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       29531 (13 52 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       78126395859
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       375155158238

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%     11903         -
# 2  Short offline       Completed without error       00%     11903         -
# 3  Short offline       Completed without error       00%     11903         -
# 4  Extended offline    Aborted by host               80%     11903         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```
And for the SSD:
```
$ sudo smartctl -a /dev/disk/by-id/wwn-0x5002538e3053a94f
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.60] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 860 EVO 250GB
Serial Number:    S3YJNS0N501806D
LU WWN Device Id: 5 002538 e3053a94f
Firmware Version: RVT04B6Q
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Wed Sep  1 12:16:36 2021 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  85) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       8080
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       27
177 Wear_Leveling_Count     0x0013   094   094   000    Pre-fail  Always       -       102
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   071   054   000    Old_age   Always       -       29
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       8
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       42218038812

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```
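Rather than reading the full reports, the usual failure-predictor attributes (reallocated and pending sectors, offline-uncorrectable errors, CRC errors) can be pulled out directly with `smartctl -A`. They are all zero here; the SSD uses slightly different attribute names, but its equivalents are also zero. For the HDD:

```
$ sudo smartctl -A /dev/disk/by-id/ata-ST10000VN0004-1ZD101_ZA22461Y \
    | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count'
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
```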
2021-09-01 12:36 BST - Temporally locating the failure
Prometheus shows that the `local` pool entered the `DEGRADED` state at 04:09 UTC (05:09 BST):

[Graph: `local` zpool status over time, showing the pool entering DEGRADED]
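I believe this metric comes from node_exporter’s ZFS collector (an assumption on my part; the exact metric name varies by exporter and version). Querying it directly over the Prometheus HTTP API looks something like this:

```
# Assumes Prometheus on its default port; the metric is a gauge
# that is 1 while the pool is in the given state.
$ curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=node_zfs_zpool_state{zpool="local",state="degraded"}'
```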
Prometheus doesn’t appear to record error counts, so I can’t tell when the read error for the `data` pool appeared, or whether the errors for the `local` pool appeared in one spike or gradually.
2021-09-01 13:15 BST - Monitoring
The error counts in `zpool status` haven’t changed since the initial alert, so the plan now is to wait approximately 5 hours for the ZFS scrub to finish, periodically checking the status.

While waiting, I manually checked the following services, which all store data in the `local` zpool:
- bookdb
- bookmarks
- prometheus
- flood
- rtorrent
- grafana
- influxdb
- syncthing
All worked fine, lending support to this not being an actual disk failure.
2021-09-01 15:00 BST - More errors for the SSD
The write error count in `zpool status` for the SSD has gone up to 611, an increase of 121 since the initial alert.

In `dmesg`, there is a repeating sequence of errors for `ata9` and `sdd`, alongside a ZFS error for the SSD:
```
$ dmesg
...
[720033.773938] ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[720033.773947] ata9.00: failed command: WRITE DMA EXT
[720033.773952] ata9.00: cmd 35/00:c0:00:0b:0a/00:00:18:00:00/e0 tag 3 dma 98304 out
                         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[720033.773958] ata9.00: status: { DRDY }
[720033.773962] ata9: hard resetting link
[720034.243928] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[720034.244143] ata9.00: supports DRM functions and may not be fully accessible
[720034.247265] ata9.00: supports DRM functions and may not be fully accessible
[720034.250024] ata9.00: configured for UDMA/33
[720034.250038] sd 8:0:0:0: [sdd] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=30s
[720034.250041] sd 8:0:0:0: [sdd] tag#3 Sense Key : Illegal Request [current]
[720034.250043] sd 8:0:0:0: [sdd] tag#3 Add. Sense: Unaligned write command
[720034.250045] sd 8:0:0:0: [sdd] tag#3 CDB: Write(10) 2a 00 18 0a 0b 00 00 00 c0 00
[720034.250047] blk_update_request: I/O error, dev sdd, sector 403311360 op 0x1:(WRITE) flags 0x700 phys_seg 13 prio class 0
[720034.250057] zio pool=local vdev=/dev/disk/by-id/wwn-0x5002538e3053a94f-part1 error=5 type=2 offset=205958545408 size=98304 flags=40080c80
[720034.250067] ata9: EH complete
[720034.257336] ata9.00: Enabling discard_zeroes_data
...
```
This same pattern (with different timestamps) recurs throughout the log. The libata error messages page says this about “timeout”:

> Controller failed to respond to an active ATA command. This could be any number of causes. Most often this is due to an unrelated interrupt subsystem bug (try booting with ‘pci=nomsi’ or ‘acpi=off’ or ‘noapic’), which failed to deliver an interrupt when we were expecting one from the hardware.
`ata9` is `/dev/sdd`:
```
$ ls -l /sys/block/sd* |
    sed -e 's^.*-> \.\.^/sys^' -e 's^/host^ ^' -e 's^/target.*/^ ^' |
    while read Path HostNum ID; do
        echo ${ID}: $(cat $Path/host$HostNum/scsi_host/host$HostNum/unique_id)
    done
sda: 1
sdb: 2
sdc: 5
sdd: 9
sde: 10
sdf: 13
sdg: 17
```
And `/dev/sdd` is the SSD:
```
$ file /dev/disk/by-id/wwn-0x5002538e3053a94f
/dev/disk/by-id/wwn-0x5002538e3053a94f: symbolic link to ../../sdd
```
So these are errors for the SSD, which means something is wrong with the hardware itself or with the kernel.
2021-09-01 15:10 BST - Check hardware
I shut down nyarlathotep and check that the power and data cables for the SSD are firmly in place, and clean out excess dust. A thorough cleaning isn’t possible as I don’t have any compressed air.
2021-09-01 15:57 BST - Monitoring
I start a long SMART test:
```
$ sudo smartctl -t long /dev/disk/by-id/wwn-0x5002538e3053a94f
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.60] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 85 minutes for test to complete.
Test will complete after Wed Sep  1 17:24:13 2021 BST
Use smartctl -X to abort test.
```
And I return to periodically checking `zpool status` and `dmesg` for SSD errors.
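Nothing fancy is needed for this; a loop along these lines (a sketch, not exactly what I ran) does the job:

```sh
#!/bin/sh
# Poll for new SSD errors every five minutes: the vdev's line in
# zpool status, plus the most recent kernel messages for ata9/sdd.
while true; do
    date
    zpool status local | grep wwn-0x5002538e3053a94f
    dmesg | grep -E 'ata9|sdd' | tail -n 3
    sleep 300
done
```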
2021-09-01 16:00 BST - Uptime
The reboot reset the ZFS error counts to zero, but by looking at the uptime before the reboot we can bound the window in which the previous errors appeared:

[Graph: uptime before and after the reboot]

Previous uptime was 720870 seconds, or about 8.3 days. That window is too wide to do anything useful with.
2021-09-01 17:25 BST - SMART long test finishes
The test completes without error:
```
$ sudo smartctl -l selftest /dev/disk/by-id/wwn-0x5002538e3053a94f
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.60] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8085         -
```
There have been no errors since checking the hardware.
2021-09-01 20:10 BST - ZFS scrub finishes
The ZFS scrub finishes, with no additional errors:
```
$ zpool status
  pool: data
 state: ONLINE
  scan: scrub repaired 4K in 20:10:00 with 0 errors on Wed Sep  1 20:10:01 2021
config:

        NAME                                         STATE     READ WRITE CKSUM
        data                                         ONLINE       0     0     0
          mirror-0                                   ONLINE       0     0     0
            ata-ST10000VN0004-1ZD101_ZA206882-part2  ONLINE       0     0     0
            ata-ST10000VN0004-1ZD101_ZA27G6C6-part2  ONLINE       0     0     0
          mirror-1                                   ONLINE       0     0     0
            ata-ST10000VN0004-1ZD101_ZA22461Y        ONLINE       0     0     0
            ata-ST10000VN0004-1ZD101_ZA27BW6R        ONLINE       0     0     0
          mirror-2                                   ONLINE       0     0     0
            ata-ST10000VN0008-2PJ103_ZLW0398A        ONLINE       0     0     0
            ata-ST10000VN0008-2PJ103_ZLW032KE        ONLINE       0     0     0

errors: No known data errors

  pool: local
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:04:33 with 0 errors on Wed Sep  1 16:55:14 2021
config:

        NAME                            STATE     READ WRITE CKSUM
        local                           DEGRADED     0     0     0
          wwn-0x5002538e3053a94f-part1  DEGRADED     0     0     0  too many errors

errors: No known data errors
```
2021-09-01 20:10 BST - End
Since there have been no errors in `dmesg` or `zpool status` since checking the hardware, I think the issue has been resolved, at least for now.
The ZFS error state is cleared:
```
$ sudo zpool clear local
```
Analysis
Given the symptoms:
- Frequent ATA timeouts, which the libata error messages page says could be “any number of causes”
- No SMART failures, even after a long test
- No noticeable data corruption
all of which disappeared after cleaning out dust and checking the cable connections, I think this is more likely a transient hardware problem (probably connectivity) than an actual issue with the disk itself. The cable management inside the case isn’t great.
The fact that two disks showed errors added some confusion, but given how unrelated the disks are:
- Different models
- Different ages
- Different SATA controllers
- Different usage patterns
and that they had very different error counts (1 on the HDD vs 490 on the SSD), I think the HDD error was likely a coincidence, which `zpool scrub` repaired as it should.
I was only alerted seven hours after the `local` zpool entered the `DEGRADED` state, but that’s by design. Since the pool only has a single disk, I might not be alerted to a failure-in-progress in time to salvage things, but the pool only has a little persistent state, and most of that is backed up weekly.
Diagnosing the problem was difficult because I so rarely need to think about disk failure modes that I had to look everything up again. Also, “any number of causes” is annoyingly vague: it would be nice if the libata error documentation went into more (human-readable) detail.
On the whole, while this ended up not being a “real” problem, it was good to run through a potential disk failure scenario.
Actions
- **Get error counts into Prometheus**: This could have shed more light on the investigation: for instance, if both pools had an error spike at the same time, that would strongly suggest it’s not the drives themselves which are having issues. (See the sketch after this list.)
- **Review backup scripts for completeness**: I think everything from the `local` volume which needs to be backed up is, but I should confirm that. Most of `data` is intentionally not backed up.
- **Clean nyarlathotep and network cabinet**: nyarlathotep was very dusty. It sits on top of the network cabinet, so that’s also likely very dusty.
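As a starting point for the first action, a cron job could parse `zpool status -p` (which prints exact, parsable error counts) into a file for node_exporter’s textfile collector. This is an untested sketch; the metric names and output path are my own invention:

```sh
#!/bin/sh
# Sketch: export per-vdev zpool error counts for node_exporter's
# textfile collector. Metric names and output path are hypothetical.
OUT=/var/lib/node_exporter/textfile/zpool_errors.prom

zpool status -p | awk '
    # vdev lines look like: NAME STATE READ WRITE CKSUM [note]
    $2 ~ /^(ONLINE|DEGRADED|FAULTED)$/ && NF >= 5 {
        printf "zpool_vdev_read_errors{vdev=\"%s\"} %s\n",  $1, $3
        printf "zpool_vdev_write_errors{vdev=\"%s\"} %s\n", $1, $4
        printf "zpool_vdev_cksum_errors{vdev=\"%s\"} %s\n", $1, $5
    }
' > "$OUT.tmp" && mv "$OUT.tmp" "$OUT"
```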