How To Prepare For A Hard Disk Failure

In April 2021, my 8 TB media center hard disk suddenly stopped working. It started making the “click of death” sound while we were watching a movie, and my heart sank. I knew this could be the end of the drive’s life. I turned the media center PC off and on again, but it could no longer detect the HDD. Bad luck had struck. Fortunately, I was prepared for this kind of event, so it was not too bad. The drive contained all my important files, photos, videos, music and movies. I had purchased it in November 2016, so it lasted a good four and a half years. Not bad for a drive that had been running almost 24/7.

Thankfully, I routinely back up the data from the drive to a separate backup drive. I do this every Saturday, so at most I risk losing one week’s worth of data. Still, some data is too precious to lose even one week of. Moreover, there is a small chance that both drives fail at the same time. The chances are slim, but the possibility exists, which is why I use several different backup strategies. This is in sharp contrast to my previous post, where I preferred not to have any backups whatsoever :).

Anyway, getting back to the point, one kind of data I cannot risk losing is photos. I always back up important photos to Google Photos, so even if I lose them from both the primary and the backup drives, I still have original-quality copies online. Of course, I cannot back up my entire 8 TB of data to the cloud because it would be too expensive. So it is just photos for now.

Then I back up all important videos (usually of my kid) as private YouTube videos :). I know there is a loss in quality, but at least I will have something if my primary hard disk and my backup drive fail simultaneously. I am not even at risk of losing that one week of data, because my phone, which captures all the photos and videos, is the third backup anyway.

The next critical data is my code. I write my code on my laptop and commit it to a git repo, which is the primary. Every week I back up the code to my media center hard disk, which is the second backup. Then I have the backup of the media center drive, which is the third backup. On top of that, whenever I commit changes to the code, I run git push origin so that a copy also lives in the cloud, which is my fourth backup. So there is lots of redundancy there. I currently use private repos on Bitbucket for each of my projects.

The reason I am writing all this now is that after my HDD crashed in 2021, I purchased a new one, and it already seems to be failing! It has been just 1.5 years since I bought it, so this is plain bad luck. The drive has a 3-year limited warranty from Seagate, so if it really is failing I will have to figure out how to claim the warranty. According to Seagate, the warranty runs until 22 June 2024.

The drive is not making any clicking noise, which is a good sign, but the S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) data is suddenly showing a lot of errors. It all started when I noticed a drop in transfer rate while copying files from my laptop to the media center drive. I immediately ran smartctl -d sat -A /dev/sda and got this report:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       157994840
  3 Spin_Up_Time            0x0003   093   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       84
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   045    Pre-fail  Always       -       147834287
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       14048 (116 115 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       84
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   053   045   040    Old_age   Always       -       47 (Min/Max 46/48)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       118
193 Load_Cycle_Count        0x0032   095   095   000    Old_age   Always       -       10241
194 Temperature_Celsius     0x0022   047   055   000    Old_age   Always       -       47 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a   082   064   000    Old_age   Always       -       157994840
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       10023 (58 57 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       28623651524
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       29879414367

You will notice a large Raw_Read_Error_Rate raw value, which is matched exactly by Hardware_ECC_Recovered. It looks as though read errors are occurring and being corrected by ECC. So far there has been no data loss, but the drive's read speed is sometimes 1 MB/s and at other times 100 MB/s. You will also notice a high Seek_Error_Rate, although Reallocated_Sector_Ct is zero. Either way, I fear it is heading towards disaster. If any of you have encountered such a problem, please let me know. If all else fails, I will unfortunately have to fork out Rs. 16,000 for a new drive :(.
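One caveat before panicking over those big numbers: on Seagate drives, the raw values of Raw_Read_Error_Rate and Seek_Error_Rate are widely reported (though, as far as I know, not officially documented) to be 48-bit composites, with the upper 16 bits counting errors and the lower 32 bits counting total operations. Here is a minimal Python sketch of that decoding. The bit layout is an assumption based on that community convention, and split_seagate_raw is just a helper name I made up for illustration.

```python
def split_seagate_raw(raw_value: int) -> tuple[int, int]:
    """Split a 48-bit raw value into (error_count, operation_count).

    Assumption: upper 16 bits = errors, lower 32 bits = operations.
    This is a community-reported convention for Seagate drives,
    not an official specification.
    """
    errors = (raw_value >> 32) & 0xFFFF
    operations = raw_value & 0xFFFFFFFF
    return errors, operations


# Raw values copied from the smartctl report above.
for name, raw in [
    ("Raw_Read_Error_Rate", 157994840),
    ("Seek_Error_Rate", 147834287),
]:
    errors, operations = split_seagate_raw(raw)
    print(f"{name}: errors={errors}, operations={operations}")
```

Both raw values are smaller than 2^32, so under this interpretation the error count for each is 0 and the big number is simply the count of operations. That would at least be consistent with Reallocated_Sector_Ct being zero.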

Update 2022-12-15

It turns out the slowness is related to my hard drive being based on SMR (shingled magnetic recording) technology. After watching a video on the performance spectrum of SMR drives, I figured out that it is best to limit the data transfer rate so that the drive has time to do its shingling (if you can call it that). So now, when I transfer data to the drive, I pass --bwlimit=20M to rsync to limit the transfer rate to 20 MB/s. I guess I could try something higher, like 40 MB/s, but I want to play it safe for now.
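For anyone who wants to script this, here is a rough Python sketch that builds the throttled rsync command and estimates how long a transfer takes at a given limit. The paths are placeholders, the -a (archive) flag is my assumption about a sensible default rather than something prescribed above, and the function names are hypothetical. The sketch only prints the command; it does not run rsync.

```python
import shlex


def build_rsync_command(source: str, destination: str, bwlimit: str = "20M") -> list[str]:
    """Return an rsync argument list with a bandwidth cap.

    -a is an assumed default; --bwlimit is the option used in this post.
    """
    return ["rsync", "-a", f"--bwlimit={bwlimit}", source, destination]


def estimate_hours(total_bytes: int, mib_per_second: float) -> float:
    """Estimate transfer time in hours at a steady rate in MiB/s."""
    seconds = total_bytes / (mib_per_second * 1024 * 1024)
    return seconds / 3600


# Placeholder paths, for illustration only.
command = build_rsync_command("/path/to/backup/", "/path/to/primary/")
print(shlex.join(command))

# Example: how long would 1 TB (10**12 bytes) take at 20 MiB/s?
print(f"{estimate_hours(10**12, 20):.1f} hours")
```

The estimate is the main reason to think twice about the limit: a lower cap keeps the SMR drive happy, but restoring several terabytes at that rate takes days rather than hours.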

I checked the entire disk with sudo badblocks -b 4096 -wsv /dev/sda, which did not report any errors. Since -w runs a destructive write test that overwrites everything on the disk, I then restored all the data from my backup to the primary drive, and it seems to be working fine. I will report back if I see something alarming. I continue to see a high Raw_Read_Error_Rate, but it is matched by exactly the same number of Hardware_ECC_Recovered. The important thing to note is that Reallocated_Sector_Ct is still zero.
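To keep an eye on this without reading the whole report every time, something like the following Python sketch could parse the smartctl -A output and flag non-zero sector counters. Which attributes to watch is my own assumption (Reallocated_Sector_Ct plus the two other sector counters that appear in the report), and the function names are hypothetical. The parser locates the hexadecimal FLAG column, so it should work whether or not each row starts with a numeric ID.

```python
SECTOR_COUNTERS = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)

# A few rows copied from the report earlier in this post.
SAMPLE_REPORT = """\
Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       14048 (116 115 0)
Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def parse_smart_attributes(report: str) -> dict[str, int]:
    """Map attribute name -> leading integer of its RAW_VALUE column."""
    attributes: dict[str, int] = {}
    for line in report.splitlines():
        tokens = line.split()
        for index, token in enumerate(tokens):
            # FLAG (e.g. 0x0033) follows the name; RAW_VALUE is 7 columns later.
            if index >= 1 and token.startswith("0x") and len(tokens) > index + 7:
                raw = tokens[index + 7]
                if raw.isdigit():
                    attributes[tokens[index - 1]] = int(raw)
                break
    return attributes


def nonzero_sector_counters(attributes: dict[str, int]) -> dict[str, int]:
    """Return the watched sector counters whose raw value is above zero."""
    return {
        name: attributes[name]
        for name in SECTOR_COUNTERS
        if attributes.get(name, 0) > 0
    }


attributes = parse_smart_attributes(SAMPLE_REPORT)
print(nonzero_sector_counters(attributes) or "All watched sector counters are zero")
```

In practice the report text would come from running smartctl -d sat -A /dev/sda and capturing its output; the sample string here just keeps the sketch self-contained.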
