How To Prepare For A Hard Disk Failure
In April 2021, my 8 TB media center hard disk suddenly stopped working. It started making the “click of death” sound while we were watching a movie, and my heart sank. I knew this could be the end of life for the drive. I turned the media center PC off and on again, but it could not detect the HDD. Bad luck had struck. At least I was prepared for this kind of event, so it was not too bad. The drive contained all my important files, photos, videos, music and movies. I had purchased it in November 2016, so it lasted almost four and a half years. Not bad for a drive that had been running almost 24/7.
Thankfully, I routinely back up the data on that drive to a separate backup drive. I do this every Saturday, so at most I would lose one week’s worth of data. Still, some data is too precious to lose even for that one week. Moreover, there is a small chance that both drives fail at the same time. The odds are slim, but the possibility exists, which is why I use several different backup strategies. This is in sharp contrast to my previous post where I prefer not to have any backups whatsoever :).
Anyway, going back to the point, one kind of data I cannot risk losing is photos. I always back up important photos to Google Photos, so even if I lose them from both the primary and the backup drives, I still have original-quality copies online. Of course, I cannot back up my entire 8 TB of data to the cloud because it would be too expensive. So just photos as of now.
Then I back up all important videos (usually of my kid) as private YouTube videos :). I know there is a loss in quality, but at least I will have something if my primary hard disk and my backup drive fail simultaneously. I don’t risk losing even a week of this data, because my phone, which captures all the photos and videos, is the third backup anyway.
The next critical data is my code. I write my code on my laptop and save it to a git repo, which is the primary. Every week I back up the code to my media center hard disk, which is the second backup. Then there is the backup of the media center drive, which is the third backup. On top of that, whenever I commit changes to the code, I do a git push origin, which puts a fourth backup in the cloud. So lots of redundancy there. I currently use private repos for each of my projects on Bitbucket.
The reason I am writing all this now is that after my HDD crashed in 2021, I purchased a new one, and now it seems to be failing already! It has been just 1.5 years since I bought it, so this is plain bad luck. The drive has a 3-year limited warranty from Seagate, so if it really is failing I have to figure out how to claim the warranty. According to Seagate, the warranty runs until 22 June 2024.
The drive is not making any clicking noise, which is a good sign, but the S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) data is suddenly showing a lot of errors. It all started when I noticed a drop in transfer rate while copying files from my laptop to the media center drive. I immediately ran smartctl -d sat -A /dev/sda and got this report:
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       157994840
  3 Spin_Up_Time            0x0003   093   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       84
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   045    Pre-fail  Always       -       147834287
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       14048 (116 115 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       84
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   053   045   040    Old_age   Always       -       47 (Min/Max 46/48)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       118
193 Load_Cycle_Count        0x0032   095   095   000    Old_age   Always       -       10241
194 Temperature_Celsius     0x0022   047   055   000    Old_age   Always       -       47 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a   082   064   000    Old_age   Always       -       157994840
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       10023 (58 57 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       28623651524
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       29879414367
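If you want to keep an eye on a report like this over time, it helps to turn it into something a script can read. Here is a minimal sketch in Python, assuming the ten-column layout that smartctl -A printed above. The names SmartAttribute and parse_smart_attributes are made up for this example, and the sample text is just three rows copied from my report.

```python
from typing import Dict, NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized current value
    worst: int   # WORST column as reported by the drive
    thresh: int  # failure threshold for the normalized value
    raw: str     # vendor-specific raw value, kept as text


def parse_smart_attributes(report: str) -> Dict[str, SmartAttribute]:
    """Parse the attribute rows of a `smartctl -A` report into a dict keyed by name."""
    attributes = {}
    for line in report.splitlines():
        # Split into at most 10 columns; the raw value itself may contain spaces.
        fields = line.split(None, 9)
        # Attribute rows start with a numeric ID and have all ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attributes[fields[1]] = SmartAttribute(
            attr_id=int(fields[0]),
            name=fields[1],
            value=int(fields[3]),
            worst=int(fields[4]),
            thresh=int(fields[5]),
            raw=fields[9].strip(),
        )
    return attributes


# Three rows copied from the report above (header included to show it is skipped).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       157994840
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       14048 (116 115 0)
"""

attrs = parse_smart_attributes(SAMPLE)
for attr in attrs.values():
    print(attr.attr_id, attr.name, attr.value, attr.thresh, attr.raw)
```

The raw value is kept as a string on purpose, because some rows (like Power_On_Hours) carry extra vendor data in parentheses.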
You will notice a large Raw_Read_Error_Rate raw value, which is matched exactly by Hardware_ECC_Recovered. It seems like some reads are coming back corrupted and are being error corrected. So far there has been no loss of data; it is just that the drive’s read speed is sometimes 1 MB/s and at other times 100 MB/s. You will also notice a high Seek_Error_Rate, but Reallocated_Sector_Ct is still zero. Either way, I know it is heading towards disaster. If any of you have encountered such a problem, please let me know. If all else fails, I will unfortunately have to fork out Rs. 16,000 for a new drive :(.
It turns out the slowness is related to my hard drive being based on SMR (shingled magnetic recording) technology. After watching a video on the performance spectrum of SMR drives, I figured out that it is best to limit the data transfer rate so that the drive has time to do its shingling (if you can call it that). So now, when I transfer data to the drive, I use rsync to limit the transfer rate to 20 MBps. I guess I could try something higher like 40 MBps, but I want to play it safe for now.
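For reference, rsync does the rate limiting through its --bwlimit option, which by default takes a value in units of 1024 bytes per second. Here is a small sketch that builds such a command and estimates how long a copy would take at the capped rate. The paths are placeholders, the -a and --progress flags are my guess at a sensible default rather than the exact command I ran, and the limit is treated as 20 MiB/s.

```python
import shlex
from typing import List


def rsync_command(source: str, dest: str, limit_mib_per_s: int) -> List[str]:
    """Build an rsync argument list that caps the transfer rate."""
    # --bwlimit is in units of 1024 bytes per second when no suffix is given.
    kib_per_s = limit_mib_per_s * 1024
    return ["rsync", "-a", "--progress", f"--bwlimit={kib_per_s}", source, dest]


def transfer_hours(total_bytes: float, limit_mib_per_s: int) -> float:
    """Estimate the duration of a transfer that runs at the capped rate throughout."""
    seconds = total_bytes / (limit_mib_per_s * 1024 * 1024)
    return seconds / 3600


# Hypothetical paths, for illustration only.
cmd = rsync_command("/home/me/media/", "/mnt/mediacenter/media/", 20)
print(shlex.join(cmd))

# Rough time to copy 1 TB (10**12 bytes) at the capped rate.
hours = transfer_hours(10**12, 20)
print(f"about {hours:.1f} hours per TB")
```

The command is only printed here, not executed. You could pass the list to subprocess.run once the paths are right. The estimate makes the trade-off obvious: at this rate a full restore of a multi-terabyte drive takes days, not hours.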
I checked the entire disk with sudo badblocks -b 4096 -wsv /dev/sda, which did not report any errors. Since the -w option runs a destructive write test that erases the drive, I then restored all the data from my backup to the primary, and it seems to be working fine. Will report back if I see something alarming. I continue to see a high Raw_Read_Error_Rate, but it is matched exactly by Hardware_ECC_Recovered. The important thing to note is that Reallocated_Sector_Ct is still zero.
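The checks I keep doing by eye can be written down as a tiny script. This is only a sketch: the list of counters to watch is my own choice, not an official one, and the raw values are copied by hand from the report earlier in this post.

```python
from typing import Dict, List

# ASSUMPTION: these are the counters I personally treat as red flags when nonzero.
WATCHED_COUNTERS = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Reported_Uncorrect",
)


def nonzero_counters(raw_values: Dict[str, int]) -> List[str]:
    """Return the watched counters whose raw value is above zero."""
    return [name for name in WATCHED_COUNTERS if raw_values.get(name, 0) > 0]


def reads_fully_corrected(raw_values: Dict[str, int]) -> bool:
    """True when Raw_Read_Error_Rate and Hardware_ECC_Recovered report the same raw value."""
    return raw_values["Raw_Read_Error_Rate"] == raw_values["Hardware_ECC_Recovered"]


# Raw values copied from the smartctl report in this post.
REPORT = {
    "Raw_Read_Error_Rate": 157994840,
    "Hardware_ECC_Recovered": 157994840,
    "Reallocated_Sector_Ct": 0,
    "Current_Pending_Sector": 0,
    "Offline_Uncorrectable": 0,
    "Reported_Uncorrect": 0,
}

alarms = nonzero_counters(REPORT)
print("alarming counters:", alarms or "none")
print("read errors matched by ECC:", reads_fully_corrected(REPORT))
```

If the first line ever prints a counter name, that is my cue to stop trusting the drive and start the warranty claim.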