I think it is time to replace my SSD.
But first, I want to examine the output of smartctl; see below.
I can see one failing LBA, but I don’t know how to translate that LBA to a file within a qube.
For HDDs, I have used other tools outside of Qubes.
Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                      Power_on_Hours  Failing_LBA  NSID  Seg  SCT   Code
 0   Extended          Completed: failed segments  15177           1823627480      1    7  0x2   0x81
 1   Short             Completed without error     15176                    -      -    -    -      -
 2   Extended          Completed: failed segments  15175           1823627480      1    7  0x2   0x81
Depending on the controller’s capabilities (and firmware), this wouldn’t necessarily mean that the SSD itself is completely damaged. Sometimes bad blocks are simply remapped to overprovisioned space or spare blocks, and error counts appear and then go back down.
Try making an image of that drive.
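# conv=sync,noerror makes dd continue past read errors and pad unreadable
# blocks with zeros, so the image keeps the same size and offsets as the device.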
dd bs=4k conv=sync,noerror if=/dev/nvme0n1 of=$BACKUPLOCATION status=progress
As long as the state of that drive isn’t fully established (I wouldn’t trust it; I would replace it), I would encourage you to do any “forensic” work or recovery on/with that image.
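A minimal sketch of working from that image instead of the drive, assuming the image contains a partition table; the path and partition number are placeholders for your setup:

LOOP=$(losetup --find --show --read-only --partscan /path/to/ssd.img)
# On a default Qubes install you would first open the LUKS layer with cryptsetup.
mount -o ro "${LOOP}p3" /mnt/rescue   # adjust the partition number to your layout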
Do you have backups?
Every SSD has more space than is reported. If a memory cell fails, it is marked as defective and replaced by one from the overprovisioned area.
Reallocated Sector Count tracks that remapping.
Uncorrectable Sector Count means that there is no spare memory cell left to map in place of the failed one.
Failing_LBA shows the LBA at which the self-test failed; a “-” means the test completed without finding a failing LBA.
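Note that Reallocated Sector Count is an ATA attribute; on an NVMe drive like this one, the equivalent spare-area state is reported in the SMART/Health log instead. A quick way to check it, assuming the device is /dev/nvme0:

smartctl -a /dev/nvme0 | grep -E 'Available Spare|Percentage Used|Media and Data Integrity Errors'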
BTW, crashes led me to investigate the SSD.
@OvalZero I managed to get a backup after the crashes started. My previous one was older than I preferred.
It took a few tries before the backup worked.
With HDDs, I had figured out which file owned the bad block, forced a write to the sector, and forced a remap. It sort of worked, but the drive was failing.
I expect the same thing here.
There are still blocks available to replace the bad block, but it hasn’t happened yet.
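A minimal sketch of forcing that remap on NVMe, analogous to the HDD trick. This is my assumption, not a tested recipe: it destroys whatever is in that sector, so only do it after the backup, and it assumes the drive is formatted with 512-byte LBAs (smartctl -a shows the formatted LBA size):

LBA=1823627480   # the Failing_LBA from the self-test log
# Overwriting the sector should make the controller notice the bad block
# and, if spares remain, remap it to overprovisioned space.
dd if=/dev/zero of=/dev/nvme0n1 bs=512 seek=$LBA count=1 oflag=direct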
If it’s ext4, then you need to invoke a filesystem check with e2fsck.
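For the original question of translating the LBA to a file: a rough sketch for ext4, assuming 512-byte LBAs, a 4 KiB ext4 block size, and a filesystem sitting directly on a partition. On a default Qubes install the data lives under LUKS and LVM thin volumes, so this arithmetic does not apply directly there; the device names are placeholders:

LBA=1823627480
PART_START=$(cat /sys/block/nvme0n1/nvme0n1p3/start)   # partition start, in 512-byte sectors
FS_BLOCK=$(( (LBA - PART_START) * 512 / 4096 ))        # filesystem block containing the LBA

# Ask ext4 which inode owns that block, then resolve the inode to a path:
debugfs -R "icheck $FS_BLOCK" /dev/nvme0n1p3
debugfs -R "ncheck <inode_from_previous_output>" /dev/nvme0n1p3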
Short answer: replace immediately! SSDs are like brake pads on cars: replace them while you can still drive to the shop, even if that obviously means there is a little bit of life left in them.
Long answer:
HDDs and SSDs are radically different technologies with different operating principles (like airplanes and balloons), and therefore have different failure mechanisms. I wouldn’t apply any learning from one to the other.
SSDs have a substantially limited number of write cycles per cell. Controllers therefore move data around constantly to prevent repeated writes to the same cell. Repeatedly writing to one cell (as you would in RAM) would kill it almost instantly, since the drive is fast enough to reach the write-cycle limit very quickly.
Since the controllers move data around to prevent that, they necessarily increase the overall number of writes. Example: one file (e.g. an iso downloaded and forgotten) is written once and then never again, while an email inbox is potentially rewritten frequently, at least once for every spam mail received. In an oversimplified picture, the controller swaps these files around at more or less random intervals to equalize the load on the cells. Each swap obviously contributes more writes.
The reason why this is important: if an SSD shows any kind of weakness, I would immediately pull all data off of it and replace it with a new one. It’s also a good opportunity to get a larger one; try double the capacity (given Moore’s law, it will cost about the same as the previous, smaller one).
Why double? The fuller an SSD is, the faster it ages, simply because more writes are needed to shuffle data around (“write amplification”). It’s also a good opportunity to delete (and “trim”) that unused iso in your downloads.
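For the manual trim, something like this should work on most Linux systems; in Qubes, you would run it separately inside each qube that owns a filesystem, so treat this as a sketch rather than Qubes-specific advice:

fstrim -av   # trim all mounted filesystems that support discard, verbosely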
Since the failure mechanism is based on the severely limited write cycles of every individual cell, and the actual write counts are very statistical in nature, it is reasonable to assume that once a first cell has failed (reached its limit), many others will soon follow. Since it would be too expensive to mark individual cells as defective, the controller has to mark bigger blocks or chunks, thus decreasing the capacity, thus increasing the relative fill percentage, thus accelerating all of these effects.
While there will be plenty of people who know more and better and will criticize this post, one thing is obvious: anybody who even knows about this forum is paranoid about their data. This is a good point to be paranoid about, and a new SSD is way cheaper than many other topics discussed here.
I view SSDs as a “cheap” consumable, like brake pads on my car. I don’t exchange the brake pads after the brakes have failed and I have died, unable to slow a hill descent, but significantly before that. Conveniently, brake pads display their life status as a measurable thickness; with SSDs we have to rely on tools like S.M.A.R.T., but even there I would stay away from the “edge” and err on the side of caution. Brake pads are also replaced before their remaining thickness hits 0.01 mm, well before that fatal failed hill descent.