Increasing Unsafe Shutdowns count in SMART for SSD.
I think Qubes OS is not shutting down the computer properly for me, or maybe for everybody: in the SMART diagnostics of my NVMe SSD I see a non-zero and increasing "Unsafe Shutdowns" field. I see this on different SSDs with Qubes OS (at least R4.2 and R4.1).
That could mean that Qubes OS does not send a "prepare for shutdown" command to the SSD, so the SSD does not flush its internal data buffers before being powered off. Even if the user calls sync in dom0 to flush data to disk, the data may still be cached and buffered by the SSD internally, and could be lost if the drive is powered off without such a command.
Is the hypothesis about the problem right, and data can really be lost or corrupted?
Do other users have a significant number of Unsafe Shutdowns in SMART? Please report yours. You can check it with this command in dom0:
sudo smartctl -a /dev/nvme0 | grep Unsafe
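To watch whether the count actually changes across your own shutdowns, you could log just the number over time. A small sketch (the helper name is mine, and the sample line stands in for real smartctl output, whose exact spacing varies by version):

```shell
# Hypothetical helper: extract just the number from smartctl's output so it
# can be logged before a shutdown and compared after the next boot.
# Real use:  sudo smartctl -a /dev/nvme0 | extract_unsafe_shutdowns
extract_unsafe_shutdowns() {
    awk -F: '/Unsafe Shutdowns/ {gsub(/[ ,]/, "", $2); print $2}'
}

# Illustrative sample line instead of the real command:
echo 'Unsafe Shutdowns:                   123' | extract_unsafe_shutdowns
# prints 123
```

Writing this number to a file in dom0 right before a clean shutdown and checking it again after boot would show whether that particular shutdown incremented it.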
How can this problem be fixed or at least diagnosed?
I cannot reproduce this behavior.
Tried "Shutdown", tried "Log Out → Restart" from the top right-hand menu; the "Unsafe shutdowns" count stays the same.
The only way to increase the "Unsafe shutdowns" count, on my hardware, is to trigger a Restart and then press the power button while the BIOS splash screen is shown (after Qubes OS has already shut down correctly). But this is not something I usually do; I tried it only to see whether the count would increase.
sync(), fsync() etc. are supposed to truly persist the data to disk:
[...] sync() or syncfs() provide the same guarantees as fsync() called on every file in the system or filesystem respectively.
fsync() transfers ("flushes") all modified in-core data of (i.e.,
modified buffer cache pages for) the file referred to by the file
descriptor fd to the disk device (or other permanent storage
device) so that all changed information can be retrieved even if
the system crashes or is rebooted. This includes writing through
or flushing a disk cache if present. The call blocks until the
device reports that the transfer has completed.
If it doesn’t persist, AFAIK that means the disk is buggy. (Also, some manufacturers probably have figured out that they can get great performance in benchmarks with one simple trick: lying)
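As a concrete illustration of the userspace side of this (filenames are just examples): dd's conv=fsync flag calls fsync() on the output file before dd exits, so the write has to reach the device, with the cache-flushing semantics the man page above describes, before the command returns.

```shell
# Write a file and force it through the kernel page cache before dd exits.
# conv=fsync makes dd call fsync() on the output file, which includes
# writing through or flushing the disk cache per fsync(2).
printf 'important data\n' > /tmp/demo-src.txt
dd if=/tmp/demo-src.txt of=/tmp/demo-synced.txt conv=fsync status=none
cat /tmp/demo-synced.txt
# prints: important data
```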
@barto@rustybird Thank you. What count do you have in SMART? I have 100+. I did not hard-reset 100+ times. I searched Qubes OS forum for “Unsafe Shutdowns” and people on the forum also have like hundreds of unsafe shutdowns.
Well, thank you for this information. I would be happy if this were part of the NVMe/PCIe protocol used by NVMe SSDs.
I mean, there is presumably some command to achieve a safe shutdown, so it would be great if sync issued a command guaranteed to have the same cache-flushing behavior.
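As far as I understand, this does exist at the protocol level: NVMe defines a shutdown-notification handshake (the host sets the CC.SHN field in the controller configuration register and waits for CSTS.SHST to report shutdown complete), which the Linux nvme driver performs on a clean shutdown; the SMART counter increments when power is lost without it. nvme-cli reports the same counter in its smart-log output; a small sketch of pulling the number out of that format (the sample line and count are illustrative):

```shell
# Parse the unsafe_shutdowns line as nvme-cli prints it in its smart-log
# output. The echoed sample stands in for the real command, which needs
# hardware access:  sudo nvme smart-log /dev/nvme0
echo 'unsafe_shutdowns                    : 1,234' \
    | awk -F: '/unsafe_shutdowns/ {gsub(/[ ,]/, "", $2); print $2}'
# prints 1234
```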
I have 300+ “unsafe shutdowns”
The hardware is old (around 5 years) and it may have been used with other operating systems (Windows 10 and RedHat) before it came into my possession.
On the other hand, as I wrote, the count does not increase during normal shutdowns and reboots of QubesOS.
It’s a bit less here, and my count didn’t increase either on clean Qubes OS shutdown the last time I checked. I’m guessing it would increase for hard powering off during BIOS initialization or early boot, which I’ve done a lot.
I have one drive with 609 unsafe shutdowns, and 2688 power cycles.
I’ve not had any issues with data being corrupted, so I don’t think it’s something you need to worry about.
It probably just means the drive wasn't notified before a shutdown happened; it's not the same as the drive being powered off with unsaved DRAM cache.
Well, OK, let's assume these Unsafe Shutdowns usually don't cause any data loss (hopefully never).
But what’s the reason anyway?
We have this sample here: 100+, 100-, 300+, 600+ (wow).
I searched for random SMART reports on the forum, and they also show counts of 100+.
Several ideas:
Hard resets? No, not plausible: I have this on an almost new SSD, and there were no hard resets in such amounts; the number is closer to the power-cycle count.
Kernel or other Qubes OS updates? No, not in my case: I did not update that many times.
Double boot? No, then the count would be similar to the number of power cycles, or half of it.
Hard powering off during BIOS initialization or early boot? No, that would leave my count far below 100.
Caused by other operating systems (Windows 10 and GNU/Linux)? I think that's also not my case.
Any other ideas?
I have only one suggestion: that it's some kind of race condition, and some Qubes OS shutdowns trigger it anyway. Otherwise, what could it be?
Since the kernel should already be doing this, and it appears to work at least some of the time for some people, have you tried switching dom0 to kernel-latest (maybe it works around a hardware quirk) and/or updating your drive firmware (maybe it fixes a bug)?
I don’t think we have representative data on this. But assuming that this shutdown command is indeed not always sent for some or even many people, on the surface that still doesn’t seem specific to Qubes OS.
Sounds wrong to me. If the kernel has flushed data to disk, AFAIK the disk is required to have persisted it. Although a clean shutdown could still be a worthwhile fail-safe in case of buggy/lying disk firmware.
Is the hypothesis about the problem right, and data can really be lost or corrupted?
Sounds unlikely, as others have explained.
Do other users have a significant number of Unsafe Shutdowns in SMART? Please report yours. You can check it with this command in dom0:
sudo smartctl -a /dev/nvme0 | grep Unsafe
71
How can this problem be fixed or at least diagnosed?
Perhaps you can try some simple testing, e.g. write some unimportant data, shut down, then verify the data after boot. I would also look at the logs (or even increase logging somehow).
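A sketch of such a test (the paths are just examples): write random data, record its checksum, sync, shut down cleanly, and re-verify after the next boot.

```shell
# Write a known payload and record its checksum before shutting down.
testfile=/tmp/persist-test.bin
dd if=/dev/urandom of="$testfile" bs=1M count=4 status=none
sync                                   # flush dirty pages to the device
sha256sum "$testfile" > /tmp/persist-test.sha256

# ...shut down and boot again, then verify the data survived:
sha256sum -c /tmp/persist-test.sha256
# should report: /tmp/persist-test.bin: OK
```

Note that /tmp is usually a tmpfs, so for the real test the files would need to live on the SSD-backed filesystem you want to exercise.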