Solving (?) one type of frozen disk problem

After a kernel upgrade, my SSD started going read-only. Here’s what seems to have solved the problem (all fine now after more than a day of intensive use, whereas previously it failed within a few minutes):

Adding this kernel boot option:

nvme_core.default_ps_max_latency_us=0

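Explanation: Apparently some combinations of kernels and NVMe SSDs cause problems unless the drive’s power-saving feature (APST, Autonomous Power State Transition) is disabled. Setting the maximum allowed latency to 0, as above, disables it.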
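To check after a reboot that the option really took effect, something like the following might help. It is only a minimal sketch: the function names are made up for this post, the whitespace split ignores quoted values that contain spaces, and it assumes the kernel exposes the usual /proc/cmdline and /sys/module/nvme_core/parameters/ files.

```python
from pathlib import Path

PARAM = "nvme_core.default_ps_max_latency_us"


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None.

    Simplification: quoted values containing spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def apst_disabled(cmdline: str) -> bool:
    """True if the boot line sets the NVMe max power-save latency to 0."""
    return parse_cmdline(cmdline).get(PARAM) == "0"


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    if proc.exists():
        print("boot option present:", apst_disabled(proc.read_text()))
    # The value the nvme_core module is actually using, if it is exposed.
    sysfs = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")
    if sysfs.exists():
        print("value in effect:", sysfs.read_text().strip())
```

How the option gets onto the boot line depends on your boot loader and distribution, so that part is left out here.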

Error log

Since the SSD went read-only, no log messages related to the freeze got saved; they were all lost after reboot. Using sudo or switching to root to run journalctl or dmesg didn’t work either, because of read-only disk errors. So, to figure out what was happening, I ran journalctl -f and dmesg -Hw to start tailing the system logs before the error happened, and then started some apps that accessed the SSD intensively. When the disk froze, I could switch to the console running journalctl and dmesg and take a photo of the logs. Here are the logs, in case anyone wants to compare:

In text (OCR-transcribed from the photo, so there could be typos):

nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
nvme 0000:04:00.0: enabling device (0000 -> 0002)
xen: registering gsi 16 triggering 0 polarity 1
Already setup the GSI :16
nvme nvme0: Removing after probe failure status: -19
nvme0n1: detected capacity change from ... to 0
EXT4-fs warning (device dm-4): ... I/O error 10 writing to inode ... starting block
Buffer I/O error on device dm-4, logical block 
blk_update_request: I/O error, dev nvme0n1, sector ...
...
device-mapper: thin: process_cell: dm_thin_find_block() failed: error= -5
...
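
If you want to compare your own log lines against the ones above, a rough filter like this could do it. It is just a sketch: the signature names and regular expressions are my own guesses based on the three nvme lines in this log, and other kernel versions may word the messages differently.

```python
import re

# Hypothetical signature set, derived only from the log lines in this post.
SIGNATURES = {
    "controller down": re.compile(r"nvme\d+: controller is down; will reset"),
    "probe failure": re.compile(r"nvme\d+: Removing after probe failure"),
    "capacity to 0": re.compile(
        r"nvme\d+n\d+: detected capacity change from .* to 0\b"
    ),
}


def match_signatures(lines):
    """Return the names of all signatures that occur in the given log lines."""
    found = set()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                found.add(name)
    return found


if __name__ == "__main__":
    sample = [
        "nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10",
        "nvme nvme0: Removing after probe failure status: -19",
        "nvme0n1: detected capacity change from ... to 0",
    ]
    print(sorted(match_signatures(sample)))
```

match_signatures accepts any iterable of strings, so the lines could also come from a text file with your own transcribed log.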

Links

If you websearch for nvme nvme0 "controller is down" "will reset" "detected capacity change" "to 0"

you’ll find lots of things to read. I found this blog post helpful:

https://tekbyte.net/fixing-nvme-ssd-problems-on-linux/ (the same solution worked for him)

In someone else’s case, the PSU (power supply unit) was the problem:
Samsung SSD 980 NVMe controller shuts down : linuxhardware

A kernel patch with lots of links to this problem reported elsewhere:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/commit/?id=47add9f75714fabd3702dca0e5899a56d2f3ee2f

And a nice overview of /dev/nvme0, nvme0n1, nvme0n1p1, p2, p3 and what they are:

https://wiki.archlinux.org/title/Solid_state_drive/NVMe

What the other boot options mean: dracut.cmdline(7) - Linux manual page (these are the boot options you’ll see on the same boot line as nvme_core....)
