NVMe controller drops and qubes_dom0 pool disappears under load

I want to report a bug but don’t want to create a GitHub account. I couldn’t fix it after a day of troubleshooting. Details:

Qubes version: 4.2.4
Symptom: The pool qubes_dom0 disappears, NVMe controller goes down, and qubes begin crashing.
Trigger: Happens under load (higher load → faster failure); does not occur when system is idle.
Changes: No configuration or setup changes were made. I think it started around the last dom0 update (maybe just before or after).

dmesg:

I think I’ve had the same thing happen to me.

It only happens with an NVMe drive I use to run HVM qubes; the drive is a Samsung 990 Pro.

When the controller crashes, the system needs to be fully shut down and cold booted. A normal (warm) reboot will not fix the issue.

Happened again today:

[10777.734094] nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[10777.734105] nvme nvme2: Does your device have a faulty power saving mode enabled?
[10777.734109] nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[10777.775654] nvme 0000:05:00.0: Unable to change power state from D3cold to D0, device inaccessible
[10777.775707] xen: registering gsi 46 triggering 0 polarity 1
[10777.775723] Already setup the GSI :46
[10777.775780] nvme nvme2: Disabling device after reset failure: -19
[10777.785196] I/O error, dev nvme2n1, sector 746662192 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 2
[10777.785200] I/O error, dev nvme2n1, sector 110913552 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[10777.785241] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
[10777.785272] kworker/u64:4: attempt to access beyond end of device
               nvme2n1: rw=0, sector=3906815048, nr_sectors = 8 limit=0
[10777.785285] kworker/u64:4: attempt to access beyond end of device
               nvme2n1: rw=0, sector=3906939248, nr_sectors = 8 limit=0
...
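The kernel log above explicitly suggests disabling NVMe and PCIe power saving. As a sketch of how those parameters could be applied in dom0 (this assumes a standard Qubes 4.x dom0 booting via GRUB2; the path to `grub.cfg` differs on UEFI systems, e.g. `/boot/efi/EFI/qubes/grub.cfg`):

```shell
# In dom0: append the workarounds suggested by the kernel to the boot line.
# Edit /etc/default/grub and add to GRUB_CMDLINE_LINUX:
#   nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off
# Then regenerate the GRUB configuration:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg

# After a full shutdown and cold boot, verify the parameters took effect:
cat /proc/cmdline
```

This trades power efficiency for stability (the drive will no longer enter its deeper power-saving states), but it is a reversible software-side test before resorting to firmware-level changes.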

Desktop or laptop?

People have a lot of problems with the 990 Pro.
It looks like a firmware bug.
The only workaround so far is to set the drive to “Full Performance mode” with Samsung Magician, which voids your warranty, uses more power, heats the drive, and shortens its life, but the disconnections should stop.

It’s a desktop system.

It might just be a coincidence, but the issue started after I upgraded to 4.3; I never had this issue when I used the drive with 4.2.

Maybe there was a firmware update? The latest 990 Pro firmware:

(7B2QJXD7) To address the intermittent non-recognition and blue screen issue. (Release: September 2025)

You can check the firmware version with:

sudo fwupdtool get-devices
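If `nvme-cli` is available in dom0 (an assumption; it may need to be installed first), the firmware revision can also be read directly from the controller:

```shell
# The "FW Rev" column lists the running firmware for each NVMe device.
sudo nvme list

# Or query a single controller; the "fr" field is the firmware revision.
sudo nvme id-ctrl /dev/nvme2 | grep -i '^fr '
```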