NVIDIA driver in HVM qube stops working when increasing RAM [NVRM: fallen off the bus and is not responding to commands]

I am observing some strange behavior using an NVIDIA GPU (an RTX-series card) on Qubes 4.1 with both Debian 11 and Fedora 36. Specifically, I have set up HVMs with my NVIDIA graphics card attached as a PCI device, and I've disabled memory balancing as recommended for this configuration.
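For reference, the dom0 side of the setup looks roughly like this (gpu-hvm is a placeholder for the qube name and 01_00.0 for the card's BDF; the qvm-pci options shown are the ones commonly suggested for GPU passthrough, so adjust to your hardware):

qvm-prefs gpu-hvm virt_mode hvm
# setting maxmem to 0 takes the qube out of memory balancing
qvm-prefs gpu-hvm maxmem 0
qvm-pci attach --persistent gpu-hvm dom0:01_00.0 --option permissive=True --option no-strict-reset=True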

Then I installed the NVIDIA drivers for each OS using their package managers. For Debian, this involved installing the nvidia-driver package (version 470.161.03-1), which is known to work with my card; for Fedora, one first needs to add NVIDIA's official repository and install the driver from there.
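Roughly, the install steps were along these lines (repository setup omitted; the Fedora module stream name depends on the driver branch you pick):

# Debian 11, with the non-free component enabled
sudo apt install nvidia-driver

# Fedora 36, after adding NVIDIA's repository
sudo dnf module install nvidia-driver:latest-dkms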

On my machine, with 2000 MB (or less) of initial memory (max memory is disabled because the VM is not included in memory balancing), the nvidia-smi command works fine. The strange thing, however, is that increasing the RAM to 2500 MB (or more) results in an error loading the NVIDIA driver, and nvidia-smi stops working.
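The only thing I change between boots is the memory preference in dom0, for example (gpu-hvm again being a placeholder):

# working case
qvm-prefs gpu-hvm memory 2000
qvm-start gpu-hvm

# failing case
qvm-shutdown --wait gpu-hvm
qvm-prefs gpu-hvm memory 2500
qvm-start gpu-hvm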

Specifically, I see the following error messages in the journald logs generated at boot (filtered to only the NVIDIA-related entries; a filtering command is shown after the excerpt):

nvidia: module license 'NVIDIA' taints kernel.
nvidia: loading out-of-tree module taints kernel.
Disabling lock debugging due to kernel taint
nvidia: module verification failed: signature and/or required key missing - tainting kernel
nvidia-nvlink: Nvlink Core is being initialized, major device number 236
nvidia 0000:00:06.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
     NVRM: The NVIDIA GPU 0000:00:06.0
     NVRM: (PCI ID: 10de:249d) installed in this system has
     NVRM: fallen off the bus and is not responding to commands.
nvidia: probe of 0000:00:06.0 failed with error -1
     NVRM: The NVIDIA probe routine failed for 1 device(s).
     NVRM: None of the NVIDIA devices were initialized.
nvidia-nvlink: Unregistered the Nvlink Core, major device number 236
modprobe: ERROR: could not insert 'nvidia_current': No such device
Error running install command 'modprobe -i nvidia-current ' for module nvidia: retcode 1
nvidia-nvlink: Nvlink Core is being initialized, major device number 236
nvidia 0000:00:06.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
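For reference, the excerpt above was pulled inside the qube with a filter roughly like:

sudo journalctl -b | grep -iE 'nvidia|nvrm'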

This issue causes nvidia-smi to output the following message:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Furthermore, the NVIDIA GPU device file does not exist at the standard location of /dev/nvidia0 and the nvidia driver is not listed as a loaded kernel module.
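The quick checks I use for this are:

ls -l /dev/nvidia*        # "No such file or directory" in the failing case
lsmod | grep -i nvidia    # prints nothing in the failing case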

On the other hand, with the smaller amount of memory, the NVIDIA driver loads successfully and produces the following initial logs instead:

nvidia: loading out-of-tree module taints kernel.
nvidia: module license 'NVIDIA' taints kernel.
nvidia: module verification failed: signature and/or required key missing - tainting kernel
nvidia-nvlink: Nvlink Core is being initialized, major device number 238
nvidia 0000:00:06.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.161.03
[drm] [nvidia-drm] [GPU ID 0x00000006] Loading driver
[drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:06.0 on minor 0
Inserted module 'nvidia_drm'

Similarly, nvidia-smi produces:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:00:06.0 Off |                  N/A |
| N/A   59C    P8    12W /  N/A |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The device file /dev/nvidia0 also exists and the following NVIDIA kernel modules are loaded:

nvidia_drm
nvidia_modeset
nvidia

The main issue appears to be that increasing the RAM of the HVM causes the GPU to fall off the PCI bus at boot, leaving it unresponsive when the driver probes it. I've even tried adding the following kernel options to the qube (which uses kernel version 5.15.89-1.fc32), but this did not help:

pci=check_enable_amd_mmconf idle=nowait pcie_aspm=off
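These were set from dom0 via the qube's kernelopts property, along the lines of (this assumes the qube boots a dom0-provided kernel, which mine does given the fc32 kernel version):

qvm-prefs gpu-hvm kernelopts "pci=check_enable_amd_mmconf idle=nowait pcie_aspm=off"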

Any idea why this behavior is being observed and how to fix it? Is this a common problem when using NVIDIA graphics cards? Without adequate RAM, I cannot run the kinds of programs I’m interested in for GPU computing.

Additionally, I haven’t found a way to get the qube to use swap space. It seems that swap space is completely ignored?

Did you check whether that solves your issue? (the TOLUD issue)

Thank you for that reference. I tried some of the recommended configurations, but I was not able to resolve the issue. I should note that the GPU works with a smaller amount of RAM and stops working once the RAM is increased beyond a certain threshold. I do not know what would cause this kind of behavior, as I would imagine the RAM configuration is independent of the GPU configuration. This makes me think there is something Qubes-related going on, though I could be mistaken.

A workaround for others who encounter this issue is to keep the RAM below the upper limit (~2 GB in my case) and add more swap space instead, following the discussion in this GitHub issue. I was not aware of this method before.
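As a rough illustration (this is the generic swap-file approach; the linked issue may describe a different method), inside the qube:

# size is arbitrary; pick whatever your workload needs
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile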

You may need to adjust the size of your VM's volatile volume (using qvm-volume) to use more than the default (10 GB).
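For example, from dom0 (20G is just an illustrative size, and gpu-hvm a placeholder qube name):

qvm-volume resize gpu-hvm:volatile 20G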