Possible to Fix 2048MB RAM Limit for NVIDIA GPU Passthrough to HVM?

Peter45 · May 25, 2023, 12:04am

Greetings. I have successfully passed my NVIDIA GPU through to both a Debian 11 and a Windows 11 guest. With 2048MB of RAM, both guests can use the GPU with the NVIDIA drivers. However, when I try to increase the RAM to even 1MB above 2048MB, the Windows 11 HVM no longer boots and fails with the SeaBIOS error “No bootable device”. The Debian 11 HVM still boots, but the driver does not work and dmesg says the GPU has “fallen off the bus.”

My issue as described above seems to be almost identical to this post, although no solution was found there.

I have followed the guide by neowutran and tried patching stubdom-linux-rootfs as well as the alternative method mentioned in the guide; neither helped. I also ran “qubes-dom0-update --action=downgrade xen-hvm-stubdom-linux”, which did not seem to help either.

I am running Qubes OS 4.1.2. I have also updated the kernel to the latest version to fix another issue, but the HVMs are set to use their own kernel anyway. The Windows 11 guest was made using the qvm-create-windows-qube tool by elliotkillick from GitHub, but it had the same problems even when I made it without that tool.

I have been looking everywhere I can for a solution and haven’t been able to find anything, so I am wondering if it is even possible to do this in the current version of Qubes or if there is work being done to make it possible. Any ideas for possible solutions or even a clearer picture of whether or not this is possible now would be very very helpful.

disp6252 · May 25, 2023, 5:26am

Did you try downgrading to version 1.2.3-1.fc32 and applying the patch to it?
qubes-dom0-update --action=downgrade xen-hvm-stubdom-linux-1.2.3-1.fc32 xen-hvm-stubdom-linux-full-1.2.3-1.fc32

Peter46 · May 26, 2023, 4:44am

Thank you for the suggestion! You’re right, I forgot to specify the version to downgrade to. I tried your command along with a clean installation (I also forgot to save the password for my original forum account, which I stored in my vault, which is why I am responding with a new one). Although there is no longer a hard limit at 2048MB, the VMs either behave strangely or don’t work at all, even if I increase the RAM only a little beyond 2048MB.

My Windows 11 VM refuses to start with more than about 2200MB. It tells me the guest hasn’t started the display yet and never does anything. With amounts above 2048MB but below 2200MB, there are some screen artifacts, and sometimes it shuts down almost immediately and shows the notification “qrexec-daemon.c:135:sigchld_parent_handler: Connection to the VM failed”.

My Debian 11 guest boots and the GPU works fine for values slightly above 2048MB. However, at 2100MB it tells me it failed to connect to the qrexec agent after 60 seconds and shuts down without showing me a terminal. At 2200MB, libxenlight fails to create a new domain. For much larger values like 5000MB, it boots but dmesg tells me the GPU has fallen off the bus.

I don’t know exactly what to look for in dmesg, but here are some potentially interesting things (From a time when I used 2049MB):

[    0.293682] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.

[    0.427063] pci 0000:00:06.0: can't claim BAR 3 [mem 0x6280000000-0x6281ffffff 64bit pref]: no compatible bridge window
[    0.427313] pci 0000:00:01.1: can't claim BAR 4 [io  0xc200-0xc20f]: address conflict with 0000:00:06.0 [io  0xc200-0xc27f]

[    2.129298] xenbus_probe_frontend: Device with no driver: device/vbd/51712
[    2.129299] xenbus_probe_frontend: Device with no driver: device/vbd/51728
[    2.129300] xenbus_probe_frontend: Device with no driver: device/vbd/51744
[    2.129300] xenbus_probe_frontend: Device with no driver: device/vif/0

[    2.236106] piix4_smbus 0000:00:01.3: SMBus Host Controller not enabled!

[    3.512490] lp: driver loaded but no devices found
[    3.515298] systemd[1]: modprobe@drm.service: Succeeded.
[    3.515462] systemd[1]: Finished Load Kernel Module drm.
[    3.515988] systemd[1]: Finished Create System Users.
[    3.516484] systemd[1]: Starting Create Static Device Nodes in /dev...
[    3.517048] ppdev: user-space parallel port driver
[    3.519177] systemd[1]: Started Journal Service.
[    3.524701] systemd-journald[283]: Received client request to flush runtime journal.
[    3.585527] nvidia: loading out-of-tree module taints kernel.
[    3.585536] nvidia: module license 'NVIDIA' taints kernel.
[    3.585537] Disabling lock debugging due to kernel taint
[    3.597098] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    3.615805] nvidia-nvlink: Nvlink Core is being initialized, major device number 248

[    3.617089] xen: --> pirq=16 -> irq=40 (gsi=40)
[    3.618276] nvidia 0000:00:06.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    3.669032] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input4
[    3.669538] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  470.182.03  Fri Feb 24 03:29:56 UTC 2023
[    3.682173] ACPI: Power Button [PWRF]
[    3.682222] input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input5
[    3.682246] ACPI: Sleep Button [SLPF]
[    3.698689] input: PC Speaker as /devices/platform/pcspkr/input/input6
[    3.725275] bochs-drm 0000:00:03.0: vgaarb: deactivate vga console
[    3.726201] Console: switching to colour dummy device 80x25
[    3.726351] [drm] Found bochs VGA, ID 0xb0c5.
[    3.726352] [drm] Framebuffer size 16384 kB @ 0x83000000, mmio @ 0x85094000.
[    3.729660] [TTM] Zone  kernel: Available graphics memory: 993270 KiB
[    3.729660] [TTM] Initializing pool allocator
[    3.729662] [TTM] Initializing DMA pool allocator
[    3.729888] [drm] Initialized bochs-drm 1.0.0 20130925 for 0000:00:03.0 on minor 0
[    3.730536] fbcon: bochs-drmdrmfb (fb0) is primary device
[    3.733584] Console: switching to colour frame buffer device 128x48
[    3.736179] bochs-drm 0000:00:03.0: [drm] fb0: bochs-drmdrmfb frame buffer device
[    3.775196] xen: --> pirq=17 -> irq=45 (gsi=45)
[    3.775291] snd_hda_intel 0000:00:07.0: Disabling MSI
[    3.786299] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.182.03  Fri Feb 24 03:18:06 UTC 2023
[    3.813416] [drm] [nvidia-drm] [GPU ID 0x00000006] Loading driver
[    3.813417] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:06.0 on minor 1
[    3.817135] xen:xen_evtchn: Event-channel device installed
[    3.830096] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:07.0/sound/card0/input7
[    3.830424] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:07.0/sound/card0/input8
[    3.830756] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:07.0/sound/card0/input9
[    3.830793] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:07.0/sound/card0/input10
[    3.830827] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:07.0/sound/card0/input11
[    3.882727] memmap_init_zone_device initialised 32768 pages in 0ms
[    4.024608] EXT4-fs (xvdb): mounted filesystem with ordered data mode. Opts: discard
[    4.088836] audit: type=1400 audit(1685073843.476:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=515 comm="apparmor_parser"
[    4.088937] audit: type=1400 audit(1685073843.476:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=516 comm="apparmor_parser"
[    4.088940] audit: type=1400 audit(1685073843.476:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=516 comm="apparmor_parser"
[    4.091551] audit: type=1400 audit(1685073843.480:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=518 comm="apparmor_parser"
[    4.091554] audit: type=1400 audit(1685073843.480:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=518 comm="apparmor_parser"
[    4.091555] audit: type=1400 audit(1685073843.480:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=518 comm="apparmor_parser"
[    4.093393] audit: type=1400 audit(1685073843.480:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/haveged" pid=519 comm="apparmor_parser"
[    4.097246] audit: type=1400 audit(1685073843.484:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/cups/backend/cups-pdf" pid=517 comm="apparmor_parser"
[    4.097249] audit: type=1400 audit(1685073843.484:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/cupsd" pid=517 comm="apparmor_parser"
[    5.275915] kauditd_printk_skb: 7 callbacks suppressed
[    5.275916] audit: type=1400 audit(1685073844.664:18): apparmor="DENIED" operation="capable" profile="/usr/sbin/cupsd" pid=694 comm="cupsd" capability=12  capname="net_admin"
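The “can't claim BAR” lines in the 2049MB excerpt can be read more easily once the address ranges are turned into sizes. Here is a minimal Python sketch for doing that, not a definitive tool. The function name `decode_bar` and the returned field names are made up for illustration, and the regular expression is written only against the message format shown in the excerpt above.

```python
import re

# Matches the address range in a kernel "can't claim BAR" message,
# based only on the format seen in the excerpt above.
BAR_RE = re.compile(
    r"pci (?P<dev>[0-9a-f:.]+): can't claim BAR (?P<bar>\d+) "
    r"\[(?P<kind>mem|io)\s+0x(?P<start>[0-9a-f]+)-0x(?P<end>[0-9a-f]+)"
)


def decode_bar(line):
    """Return device, BAR index, size and placement for one log line, or None."""
    m = BAR_RE.search(line)
    if m is None:
        return None
    start = int(m.group("start"), 16)
    end = int(m.group("end"), 16)
    return {
        "device": m.group("dev"),
        "bar": int(m.group("bar")),
        "kind": m.group("kind"),
        "start": start,
        "size": end - start + 1,          # the range is inclusive
        "above_4g": start >= (1 << 32),   # placed above the 4 GiB boundary?
    }


line = (
    "[    0.427063] pci 0000:00:06.0: can't claim BAR 3 "
    "[mem 0x6280000000-0x6281ffffff 64bit pref]: no compatible bridge window"
)
info = decode_bar(line)
print(info["device"], info["bar"], info["size"] // (1 << 20), "MiB", info["above_4g"])
```

For the line above, BAR 3 of 0000:00:06.0 is a 32 MiB window that starts well above 4 GiB. Whether that placement is related to the RAM limit is not established in this thread.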

This is another thing that showed up at 2075MB:

[    5.451102] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x24:0xffff:1211)
[    5.451647] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0
[    5.542314] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x24:0xffff:1211)
[    5.542766] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0
[    5.675364] kauditd_printk_skb: 7 callbacks suppressed
[    5.675365] audit: type=1400 audit(1685075401.408:18): apparmor="DENIED" operation="capable" profile="/usr/sbin/cupsd" pid=696 comm="cupsd" capability=12  capname="net_admin"
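When comparing several RAM sizes, it can help to reduce each dmesg dump to a count of the NVIDIA failure messages it contains. The following is a rough Python sketch of that idea. The labels (`init_failed`, `off_bus`, `probe_failed`) and the function name `classify` are hypothetical, and the patterns are taken only from the messages quoted in this post, so other driver versions may word things differently.

```python
import re

# Patterns copied from the log excerpts in this thread; labels are made up.
PATTERNS = {
    "init_failed": re.compile(r"NVRM: GPU (\S+): RmInitAdapter failed! \((\S+)\)"),
    "off_bus": re.compile(r"fallen off the bus"),
    "probe_failed": re.compile(r"nvidia: probe of (\S+) failed with error (-?\d+)"),
}


def classify(dmesg_text):
    """Count how many lines of a dmesg dump match each failure pattern."""
    counts = {}
    for line in dmesg_text.splitlines():
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] = counts.get(label, 0) + 1
    return counts


sample = """\
[    5.451102] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x24:0xffff:1211)
[    5.451647] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0
[    3.647599] NVRM: The NVIDIA GPU 0000:00:06.0
               NVRM: fallen off the bus and is not responding to commands.
[    3.648112] nvidia: probe of 0000:00:06.0 failed with error -1
"""
print(classify(sample))
```

Feeding it the output of `dmesg` from each test boot would give a quick side-by-side view of which failure appears at which RAM size.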

And here are some from 5000MB:

[    0.021786]   0 base 0080000000 mask 7FC0000000 uncachable
[    0.021787]   1 base 00C0000000 mask 7FE0000000 uncachable
[    0.021788]   2 base 00E0000000 mask 7FF0000000 uncachable
[    0.021789]   3 base 00F0000000 mask 7FF8000000 uncachable
[    0.021789]   4 base 00F8000000 mask 7FFC000000 uncachable
[    0.021790]   5 base 0200000000 mask 7E00000000 uncachable
[    0.021791]   6 disabled
[    0.021791]   7 disabled

[    1.640537] pci 0000:00:04.0: EHCI: unrecognized capability a0
[    1.640603] pci 0000:00:04.0: EHCI: unrecognized capability 00

[    2.270383] piix4_smbus 0000:00:01.3: SMBus Host Controller not enabled!
[    2.275980] ACPI: bus type USB registered
[    2.275991] usbcore: registered new interface driver usbfs
[    2.275994] usbcore: registered new interface driver hub
[    2.275998] usbcore: registered new device driver usb
[    2.277582] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    2.277855] ehci-pci: EHCI PCI platform driver
[    2.278551] ehci-pci 0000:00:04.0: EHCI Host Controller
[    2.278553] ehci-pci 0000:00:04.0: new USB bus registered, assigned bus number 1
[    2.278670] ehci-pci 0000:00:04.0: can't setup: -19
[    2.278682] ehci-pci 0000:00:04.0: USB bus 1 deregistered
[    2.278960] ehci-pci 0000:00:04.0: init 0000:00:04.0 fail, -19

[    3.642920] nvidia-nvlink: Nvlink Core is being initialized, major device number 247

[    3.647552] nvidia 0000:00:06.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
[    3.647599] NVRM: The NVIDIA GPU 0000:00:06.0
               NVRM: (PCI ID: 10de:2560) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[    3.648112] nvidia: probe of 0000:00:06.0 failed with error -1
[    3.648125] NVRM: The NVIDIA probe routine failed for 1 device(s).
[    3.648126] NVRM: None of the NVIDIA devices were initialized.
[    3.650011] bochs-drm 0000:00:03.0: vgaarb: deactivate vga console
[    3.650179] [drm:bochs_hw_init [bochs_drm]] *ERROR* ID mismatch
[    3.652472] nvidia-nvlink: Unregistered the Nvlink Core, major device number 247
[    3.690959] nvidia-nvlink: Nvlink Core is being initialized, major device number 247

[    3.693728] nvidia 0000:00:06.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
[    3.693754] NVRM: The NVIDIA GPU 0000:00:06.0
               NVRM: (PCI ID: 10de:2560) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[    3.694195] nvidia: probe of 0000:00:06.0 failed with error -1
[    3.694204] NVRM: The NVIDIA probe routine failed for 1 device(s).
[    3.694205] NVRM: None of the NVIDIA devices were initialized.
[    3.699454] nvidia-nvlink: Unregistered the Nvlink Core, major device number 247
[    3.722924] xen:xen_evtchn: Event-channel device installed
[    3.730439] xen: --> pirq=17 -> irq=45 (gsi=45)
[    3.730534] snd_hda_intel 0000:00:07.0: Disabling MSI
[    3.730605] snd_hda_intel 0000:00:07.0: number of I/O streams is 30, forcing separate stream tags
[    3.763901] memmap_init_zone_device initialised 32768 pages in 0ms
[    3.823953] EXT4-fs (xvdb): mounted filesystem with ordered data mode. Opts: discard
[    3.830660] hdaudio hdaudioC0D0: no AFG or MFG node found
[    3.830732] hdaudio hdaudioC0D1: no AFG or MFG node found
[    3.830796] hdaudio hdaudioC0D2: no AFG or MFG node found
[    3.830859] hdaudio hdaudioC0D3: no AFG or MFG node found
[    3.830921] hdaudio hdaudioC0D4: no AFG or MFG node found
[    3.830984] hdaudio hdaudioC0D5: no AFG or MFG node found
[    3.831046] hdaudio hdaudioC0D6: no AFG or MFG node found
[    3.831108] hdaudio hdaudioC0D7: no AFG or MFG node found
[    3.831136] snd_hda_intel 0000:00:07.0: no codecs initialized
[    3.907361] audit: type=1400 audit(1685075795.280:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=467 comm="apparmor_parser"
[    3.907363] audit: type=1400 audit(1685075795.280:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=470 comm="apparmor_parser"
[    3.907364] audit: type=1400 audit(1685075795.280:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=470 comm="apparmor_parser"
[    3.907364] audit: type=1400 audit(1685075795.284:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=475 comm="apparmor_parser"
[    3.907365] audit: type=1400 audit(1685075795.284:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=475 comm="apparmor_parser"
[    3.907366] audit: type=1400 audit(1685075795.284:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=475 comm="apparmor_parser"
[    3.907367] audit: type=1400 audit(1685075795.284:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/haveged" pid=476 comm="apparmor_parser"
[    3.909107] audit: type=1400 audit(1685075795.288:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/cups-browsed" pid=477 comm="apparmor_parser"
[    3.912415] audit: type=1400 audit(1685075795.292:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/cups/backend/cups-pdf" pid=474 comm="apparmor_parser"
[    4.089726] nvidia-nvlink: Nvlink Core is being initialized, major device number 247

[    4.096397] nvidia 0000:00:06.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
[    4.096442] NVRM: The NVIDIA GPU 0000:00:06.0
               NVRM: (PCI ID: 10de:2560) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[    4.097108] nvidia: probe of 0000:00:06.0 failed with error -1
[    4.097131] NVRM: The NVIDIA probe routine failed for 1 device(s).
[    4.097132] NVRM: None of the NVIDIA devices were initialized.
[    4.097621] nvidia-nvlink: Unregistered the Nvlink Core, major device number 247
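The MTRR lines at the top of the 5000MB excerpt are easier to interpret as address ranges. Below is a small Python sketch that decodes them, assuming each mask is contiguous so that its lowest set bit equals the size of the range. The function name `decode_mtrr` is hypothetical, and the pattern is written only against the lines quoted above.

```python
import re

# Matches variable-range MTRR lines in the format quoted in the excerpt.
MTRR_RE = re.compile(
    r"(?P<idx>\d+) base (?P<base>[0-9A-Fa-f]+) "
    r"mask (?P<mask>[0-9A-Fa-f]+) (?P<kind>\S+)"
)


def decode_mtrr(line):
    """Return (start, end_inclusive, type) for one MTRR line, or None."""
    m = MTRR_RE.search(line)
    if m is None:
        return None  # e.g. "6 disabled"
    base = int(m.group("base"), 16)
    mask = int(m.group("mask"), 16)
    # Assumption: the mask is contiguous, so its lowest set bit is the size.
    size = mask & -mask
    return base, base + size - 1, m.group("kind")


lines = [
    "[    0.021786]   0 base 0080000000 mask 7FC0000000 uncachable",
    "[    0.021787]   1 base 00C0000000 mask 7FE0000000 uncachable",
    "[    0.021788]   2 base 00E0000000 mask 7FF0000000 uncachable",
    "[    0.021789]   3 base 00F0000000 mask 7FF8000000 uncachable",
    "[    0.021789]   4 base 00F8000000 mask 7FFC000000 uncachable",
    "[    0.021790]   5 base 0200000000 mask 7E00000000 uncachable",
    "[    0.021791]   6 disabled",
]
ranges = [r for r in map(decode_mtrr, lines) if r is not None]
for start, end, kind in ranges:
    print(f"{start:#012x}-{end:#012x} {kind}")
lowest = min(start for start, _, _ in ranges)
print("lowest uncachable start:", hex(lowest))
```

For these lines, the lowest uncachable range starts at 0x80000000, which is 2 GiB and coincides with the 2048MB threshold described above. That is only an observation from the log; the thread does not confirm it as the cause.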

Maybe none of that is helpful, but I can offer more logs and do more testing if requested. Thanks again for your help

disp6252 · May 26, 2023, 4:57am

Maybe the newer qubes tools depend on new stubdom as well and will fail with old stubdom. But it’s just a guess.

tempmail · May 26, 2023, 4:59am

What I don’t get is that no dev has considered fixing this issue in 10 or more years of Qubes, yet “we would like to attract more users to Qubes”…

Peter46 · May 26, 2023, 5:35pm

Yes, all but the simplest programs I use require a powerful GPU. CAD, 3D modeling/rendering, photo/video editing, some small AI models, and even gaming, although I can understand that gaming isn’t a priority for Qubes. Anyone who uses these programs will have to dual boot or purchase two computers if GPU passthrough is impossible with more than a tiny amount of RAM, or even if it is too hard for the average user which it seems to be. I think I will go the dual boot route to use Qubes and still be able to do what I need with my GPU.

tempmail · May 28, 2023, 8:17pm

If you use dual boot, then you nullify the basic concept of Qubes.
Is it possible to run CAD with 2GB RAM and powerful GPU? Did you try it?

Peter46 · May 29, 2023, 12:37am

Would it really make Qubes that useless? The documentation only mentions 2 risks: compromising /boot and compromising BIOS firmware. The /boot compromise can be detected with AEM, although compromised BIOS firmware is still a problem. Since the alternative is not using Qubes at all (I only have one machine with good enough hardware, and the programs that Qubes can’t run are necessary to me), I think dual boot might be better than nothing, for me at least. It is certainly not a great solution and I may be overlooking some other risks.

I initially didn’t try to run CAD because the minimum requirements listed were 8GB of RAM, but I did install and test it now and it seems to start up with 2GB and work fine for simple tasks which is great news so thank you for the suggestion! Video editing and gaming are still out of the question, but at least Qubes can use a GPU for a few things.

likeafox · June 9, 2023, 9:11pm

Have people had success with GPU passthrough to a HVM with 8GB or more RAM?

I would like to know if there’s a chance it will work as I hope, before buying hardware to try it out myself. The hardware I’m looking at is i5 13600 and RTX 3060 Ti.

tempmail · June 10, 2023, 12:45pm

It is not possible regardless of hardware, but I run smoothly both Linux and Win11 HVMs with NVIDIA “passthroughed” to them. It depends on how and for what you intend to use HVM.

neowutran · June 10, 2023, 1:56pm

I have no issues. I have been running a gaming HVM since 2019; currently I have a 4080 in a Linux HVM with 32 GB of RAM dedicated to it.

tempmail · June 10, 2023, 3:13pm

Which Qubes version? I have applied both of the patches you described on your site, but I can’t get more than 2118MB for the Win11 HVM and 2064MB for the Fedora HVM.

neowutran · June 10, 2023, 3:16pm

It works with R4.0, R4.1 and R4.2 (development version); I am currently using R4.2 with an Arch Linux HVM.
But I didn’t do anything more than what I wrote on my site.

tempmail · June 10, 2023, 3:18pm

Great to hear. I never saw anyone except you confirm that it works. Any idea how to even track this down?

tempmail · June 10, 2023, 3:20pm

What is more peculiar, after patching xen.xml even my old Win7 HVM, which has no devices attached, can’t be assigned more than 2GB of RAM now, and it worked before with 6144MB.

neowutran · June 10, 2023, 3:40pm

Check the name of your qube, and try using only the stubdom patching method.
https://neowutran.ovh/qubes/articles/gaming_windows_hvm.html#patching-stubdom-linux-rootfs.gz

If your qube is named “gpu_test” and has no GPU passed through, you can expect issues after applying the stubdom-linux-rootfs.gz patch.
gpu_xxx → must have a GPU assigned
not starting with “gpu_” → works like before the patch

You might also need to patch “qemu-stubdom-linux-full-rootfs” the same way.
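The naming rule above can be restated as a small Python sketch. This is only an illustration of the rule as described in this post, not code from the guide or from Qubes. The function name `expected_behavior` and the returned strings are hypothetical.

```python
def expected_behavior(qube_name, has_gpu_assigned):
    """Restate the naming rule from the post for a patched stubdom.

    Hypothetical helper: the patch is described as keying off the
    "gpu_" name prefix, so the prefix and the assigned devices must agree.
    """
    if qube_name.startswith("gpu_"):
        if has_gpu_assigned:
            return "patched behavior applies"
        return "expect issues: gpu_ prefix but no GPU assigned"
    return "works like before the patch"


for name, gpu in [("gpu_test", False), ("gpu_gaming", True), ("work", False)]:
    print(name, "->", expected_behavior(name, gpu))
```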

tempmail · June 11, 2023, 8:13am

Thanks. Yes, I have not misused the naming; I was strict about that.
As far as I can remember, I tried what you suggested, but in the end found that only patching xen.xml had any effect, while patching the stubdoms (either alone or together with xen.xml) didn’t change anything.

But I will give it another try and will give feedback here.

Best regards

AmazingArchUser123 · August 9, 2023, 4:59pm

This issue still exists on the latest version. I can also confirm that dmesg lists NVIDIA as “fallen off the bus and is not responding to commands” when using > 2G of memory.

xvrhthxn · September 27, 2023, 11:34pm

I got it working with an NVIDIA RTX 4090.

Peter46 · September 28, 2023, 3:51am

It works! I can’t believe how simple and obvious your solution is. Thank you very much for this, I have spent so much time trying to figure it out.

@neowutran perhaps you can include a note in your guide for people who have a 2GB RAM limit rather than 3.5GB.

Thanks everyone for your help!