NVIDIA GPU passthrough into Linux HVMs for CUDA applications

My “solution” was to use “none” for the kernel, then post the above to try to find out whether there was a different way I was supposed to do it. His instructions show kernel “5.10.96-1.fc32” being used, which I believe means a kernel installed in the Qubes dom0. If that is the case, and his screenshot is from a working system, then he must have done additional steps to get the matching headers into the VM to compile against, but I don’t know what those steps are.

Also of interest: you need to look up the card you have and find out which NVIDIA driver it needs, as there are separate drivers for separate “generations” of cards, meaning you’d type a different package name in:

sudo dnf install akmod-nvidia

(it would still have the akmod string in the package name)
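For example (the suffixed names below are RPM Fusion’s legacy driver packages, written from memory — check which series your card actually needs):

lspci -nn | grep -i nvidia              # identify the card and its PCI ID first
sudo dnf install akmod-nvidia           # current-generation cards
# older generations use suffixed packages instead, e.g.:
#   sudo dnf install akmod-nvidia-470xx
#   sudo dnf install akmod-nvidia-390xx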

I got the driver to load successfully, but I still don’t have working CUDA.

Also, it’s unlikely, but people looking at this page might also be interested in OpenCL: Success Report: Passthrough GPU used by OpenCL success (kinda)

Which Fedora version are you using? I still haven’t got the module to install for some reason; I’ve tried many other things without success. The only thing I could find that solved the problem for others is disabling UEFI Secure Boot, but I’m guessing that doesn’t apply to the VM I have, although I’m not sure how to check.

For whatever reason, the kmod refuses to build and install.

Why is cuda not working for you?

nouveau is the free version? I believe I blacklisted nouveau in /etc/mod{something} so the nvidia driver could load.

I don’t know unfortunately.

I tried that as well, but it didn’t seem to solve it. I decided to try Debian 10, and the good news is it seems to be working, including CUDA.

I have a 1660 Ti passed through to a Debian 10 VM with driver version 470.129.06 and CUDA version 11.4, according to nvidia-smi.

I tested it with Blender and it seems to detect the GPU and CUDA properly.

My steps to get this were pretty straightforward and just followed existing documentation from both Qubes and NVIDIA:

  1. Preliminary steps like checking IOMMU groups (I didn’t, since I’d already had success with the same system on a Windows 10 HVM), hiding the GPU PCI devices, and using some other GPU for dom0/GUI (I use the CPU’s iGPU for now).
  2. Create a new standalone qube based on Debian 10 and change the settings: virtualization = HVM, kernel = none, disable memory balancing and set initial RAM to something reasonable, and increase private storage to 10 GB.
  3. Pass through the GPU PCI devices as described in the OP’s post.
  4. Start the VM, run sudo apt update (and upgrade), and restart if the kernel was updated.
  5. Install the NVIDIA driver as described here: How to Install Nvidia Drivers on Debian
    5a. It might report a conflict because the nouveau driver is currently loaded. Fix this by restarting (I didn’t need to manually blacklist nouveau; it seemed to work automatically after restarting the VM).
    5b. I tried running nvidia-smi and it was failing with “unable to determine the device handle for GPU (pci id): Unknown Error”. I think I fixed this while installing CUDA, maybe because it updated the driver.
  6. Run sudo apt install software-properties-common so that the add-apt-repository command is available in the next steps.
  7. Install CUDA. I first tried aws - Installing CUDA on Debian Machine 10.3 - Unix & Linux Stack Exchange but it failed, so I then followed Installation Guide Linux :: CUDA Toolkit Documentation and it worked. I think the most notable part of the latter, and the thing that made it work, is installing the cuda-keyring package.
    7a. Similar to 5a, except this time it’s because the older driver is still loaded after the update to a newer one. Restart the VM.
  8. Done. nvidia-smi should now work and show CUDA as installed correctly, and lspci -k should reflect that the NVIDIA drivers are being used instead of nouveau. (A condensed command listing follows below.)
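Condensed into commands, steps 4–8 look roughly like this (package names and the keyring URL are from memory of the linked guides, so double-check them — the keyring version in particular changes over time):

sudo apt update && sudo apt upgrade          # step 4; reboot if the kernel was updated
sudo apt install nvidia-driver               # step 5; needs the contrib/non-free sources enabled
sudo apt install software-properties-common  # step 6
wget https://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb      # step 7: adds NVIDIA’s CUDA repository
sudo apt update && sudo apt install cuda
sudo reboot                                  # step 7a
nvidia-smi                                   # step 8: should report the driver and CUDA version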

Unlike the Windows 10 VM I did a while back, which displayed to a separate monitor, this VM runs in seamless mode. I haven’t looked into what causes that difference or how to configure it differently, but I thought it was worth mentioning.

Unfortunately I haven’t been able to make it useful yet; it seems very, very slow. I think that’s because it’s falling back to llvmpipe. I’ve tried many things to fix that, including messing around with Xorg, but no success so far. I’m a bit lost and will probably have to learn more about Xorg. Some things I’ve noticed so far:

  • Including an xorg.conf stopped the VM windows from being displayed at all.
  • Xorg will not stay running unless you modify qubes-run-xorg as described by the OP.
  • If Xorg stays running, all the GPU seems to output to the displays is a blank white screen.
  • Looking at Xorg.0.log, the only thing I could notice is (II) modeset(G0): Refusing to try glamor on llvmpipe
    and right below it (EE) modeset(G0): glamor initialization failed

Wait, are you trying to use a monitor plugged into the card, or trying to use the card for computations while still displaying to the normal “seamless” qubes screen?

If it’s the first then maybe you have to turn on debugging mode (in the qubes settings) to disable the attempt to display to the seamless screen?
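(I think the dom0 equivalent of that checkbox is something like the following, with the VM name as a placeholder:)

qvm-prefs gpu-vm debug True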

I want to use it for rendering work in Blender, so I’d actually prefer seamless, but I’ll be happy with either at this point. Thanks for the tip; I hadn’t considered that.

Sure thing. Say, I wonder if we need to find out what library/package they are using for OpenGL emulation and disable that?
(or maybe make a copy of the VM and try removing the package)

I still get a white screen after setting it to debug mode.

Just to clarify, I assume the GPU is not being used by some/all software in the VM, since everything performs badly and, for example, inxi -Gx shows OpenGL is using llvmpipe. From looking online, my understanding is that this could be because the GPU needs to be specified in the Xorg configuration. But my attempts at solving it have so far been unsuccessful, and I’m not sure I’m looking in the right places. I’ve tried many things with Xorg, but nothing has clicked yet.

I noticed the BusID of the GPU’s VGA PCI device changes between VM sessions, which was causing Xorg not to start, so I added a script to update it at each boot (a modified version of a solution I found online).
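Roughly, the idea is something like this (a sketch, not my exact script — the config path is a placeholder and the matching assumes the NVIDIA card is the only 10de VGA/3D device in the VM):

#!/bin/bash
# Rewrite the BusID line in the Xorg config to match the GPU’s current PCI slot.
CONF=/etc/X11/xorg.conf.d/20-nvidia-busid.conf   # placeholder path
# lspci prints the slot in hex, e.g. "00:06.0"; Xorg wants decimal "PCI:0:6:0".
SLOT=$(lspci -d 10de: | grep -Ei 'VGA|3D' | head -n1 | awk '{print $1}')
BUS=$((16#${SLOT%%:*}))
DEVFN=${SLOT#*:}
DEV=$((16#${DEVFN%%.*}))
FN=${DEVFN##*.}
sed -i "s/BusID .*/BusID \"PCI:$BUS:$DEV:$FN\"/" "$CONF"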

But now that that is fixed, I can see Xorg is running and recognizes my GPU, yet everything else is still broken.

In the Xorg logs, right under where the nvidia module is loaded, I found: (WW) Falling back to old probe method for dummyqbs. Is that normal, or could it be a hint that the nvidia module isn’t working correctly?

Assuming it is working fine and Xorg is not my problem, I wonder what I need to fix instead.

Hoping someone can share insights to a solution.

Thanks for posting these instructions!

I additionally had to make these changes to get my headless GPU qube working:

  1. set the kernel to “none”

  2. blacklist nouveau

$ cat /etc/modprobe.d/nouveau.conf 
blacklist nouveau
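(One extra thing that can be needed on some setups — I didn’t need it, so treat it as an assumption — is regenerating the initramfs so a baked-in nouveau doesn’t load before the blacklist takes effect:)

sudo dracut -f                    # Fedora
# or: sudo update-initramfs -u    # Debian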

It seems to be working, but it won’t boot with more than 3 GB of RAM, despite patching the stubdom rootfs file.

Is anyone else using more than 3 GB of RAM?

Is there an easy way to verify that the "max-ram-below-4g" argument is being passed, or another way to try to debug this?
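The only crude check I can think of is re-extracting the rootfs I patched and grepping for the argument (the path below is just a placeholder for whatever file was patched):

mkdir /tmp/stubroot && cd /tmp/stubroot
zcat /path/to/stubdom-linux-rootfs.gz | cpio -idv
grep -r "max-ram-below-4g" .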

Thanks!

I was having problems with akmod-nvidia not building against the Qubes kernel. Using a VM kernel fixed the akmod build but led to endless problems with X11 which I couldn’t fully solve.

So instead I installed the official Nvidia drivers and it worked! Here’s what I did:

  1. sudo dnf install dkms because the installer uses DKMS instead of akmod
  2. download the official Linux driver from the NVIDIA site and make the resulting .run file executable
  3. start up a terminal for my CUDA VM through the Qube Manager
  4. stop the Qubes X11 integration with sudo systemctl stop qubes-gui-agent, because otherwise the NVIDIA installer complains about X running
  5. run the installer as root and don’t let NVIDIA configure X
  6. restart qubes-gui-agent

With this, I have a VM able to run CUDA and I can either use the Qubes GUI or start up a second X server on the card if needed.
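Condensed into commands, that was roughly the following (the .run filename is whatever you download from NVIDIA, so treat it as a placeholder):

sudo dnf install dkms                  # step 1: the installer uses DKMS instead of akmod
chmod +x NVIDIA-Linux-x86_64-*.run     # step 2: the driver downloaded from the NVIDIA site
sudo systemctl stop qubes-gui-agent    # step 4: otherwise the installer complains about X running
sudo ./NVIDIA-Linux-x86_64-*.run       # step 5: decline the offer to configure X
sudo systemctl start qubes-gui-agent   # step 6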

First, thanks for the above write up and all follow-up hints.

I’m trying to access the built-in NVIDIA GPU on my Dell Latitude 3520 for use with CUDA.

Following the above steps, I got to the point where the standalone Fedora VM sees the GPU with lspci and the installed kernel module seems to load successfully. However, running sudo nvidia-smi just prints “No devices were found”, while the dmesg output includes:

[   72.281766] NVRM: GPU 0000:00:06.0: Failed to copy vbios to system memory.
[   72.281970] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x30:0xffff:976)
[   72.282820] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0
[   72.498475] NVRM: GPU 0000:00:06.0: Failed to copy vbios to system memory.
[   72.498651] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x30:0xffff:976)
[   72.499189] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0

There seem to be quite a few discussions on the NVIDIA forums mentioning the above messages (e.g., NVRM: failed to copy vbios to system memory. - #19 by generix - Linux - NVIDIA Developer Forums), but it’s hard to distill them into any meaningful hint for me.

I’ve tried with two different kernels: 5.15 and 6.5.8 - same behavior.

I haven’t done anything with the IOMMU groups. If this could be the reason, could you please elaborate on what needs to be done there and how?

lspci -k gives:

[user@gpu-fedora ~]$ lspci -k
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
	Subsystem: Red Hat, Inc. Qemu virtual machine
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
	Subsystem: Red Hat, Inc. Qemu virtual machine
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
	Subsystem: Red Hat, Inc. Qemu virtual machine
	Kernel driver in use: ata_piix
	Kernel modules: pata_acpi, ata_generic
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
	Subsystem: Red Hat, Inc. Qemu virtual machine
	Kernel modules: i2c_piix4
00:02.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
	Subsystem: XenSource, Inc. Xen Platform Device
	Kernel driver in use: xen-platform-pci
00:04.0 VGA compatible controller: Device 1234:1111 (rev 02)
	Subsystem: Red Hat, Inc. Device 1100
	Kernel modules: bochs
00:05.0 USB controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller (rev 10)
	Subsystem: Red Hat, Inc. QEMU Virtual Machine
	Kernel driver in use: ehci-pci
	Kernel modules: ehci_pci
00:07.0 3D controller: NVIDIA Corporation GP107M [GeForce MX350] (rev a1)
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_drm, nvidia

If anyone has any clues, please post. Thanks!

I was finally able to get it to work on my laptop. I had to blacklist nouveau, add the options below to my grub config, and restart the whole laptop several times:

swiotlb=65536 rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1 pcie_aspm=off rcutree.rcu_idle_gp_delay=5
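(For anyone repeating this: the usual way to apply such options is to append them to the GRUB_CMDLINE_LINUX line in /etc/default/grub and regenerate the config — which command that is depends on the distro:)

sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # Fedora-based
# or: sudo update-grub                        # Debian-based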

But the problem is that if I set the RAM to anything more than 2 GB, I get:

[    6.155057] nvidia 0000:00:07.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=io+mem
[    6.155120] NVRM: The NVIDIA GPU 0000:00:07.0
               NVRM: (PCI ID: 10de:249d) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[    6.155967] nvidia: probe of 0000:00:07.0 failed with error -1
[    6.156023] NVRM: The NVIDIA probe routine failed for 1 device(s).
[    6.156038] NVRM: None of the NVIDIA devices were initialized.
[    6.156333] nvidia-nvlink: Unregistered Nvlink Core, major device number 238
[    7.532805] systemd-journald[623]: /var/log/journal/810e858e12024bb1a49e0b6fa0fba4fc/user-1000.journal: Monotonic clock jumped backwards relative to last journal entry, rotating.
[    8.053502] nouveau 0000:00:07.0: unknown chipset (ffffffff)
[    9.604408] nvidia-nvlink: Nvlink Core is being initialized, major device number 236

I have already patched stubdom-linux-rootfs.gz to be able to add more than 3.5 GB of RAM, and it is working, because without this patch the HVM itself wouldn’t start.

I’ve attached the output of dmesg and lspci -kvvv with 32 GB of memory passed to the HVM. If anyone knows any workaround, please share.

(attachments: dmesg, lspci)

How exactly did you patch it (with max-ram-below-4g set to 3.5 GB or 2 GB), and did you try patching only xen.xml instead (again with 2 GB, not 3.5 GB)?

Now, after the latest Qubes dom0 updates, my gpu_3n5G_LLM standalone boots to “No Bootable device” when the RAM is greater than 3000 MB. I re-applied the patch to the stubdom, but I get the same error. Any thoughts will be appreciated. This is with a Debian 11 template I migrated to Debian 12. Making a new GPU qube by cloning the up-to-date debian-12 template into a standalone, I get a “failed to start: unknown pci header type 127” error message when I raise the RAM above 3000 MB.

The symptoms look similar to what many of us are experiencing in

Please join us on that topic as it seems that the issue arises in a part that’s not specific to this guide.

Can you please clarify some steps: in the end, did you have X autoconfigured, or did you do it manually?
Is there any way to autoconfigure X with my custom arrangement of GPUs and monitors?

I have installed the drivers and CUDA (nvidia-smi works fine), but it looks like I now need to configure Xorg correctly. Can someone show me how to correctly add/modify a second device/screen/monitor/server in Xorg/X? Can I just add a second device, a screen, and a monitor? If yes, where can I find the parameters of my second monitor?

In my gpu-linux VM’s lspci:
00:03.0 VGA compatible controller: Device 1234:1111 (rev 02)
	Kernel driver in use: bochs-drm
	Kernel modules: bochs
00:06.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3070 Ti] (rev a1)
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_drm, nvidia

Is the first one the virtual VGA device from the host system?

xrandr -q
Screen 0: minimum 64 x 64, current 1920 x 1200, maximum 32767 x 32767
DUMMY0 connected primary 1920x1200+0+0 0mm x 0mm
QB1920x1200 41.50*+
DUMMY1 disconnected

Do I have two screens here, or is it the same one? I can’t see anything on the monitor plugged into the GPU that’s passed through to the VM.

my /etc/X11/xorg.conf.d/10-nvidia.conf:

Section "OutputClass"
	Identifier "nvidia"
	MatchDriver "nvidia-drm"
	Driver "nvidia"
	Option "AllowEmptyInitialConfiguration"
	Option "PrimaryGPU" "yes"
	Option "SLI" "auto"
	Option "BaseMosaic" "on"
EndSection

Section "OutputClass"
	Identifier "intel"
	MatchDriver "i915"
	Driver "modesetting"
EndSection

my /etc/X11/xorg-qubes.conf:

Section "Module"
	Load "fb"
	Load "glamoregl"
EndSection

Section "ServerLayout"
	Identifier "Default Layout"
	Screen 0 "Screen0" 0 0
	InputDevice "qubesdev"
EndSection

Section "Device"
	Identifier "Videocard0"
	Driver "dummyqbs"
	VideoRam 17101
	Option "GUIDomID" "0"
EndSection

Section "Monitor"
	Identifier "Monitor0"
	HorizSync 49-50
	VertRefresh 41-42
	Modeline "QB1920x1200" 96 1920 1921 …
EndSection

Section "Screen"
	Identifier "Screen0"
	Device "Videocard0"
	Monitor "Monitor0"
	DefaultDepth 24
	SubSection "Display"
		Viewport 0 0
		Depth 24
		Modes "QB1920x1200"
	EndSubSection
EndSection

Section "InputDevice"
	Identifier "qubesdev"
	Driver "qubes"
EndSection

In the end I switched to an Arch Linux VM built using Qubes builder and containing the Qubes windowing system. Getting the driver installed on Fedora was annoying, and the drivers in Debian are very old.

I use a custom Xorg config named dedicated_gpu_X.conf:

Section "Device"
	Driver "nvidia"
	Identifier "Nvidia"
	BusID "PCI:0:8:0"
EndSection

and start a second X server with DISPLAY=:1 startx -- -config dedicated_gpu_X.conf

I need to check every boot that the PCI address of the GPU in lspci still matches the BusID in the X config.

Can you please provide a guide for the Arch VM (or maybe just the quick steps of what should be done)? It’s been 3 weeks since I started on these guides, and I still only have a partial result because the Xorg setup isn’t easy for me right now.

For Qubes 4.1 I think I followed this guide: 'archlinux-minimal' template
For Qubes 4.2 I haven’t yet managed to get qubes-builder v2 working, but I found there is an archlinux template in the qubes-templates-community-testing repositories, as shown by qvm-template list in dom0.

As for installing the NVIDIA driver, I only installed the nvidia-open-dkms package; I don’t think I did any further configuration.
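For reference, that amounts to something like the following (Arch package names; the headers package has to match whichever kernel the VM actually boots):

sudo pacman -S --needed nvidia-open-dkms linux-headers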