NVIDIA GPU passthrough into Linux HVMs for CUDA applications

crat0z · February 16, 2022, 10:18pm

While it is well known that gaming in Windows HVMs is easy to do now, I haven’t seen much talk about Linux HVMs. In my personal use case, I wanted to run CUDA applications in a “headless” manner. By this, I mean being able to use the Qubes seamless GUI with a CUDA device in an AppVM.

If you would like to do this, below are the steps I took to create this. Note that I am not sure if this will work on 4.0. This guide will use a Standalone Fedora VM, and install the RPMFusion NVIDIA drivers. This should work for other OS/driver packages. I am just keeping it simple to give you a working setup fast.

First, follow steps 1-3 of the gaming in Windows HVMs guide, i.e. ensure IOMMU groups are good, edit your GRUB options to hide your GPU device, and patch the stubdom-linux-rootfs.gz file.

Now we will create the VM. In this case, create a new StandaloneVM from a Fedora template.

[screenshot: 2022-02-16_13-55]

and configure your VM settings as such:

The VCPU count doesn’t need to be as high as mine if you don’t have that many cores; just set it to an appropriate value. Now, in dom0, attach your GPU’s PCI devices to the VM. Again, following the Windows guide, whether you need permissive mode depends on your system. In this example, my GPU is 01:00.0 and my GPU’s audio device is 01:00.1:

qvm-pci attach --persistent gpu-linux dom0:01_00.0 -o permissive=True -o no_strict_reset=True
qvm-pci attach --persistent gpu-linux dom0:01_00.1 -o permissive=True -o no_strict_reset=True
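
The address translation used above (01:00.0 as printed by lspci becomes dom0:01_00.0 for qvm-pci) is easy to get wrong by hand. Here is a minimal Python sketch that builds the same command line for any address. The helper name qvm_pci_attach_cmd is made up for illustration and is not part of Qubes; it only builds the argument list, it does not run anything.

```python
import re

def qvm_pci_attach_cmd(vm, bdf, persistent=True, options=None):
    """Build a `qvm-pci attach` argument list for a dom0 PCI device.

    `bdf` is the bus:device.function address as printed by lspci in dom0,
    e.g. "01:00.0". qvm-pci writes the same address as "dom0:01_00.0".
    """
    if not re.fullmatch(r"[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]", bdf):
        raise ValueError(f"not a bus:device.function address: {bdf!r}")
    cmd = ["qvm-pci", "attach"]
    if persistent:
        cmd.append("--persistent")
    cmd += [vm, "dom0:" + bdf.replace(":", "_")]
    # Each option becomes its own "-o key=value" pair, as in the commands above.
    for key, value in (options or {}).items():
        cmd += ["-o", f"{key}={value}"]
    return cmd

# Example: the GPU and its audio function from this guide.
for address in ("01:00.0", "01:00.1"):
    print(" ".join(qvm_pci_attach_cmd(
        "gpu-linux", address,
        options={"permissive": "True", "no_strict_reset": "True"},
    )))
```

Whether you need the permissive and no_strict_reset options still depends on your system, so treat the options dict as something to adjust.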

Start your VM now. lspci -k should output something similar to this:

[user@gpu-linux ~]$ lspci -k
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
	Subsystem: Red Hat, Inc. Qemu virtual machine
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
	Subsystem: Red Hat, Inc. Qemu virtual machine
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
	Subsystem: Red Hat, Inc. Qemu virtual machine
	Kernel driver in use: ata_piix
	Kernel modules: pata_acpi, ata_generic
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
	Subsystem: Red Hat, Inc. Qemu virtual machine
	Kernel modules: i2c_piix4
00:02.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
	Subsystem: XenSource, Inc. Xen Platform Device
	Kernel driver in use: xen-platform-pci
00:04.0 VGA compatible controller: Device 1234:1111 (rev 02)
	Subsystem: Red Hat, Inc. Device 1100
	Kernel modules: bochs_drm
00:05.0 USB controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller (rev 10)
	Subsystem: Red Hat, Inc. QEMU Virtual Machine
	Kernel driver in use: ehci-pci
	Kernel modules: ehci_pci
00:07.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3070] (rev a1)
	Subsystem: PNY Device 136f
	Kernel modules: nouveau
00:08.0 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
	Subsystem: PNY Device 136f
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
[user@gpu-linux ~]$

As we can see, 00:07.0 corresponds to the GPU, and 00:08.0 corresponds to the GPU’s audio device. If you can get this far, the rest of the guide should work.
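
If you want to check this from a script rather than by eye, the following Python sketch pulls the NVIDIA devices and their "Kernel driver in use" entries out of lspci -k output. The function name nvidia_drivers is hypothetical, and the parsing assumes the layout shown above (an unindented device line followed by indented detail lines).

```python
def nvidia_drivers(lspci_k_output):
    """Map each NVIDIA device address to its 'Kernel driver in use' (or None)."""
    drivers = {}
    current = None
    for line in lspci_k_output.splitlines():
        if line and not line[0].isspace():
            # Device header, e.g. "00:07.0 VGA compatible controller: NVIDIA ..."
            address = line.split(" ", 1)[0]
            current = address if "NVIDIA" in line else None
            if current:
                drivers[current] = None
        elif current and "Kernel driver in use:" in line:
            drivers[current] = line.split(":", 1)[1].strip()
    return drivers

# Typical use inside the VM (requires lspci to be installed):
#   import subprocess
#   out = subprocess.run(["lspci", "-k"], capture_output=True, text=True, check=True).stdout
#   print(nvidia_drivers(out))
```

At this stage of the guide the GPU entry should have no driver in use yet (None), since only the nouveau module is listed as available.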

For now, as mentioned at the start of the post, we will keep it simple and use RPMFusion’s drivers. In your VM, enable RPMFusion’s repos:

[user@gpu-linux ~]$ sudo dnf config-manager --enable rpmfusion-{free,nonfree}{,-updates}

Now, following RPMFusion’s NVIDIA page:

sudo dnf update -y # and reboot if you are not on the latest kernel
sudo dnf install akmod-nvidia # rhel/centos users can use kmod-nvidia instead
sudo dnf install xorg-x11-drv-nvidia-cuda #optional for cuda/nvdec/nvenc support

DO NOT RESTART AT THIS POINT!!!
You must wait for akmod to finish building:

[user@gpu-linux ~]$ modinfo -F version nvidia
modinfo: ERROR: Module nvidia not found. // not finished yet
[user@gpu-linux ~]$ modinfo -F version nvidia
510.47.03 // completed!
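
Instead of re-running modinfo by hand, you can poll for it. This is a minimal Python sketch, not something the driver packages ship: the function names, the 10 second interval and the 30 minute timeout are all my own assumptions. It relies only on the modinfo -F version nvidia command shown above.

```python
import subprocess
import time

def nvidia_module_version():
    """Return the nvidia module version reported by modinfo, or None if not built yet."""
    result = subprocess.run(
        ["modinfo", "-F", "version", "nvidia"],
        capture_output=True, text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None

def wait_for_module(check=nvidia_module_version, interval=10.0, timeout=1800.0,
                    sleep=time.sleep):
    """Poll `check` until it returns a version string or `timeout` seconds pass."""
    waited = 0.0
    while True:
        version = check()
        if version:
            return version
        if waited >= timeout:
            raise TimeoutError("nvidia module still not found; the akmod build may have failed")
        sleep(interval)
        waited += interval

# Typical use inside the VM:
#   print("nvidia module built, version", wait_for_module())
```

The timeout is there because a build that never finishes usually means it failed rather than that it is slow.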

STILL, DO NOT RESTART AT THIS POINT!!! If you restarted your VM already and it is not working, either create a new VM from step 1 or go to the end of this post for information on debugging a broken Xorg.

RPMFusion’s NVIDIA driver package creates a file which will BREAK XORG. If you are installing from another repo or for another OS, you might get extra unwanted files as well. In RPMFusion’s case, look here:

[user@gpu-linux ~]$ ls /usr/share/X11/xorg.conf.d/
10-quirks.conf  40-libinput.conf  71-libinput-overrides-wacom.conf  nvidia.conf
[user@gpu-linux ~]$ cat /usr/share/X11/xorg.conf.d/nvidia.conf 
#This file is provided by xorg-x11-drv-nvidia
#Do not edit

Section "OutputClass"
	Identifier "nvidia"
	MatchDriver "nvidia-drm"
	Driver "nvidia"
	Option "AllowEmptyInitialConfiguration"
	Option "SLI" "Auto"
	Option "BaseMosaic" "on"
EndSection

Section "ServerLayout"
	Identifier "layout"
	Option "AllowNVIDIAGPUScreens"
EndSection

Xorg will not be happy about this. While the file reads Do not edit, we are gonna do exactly that and just delete it. (someone else better with Xorg please provide a nicer solution)

[user@gpu-linux ~]$ sudo rm /usr/share/X11/xorg.conf.d/nvidia.conf 
[user@gpu-linux ~]$ ls /usr/share/X11/xorg.conf.d/
10-quirks.conf  40-libinput.conf  71-libinput-overrides-wacom.conf

After you delete the file, you can restart the VM. Your lspci -k should confirm that the driver loaded!

[user@gpu-linux ~]$ lspci -k
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
	Subsystem: Red Hat, Inc. Qemu virtual machine
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
	Subsystem: Red Hat, Inc. Qemu virtual machine
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
	Subsystem: Red Hat, Inc. Qemu virtual machine
	Kernel driver in use: ata_piix
	Kernel modules: pata_acpi, ata_generic
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
	Subsystem: Red Hat, Inc. Qemu virtual machine
	Kernel modules: i2c_piix4
00:02.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
	Subsystem: XenSource, Inc. Xen Platform Device
	Kernel driver in use: xen-platform-pci
00:04.0 VGA compatible controller: Device 1234:1111 (rev 02)
	Subsystem: Red Hat, Inc. Device 1100
	Kernel modules: bochs_drm
00:05.0 USB controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller (rev 10)
	Subsystem: Red Hat, Inc. QEMU Virtual Machine
	Kernel driver in use: ehci-pci
	Kernel modules: ehci_pci
00:07.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3070] (rev a1)
	Subsystem: PNY Device 136f
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_drm, nvidia
00:08.0 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
	Subsystem: PNY Device 136f
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
[user@gpu-linux ~]$

nvidia-smi will also report information about the GPU:

[user@gpu-linux ~]$ nvidia-smi
Wed Feb 16 14:37:50 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:00:07.0 Off |                  N/A |
| 31%   38C    P0    N/A / 220W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Your GPU should now work for CUDA applications!
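
The table above is awkward to read from a script. nvidia-smi also has a CSV query mode (--query-gpu with --format=csv,noheader), and the Python sketch below wraps it. The function names and the choice of the three fields are mine; the sample line in the comment is illustrative, not captured output.

```python
import csv
import io
import subprocess

QUERY_FIELDS = ["name", "driver_version", "memory.total"]

def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output into dicts."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(QUERY_FIELDS, row)) for row in rows if row]

def query_gpus():
    """Run nvidia-smi in CSV query mode and return one dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS), "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)

# Illustrative input line: "NVIDIA GeForce RTX 3070, 510.47.03, 8192 MiB"
# Typical use inside the VM:
#   for gpu in query_gpus():
#       print(gpu)
```

If the driver did not load, query_gpus() raises subprocess.CalledProcessError because nvidia-smi exits with a non-zero status.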

OPTIONAL: TEST WITH PYTORCH

These are the steps as of this date; they might change in the future. Before proceeding, MAKE SURE YOU HAVE 10GB+ IN YOUR HOME FOLDER. If not, open your VM’s settings and make sure your private storage is above 10GB as shown in the guide earlier.

Install conda, which can be found here. In this case, I will install the current latest version for python3.9:

[user@gpu-linux ~]$ wget -q https://repo.anaconda.com/miniconda/Miniconda3-py39_4.11.0-Linux-x86_64.sh 
[user@gpu-linux ~]$ chmod +x Miniconda3-py39_4.11.0-Linux-x86_64.sh 
[user@gpu-linux ~]$ ./Miniconda3-py39_4.11.0-Linux-x86_64.sh

Follow the installation like normal. Once it’s done, open a new shell and you should have a (base) [user@gpu-linux ~]$ prompt.

Now we can install pytorch from here. In this case, I will install Stable 1.10.2, using Conda, with CUDA 11.3. Note that this will install pytorch and its required files into the base environment:

(base) [user@gpu-linux ~]$ conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

Accept the installation, and wait. This will take a little bit. After pytorch is done installing, you can check it with python:

(base) [user@gpu-linux ~]$ python
Python 3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>>

torch.cuda.is_available() should return True (the Python version banner may differ). Awesome!
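
is_available() only tells you that PyTorch can see a CUDA device. A slightly stronger check is to actually run something on the GPU. Here is a small sketch that does one matrix multiplication; the function name cuda_smoke_test and the matrix size are arbitrary choices of mine.

```python
try:
    import torch
except ImportError:  # report a missing install instead of crashing
    torch = None

def cuda_smoke_test(size=1024):
    """Run one matrix multiplication on the GPU and return a small report."""
    if torch is None:
        return {"ok": False, "reason": "torch is not installed"}
    if not torch.cuda.is_available():
        return {"ok": False, "reason": "torch.cuda.is_available() returned False"}
    a = torch.randn(size, size, device="cuda")
    b = torch.randn(size, size, device="cuda")
    c = a @ b
    torch.cuda.synchronize()  # wait for the GPU to finish before reporting
    return {
        "ok": True,
        "device": torch.cuda.get_device_name(0),
        "result_shape": tuple(c.shape),
    }

print(cuda_smoke_test())
```

Run it from the (base) conda environment. If it reports ok as False, fix that first before debugging any larger application.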

OPTIONAL: Enable Coolbits

Coolbits allows more control of your hardware, such as setting fan speeds and overclocking. It lets apps like GreenWithEnvy (gwe) control these settings easily.

Note: this procedure may mess up your Xorg, in which case you will have to follow the tips at the bottom to recover. DISCLAIMER: VERY HACKY STUFF ABOUT TO FOLLOW. I hope to find a better way to get this done, and I will update when it is found.

As of the time of this writing, Coolbits requires you to manipulate your Xorg configuration. It is a mystery as to why Xorg has anything to do with overclocking and fan speeds. Let’s start by installing gwe:

(base) [user@gpu-linux ~]$ sudo dnf install gwe

Starting gwe now, you will be greeted with this message:

[screenshot: 2022-02-16_15-17]

To fix this, we essentially need to make a fake screen for the NVIDIA GPU. We will create an nvidia.conf file. Open up /etc/X11/xorg.conf.d/nvidia.conf in your favourite text editor, and paste this in:

Section "ServerLayout"
  Identifier	"Default Layout"
  Screen 0      "Screen0" 0 0 
  Screen 1      "Screen1"
  InputDevice   "qubesdev"
EndSection

Section "Screen"
# virtual monitor
    Identifier     "Screen1"
# discrete GPU nvidia
    Device         "nvidia"
# virtual monitor
    Monitor        "Monitor1"
    DefaultDepth 24
    SubSection     "Display"
       Depth 24
    EndSubSection
EndSection

Section "Monitor"
    Identifier     "Monitor1"
    VendorName     "Unknown"
    Option         "DPMS"
EndSection


Section "Device"
# discrete GPU NVIDIA
   Identifier      "nvidia"
   Driver          "nvidia"
   Option          "Coolbits" "28"
   # BusID           "PCI:0:7:0"
EndSection

Under the Device section, uncomment the BusID line and put in your GPU’s bus ID. You can get the bus ID from lspci:

(base) [user@gpu-linux ~]$ lspci | grep -i NVIDIA
00:07.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3070] (rev a1)
00:08.0 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)

In my case, my GPU is 00:07.0, which corresponds to PCI:0:7:0 in the nvidia.conf file. Yours might be different. x:y.z is PCI:x:y:z in the nvidia.conf file.
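
One catch with this mapping: lspci prints the bus, device and function numbers in hexadecimal, while the Xorg BusID is written in decimal. For 00:07.0 it makes no difference, but an address like 0a:00.0 has to become PCI:10:0:0. Here is a small Python sketch of the conversion. The function name is made up, and it assumes PCI domain 0 (an optional leading 0000: is accepted and ignored).

```python
import re

_BDF = re.compile(
    r"(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])"
)

def lspci_to_xorg_busid(address):
    """Convert an lspci address such as "00:07.0" to an Xorg BusID string.

    lspci uses hexadecimal; Xorg expects decimal, so each field is re-based.
    Assumes PCI domain 0: an optional "0000:" prefix is accepted but dropped.
    """
    match = _BDF.fullmatch(address.strip())
    if not match:
        raise ValueError(f"not an lspci address: {address!r}")
    _domain, bus, device, function = match.groups()
    return f"PCI:{int(bus, 16)}:{int(device, 16)}:{int(function, 16)}"

print(lspci_to_xorg_busid("00:07.0"))
```

A script like this could also regenerate the BusID line at boot if the address inside the VM changes between sessions.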

DO NOT RESTART XORG YET

For Coolbits to work, your Xorg must be running as root. You can check this like so:

(base) [user@gpu-linux ~]$ ps aux | grep -i xorg
root       13403  0.0  0.3  11504  6124 tty7     S+   15:41   0:00 /usr/bin/qubes-gui-runuser user /bin/sh -l -c exec /usr/bin/xinit /etc/X11/xinit/xinitrc -- /usr/libexec/Xorg :0 -nolisten tcp vt07 -wr -config xorg-qubes.conf > ~/.xsession-errors 2>&1
user       13414  0.0  0.0   4148  1336 ?        Ss   15:41   0:00 /usr/bin/xinit /etc/X11/xinit/xinitrc -- /usr/libexec/Xorg :0 -nolisten tcp vt07 -wr -config xorg-qubes.conf
user       13449  1.0  4.1 258416 81260 ?        Sl   15:41   0:00 /usr/libexec/Xorg :0 -nolisten tcp vt07 -wr -config xorg-qubes.conf

Here we can see /usr/libexec/Xorg is running as user. This is the case on Fedora at least. If you are NOT running as root, you need to edit a script file. DISCLAIMER: editing Qubes files is bad. Hopefully we can get this fixed in the future.
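
The grep output above also matches the wrapper processes, which makes it easy to misread. The sketch below looks only at processes whose executable is actually named Xorg. It expects the two-column output of ps -e -o user= -o args= and the function name xorg_users is hypothetical.

```python
def xorg_users(ps_output):
    """Return the set of users owning a process whose executable is named Xorg.

    Expects lines of the form "<user> <command line>", as printed by
    `ps -e -o user= -o args=`.
    """
    users = set()
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # parts[0] is the user, parts[1] the executable path of the process.
        if len(parts) >= 2 and parts[1].rsplit("/", 1)[-1] == "Xorg":
            users.add(parts[0])
    return users

# Typical use inside the VM:
#   import subprocess
#   out = subprocess.run(["ps", "-e", "-o", "user=", "-o", "args="],
#                        capture_output=True, text=True, check=True).stdout
#   print(xorg_users(out))
```

If the result is anything other than a set containing only root, Xorg is not running as root and the change described next applies.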

Open up the file /usr/bin/qubes-run-xorg, and at the bottom you will see this:

if qsvc guivm-gui-agent; then
    DISPLAY_XORG=:1

    # Create Xorg. Xephyr will be started using qubes-start-xephyr later.
    exec runuser -u "$DEFAULT_USER" -- /bin/sh -l -c "exec $XORG $DISPLAY_XORG -nolisten tcp vt07 -wr ->
else
    # Use sh -l here to load all session startup scripts (/etc/profile, ~/.profile
    # etc) to populate environment. This is the environment that will be used for
    # all user applications and qrexec calls.
    exec /usr/bin/qubes-gui-runuser "$DEFAULT_USER" /bin/sh -l -c "exec /usr/bin/xinit $XSESSION -- $XO>
fi

Simply add the line DEFAULT_USER="root" before this, as such:

DEFAULT_USER="root"
if qsvc guivm-gui-agent; then
    DISPLAY_XORG=:1

    # Create Xorg. Xephyr will be started using qubes-start-xephyr later.
    exec runuser -u "$DEFAULT_USER" -- /bin/sh -l -c "exec $XORG $DISPLAY_XORG -nolisten tcp vt07 -wr ->
else
    # Use sh -l here to load all session startup scripts (/etc/profile, ~/.profile
    # etc) to populate environment. This is the environment that will be used for
    # all user applications and qrexec calls.
    exec /usr/bin/qubes-gui-runuser "$DEFAULT_USER" /bin/sh -l -c "exec /usr/bin/xinit $XSESSION -- $XO>
fi

Now you can restart your VM/Xorg. nvidia-smi will now show that Xorg is running on the GPU (in a headless state, though):

(base) [user@gpu-linux ~]$ nvidia-smi
Wed Feb 16 15:53:51 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:00:07.0  On |                  N/A |
| 30%   34C    P8    10W / 220W |     23MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     13901      G   /usr/libexec/Xorg                  21MiB |
+-----------------------------------------------------------------------------+

gwe will now work as well.

Congratulations! You have everything set up.

BROKEN XORG TIPS, GUI NOT WORKING FIXES ETC

When applications won’t start, you can access a terminal inside of the vm from dom0:
qvm-console-dispvm gpu-linux. This will launch a terminal:

[screenshot: 2022-02-16_16-04]

It is likely there’ll be a bunch of text. Don’t worry about it. Just type in user and press enter. It will login and you now have a terminal to fix Xorg.

Some tips for fixing Xorg problems:

qubes-gui-agent controls Xorg, and it uses the config at /etc/X11/xorg-qubes.conf to do so. While editing your Xorg files, you can restart qubes-gui-agent with sudo systemctl restart qubes-gui-agent and it will restart Xorg for you, so you don’t need to restart the VM.
Xorg checks the directories /usr/share/X11/xorg.conf.d and /etc/X11/xorg.conf.d after reading the xorg-qubes.conf file. Make sure they contain no other files that might conflict with Qubes.
The Xorg log files can be found at /home/user/.xsession-errors and /home/user/.local/share/xorg/Xorg.0.log. If your Xorg.0.log contains “Operation not permitted” etc., you most likely have to run Xorg as root; see the Coolbits section.
While fiddling with my setup to get this working, I had to restart my computer a few times because the GPU occasionally gets put into an unrecoverable state. If you are scratching your head as to why something isn’t working, try restarting your computer.
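
To see quickly which extra config files Xorg will pick up from those two directories, something like the following Python sketch can help. It lists every .conf file and flags the ones that mention nvidia. The function name and the choice of "nvidia" as the search word are my own; it does not decide for you whether a file actually conflicts with Qubes.

```python
from pathlib import Path

XORG_CONF_DIRS = ["/usr/share/X11/xorg.conf.d", "/etc/X11/xorg.conf.d"]

def find_conf_files(dirs=XORG_CONF_DIRS, needle="nvidia"):
    """List every *.conf file in `dirs` as (path, mentions_needle) pairs."""
    found = []
    for directory in dirs:
        # A missing directory simply yields no files.
        for path in sorted(Path(directory).glob("*.conf")):
            text = path.read_text(errors="replace")
            found.append((str(path), needle.lower() in text.lower()))
    return found

for conf_path, mentions_nvidia in find_conf_files():
    marker = "  <-- mentions nvidia" if mentions_nvidia else ""
    print(conf_path + marker)
```

You can run this from the qvm-console-dispvm terminal when the GUI is down, since it needs nothing beyond the Python standard library.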

ddevz · May 31, 2022, 10:08pm

The instructions appear to have a problem: you are using the kernel from dom0, but then akmod will end up compiling against the headers of the kernel version installed in the VM, and complaining.

Did using the dom0 kernel, but compiling against the headers of the kernel version installed in the fedora VM actually work for you, or did you forget to document a step?

hind0 · June 19, 2022, 9:48pm

The drivers don’t seem to install correctly following these steps.

modinfo -F version nvidia results in “modinfo: ERROR: Module nvidia not found.” even after waiting 15 minutes.
Then to be sure it doesn’t work, after vm restart, executing:
nvidia-smi results in: “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.”

Is it something to do with the VM kernel? I tried using the default as shown in this guide, and also tried again from scratch by specifying ‘none’, but neither worked. Are there some additional steps I’m missing? There wasn’t much info I could find on GPU passthrough with Qubes to a Linux VM, especially using NVIDIA, so I hope someone here can help.

If it helps:

I’m trying with qubes 4.1 (beta version I think), target vm is fedora-34 and gpu is 1660ti

hind0 · June 20, 2022, 3:37pm

Could you share the corrected solution for this problem? According to qubes docs Managing qube kernels | Qubes OS
my understanding is that if I specify the vm’s settings to use HVM and “none” for kernel, it should use the “vm kernel” default.
I don’t yet understand, what exactly is the difference between “vm kernel” default vs if I just left the settings untouched and used the normal default?

AFAIK, my steps outlined above should work for installing the NVIDIA module, is there something I’m missing?
Reading this forum post:
https://www.nvidia.com/en-us/geforce/forums/game-ready-drivers/13/178157/unable-to-load-the-kernel-module-nvidiako/
I wonder if it’s because the nouveau driver already appears to be installed (shown in the VM when using lspci -k).

Any help appreciated.

hind0 · June 20, 2022, 4:30pm

AFAICT the nouveau drivers shouldn’t be causing the problem, at least in this context for the method of installing the NVIDIA drivers on fedora (from what I can tell reading online like this How to Install NVIDIA Drivers on Fedora 36 Linux - LinuxCapable)

ddevz · June 21, 2022, 3:20pm

My “solution” was to use “none” for the kernel, then post the above to try to find out if there was a different way I was supposed to do it. His instructions show kernel “5.10.96-1.fc32” being used, which I believe means it’s using a kernel that’s installed in the Qubes dom0. If that is the case, and his screenshot is of a working system, then he must have taken more steps to get the headers into the VM to compile against, but I do not know what those steps are.

Also, of interest, you need to look up the card that you have and find out which nvidia driver you need, as there are separate drivers for separate “generations” of cards, meaning that you’d type something else for the package name in:

sudo dnf install akmod-nvidia

(it would still have the akmod string in the package name)

I got the driver to load successfully, but I still don’t have working CUDA.

Also, it’s unlikely, but people looking at this page might also be interested in OpenCL: Success Report: Passthrough GPU used by OpenCL success (kinda)

hind0 · June 21, 2022, 9:06pm

Which fedora version are you using? I still haven’t got the module to install for some reason, tried many other things without success. The only thing I could find which solved the problem for others is disabling UEFI secure boot. But I’m guessing this doesn’t apply to the VM I have, although I’m not sure how to check.

For whatever reason, the kmod refuses to build and install.

Why is cuda not working for you?

ddevz · June 21, 2022, 9:12pm

nouveau is the free version? I believe I blacklisted nouveau in /etc/mod{something} so the nvidia driver could load.

I don’t know unfortunately.

hind0 · June 22, 2022, 6:53pm

I tried that as well but it didn’t seem to solve it. I decided to try debian 10, and good news is it seems to be working, including cuda.

I have 1660ti passthrough to a debian 10 vm with driver version 470.129.06, and cuda version 11.4 according to nvidia-smi.

I tested it with blender and it seems to detect the gpu and cuda properly.

My steps to get this were pretty straightforward and just followed existing documentation from both Qubes and Nvidia:

1. Preliminary steps like checking IOMMU groups (I didn’t since I’d already had success with the same system on a Windows 10 HVM), hiding the GPU PCI devices, and using some other GPU for dom0/GUI (I use the CPU’s iGPU for now).
2. Create a new standalone qube based on debian 10 and change the settings: virtualization=HVM, kernel=none, disable memory balancing, set initial RAM to something reasonable, and increase private storage to 10GB.
3. Pass through the GPU PCI devices as described in OP’s post.
4. Start the VM, run sudo apt update, and restart if the kernel updates.
5. Install the Nvidia driver as described here: How to Install Nvidia Drivers on Debian
5a. It might say a conflict exists due to the nouveau driver currently being loaded. Fix this by restarting (I didn’t need to manually blacklist nouveau; it seemed to work automatically after restarting the VM).
5b. Running nvidia-smi then failed with "unable to determine the device handle for GPU (pci id): Unknown Error". I think I fixed this while installing cuda, maybe because that updated the driver.
6. Execute sudo apt install software-properties-common so we can use the add-apt-repository command in the next steps.
7. Install cuda. I first tried aws - Installing CUDA on Debian Machine 10.3 - Unix & Linux Stack Exchange but it failed, so then I tried the steps outlined here: Installation Guide Linux :: CUDA Toolkit Documentation and it worked. I think the most notable part of the latter, and the thing that made it work, is installing the cuda-keyring package.
7a. Similar to 5a, except this time the older driver is loaded since it updated to a new driver. Restart the VM.
8. Done. nvidia-smi should now work and also show cuda as installed correctly. lspci -k should also show that the nvidia driver is being used instead of nouveau.

Unlike the Windows 10 one I did a while back which displayed to a separate monitor, this vm runs in seamless mode. I haven’t looked into what causes that difference or how to configure it different but just thought it was worth mentioning.

hind0 · June 23, 2022, 1:07am

Unfortunately I haven’t been able to make it useful yet; it seems very, very slow. I think it is because it’s falling back to using llvmpipe. I’ve tried fixing that but with no success so far. I tried many things, including messing around with Xorg. I’m a bit lost and will probably have to learn more about Xorg. So far, some things I noticed:

Including an Xorg.conf stopped the vm windows from being displayed at all.
Xorg will not stay running unless you modify qubes-run-xorg as described by op.
If Xorg stays running, all the GPU seems to output to the displays is a blank white screen.
Looking at the Xorg.0.log, the only thing I could notice is (II) modeset(GO): Refusing to try glamor on llvmpipe
and right below it (EE) modeset (GO): glamor initialization failed

ddevz · June 23, 2022, 2:10pm

Wait, are you trying to use a monitor plugged into the card, or trying to use the card for computations while still displaying to the normal “seamless” qubes screen?

If it’s the first then maybe you have to turn on debugging mode (in the qubes settings) to disable the attempt to display to the seamless screen?

hind0 · June 23, 2022, 3:25pm

I want to use it for rendering work in Blender, so I would actually prefer seamless, but I will be happy with either at this point. Thanks for the tip; I hadn’t considered that.

ddevz · June 23, 2022, 9:39pm

Sure thing. Say, I wonder if we need to find out what library/package they are using for openGL emulation and disable that?
(or maybe make a copy of the VM and try removing the package)

hind0 · June 23, 2022, 9:57pm

Still get white screen setting it to debug mode.

Just to clarify, I assume the GPU is not being used by some/all software in the VM, since everything performs badly and, for example, inxi -Gx shows OpenGL is using llvmpipe. From what I’ve read online, this could be because the GPU needs to be specified in the Xorg configuration. But my attempts at solving it have so far been unsuccessful, and I’m not sure if I’m looking in the right places. I’ve tried many things with Xorg but nothing has clicked yet.

I noticed the BusID of the GPU’s VGA PCI device changes between VM sessions, which was causing Xorg to not start, so I added a script to update it at each boot (a modified solution I found online).

But now that is fixed, I can see Xorg is running and recognizes my GPU, yet everything else is still broken.

In the xorg logs, I found right under where nvidia module is being loaded: (WW) Falling back to old probe method for dummyqbs, is that normal or could give hint that nvidia module isn’t working correctly?

Assuming it is working fine and then xorg is not my problem – I wonder then how to fix it?

Hoping someone can share insights to a solution.

dmm · August 23, 2022, 8:41am

Thanks for posting these instructions!

I additionally had to make these changes to get my headless gpu qube working:

set the kernel to “none”
blacklist nouveau

$ cat /etc/modprobe.d/nouveau.conf 
blacklist nouveau
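
To confirm that the blacklist entry is really in place (and not, say, commented out), a check along these lines can be used. It is a minimal Python sketch with made-up function names. It only reads the files in /etc/modprobe.d and does not tell you whether nouveau is currently loaded.

```python
from pathlib import Path

def is_blacklisted(module, conf_texts):
    """Return True if any config text contains an active `blacklist <module>` line."""
    for text in conf_texts:
        for line in text.splitlines():
            # Drop comments, then compare the first two words of the line.
            words = line.split("#", 1)[0].split()
            if words[:2] == ["blacklist", module]:
                return True
    return False

def read_modprobe_confs(directory="/etc/modprobe.d"):
    """Read every *.conf file in the modprobe configuration directory."""
    return [p.read_text(errors="replace") for p in sorted(Path(directory).glob("*.conf"))]

print("nouveau blacklisted:", is_blacklisted("nouveau", read_modprobe_confs()))
```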

It seems to be working but it won’t boot with more than 3gb of ram, despite patching the stubdom rootfs file.

Is anyone else using more than 3gb of ram?

Is there an easy way to verify that the "max-ram-below-4g" argument is being passed, or another way to try to debug this?

Thanks!

Fahrstuhl · February 17, 2023, 5:33pm

I was having problems with akmod-nvidia not building against the Qubes kernel. Using a VM kernel fixed the akmod build but led to endless problems with X11 which I couldn’t fully solve.

So instead I installed the official Nvidia drivers and it worked! Here’s what I did:

sudo dnf install dkms because the installer uses dkms instead of akmod
download the official Linux driver from the Nvidia site and make the resulting .run file executable
start up a terminal VM for my CUDA VM through the qube-manager
stop the Qubes X11 integration with sudo systemctl stop qubes-gui-agent because otherwise the Nvidia installer complains about X running
run the installer as root and not let Nvidia configure X
restart the qubes-gui-agent

With this, I have a VM able to run CUDA and I can either use the Qubes GUI or start up a second X server on the card if needed.

slayoo · December 8, 2023, 12:25am

First, thanks for the above write up and all follow-up hints.

I’m trying to access a built-in nVidia GPU on my Dell Latitude 3520 for use with CUDA.

Following the above steps, I got to the point where the Standalone Fedora VM sees the GPU with lspci and the installed kernel module seems to load successfully, however when running sudo nvidia-smi there is just “No devices were found” printed, while dmesg output includes:

[   72.281766] NVRM: GPU 0000:00:06.0: Failed to copy vbios to system memory.
[   72.281970] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x30:0xffff:976)
[   72.282820] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0
[   72.498475] NVRM: GPU 0000:00:06.0: Failed to copy vbios to system memory.
[   72.498651] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x30:0xffff:976)
[   72.499189] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0
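
When posting logs like this, it can help to reduce them to the distinct NVRM messages first, since the driver tends to repeat the same block on every probe attempt. A small Python sketch follows; the function name is hypothetical, and it assumes the default dmesg format with a bracketed timestamp at the start of each line.

```python
import re

NVRM_LINE = re.compile(r"^\[\s*(?P<time>\d+\.\d+)\]\s+NVRM:\s+(?P<msg>.*)$")

def nvrm_messages(dmesg_text):
    """Return the distinct NVRM messages in order of first appearance."""
    seen = []
    for line in dmesg_text.splitlines():
        match = NVRM_LINE.match(line)
        if match and match.group("msg") not in seen:
            seen.append(match.group("msg"))
    return seen

# Typical use inside the VM (dmesg may need root):
#   import subprocess
#   out = subprocess.run(["sudo", "dmesg"], capture_output=True, text=True, check=True).stdout
#   print("\n".join(nvrm_messages(out)))
```

For the six lines above this leaves three messages, the first of which (the vbios copy failure) is the one to search for.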

There seem to be quite some discussions at nVidia forums mentioning the above messages (e.g., NVRM: failed to copy vbios to system memory. - #19 by generix - Linux - NVIDIA Developer Forums)… but it’s hard to distill it into any meaningful hint for me.

I’ve tried with two different kernels: 5.15 and 6.5.8 - same behavior.

I haven’t done anything with the IOMMU groups - if this could be the reason, please could you elaborate on what needs to be achieved there and how?

lspci -k gives:

[user@gpu-fedora ~]$ lspci -k
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
	Subsystem: Red Hat, Inc. Qemu virtual machine
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
	Subsystem: Red Hat, Inc. Qemu virtual machine
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
	Subsystem: Red Hat, Inc. Qemu virtual machine
	Kernel driver in use: ata_piix
	Kernel modules: pata_acpi, ata_generic
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
	Subsystem: Red Hat, Inc. Qemu virtual machine
	Kernel modules: i2c_piix4
00:02.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
	Subsystem: XenSource, Inc. Xen Platform Device
	Kernel driver in use: xen-platform-pci
00:04.0 VGA compatible controller: Device 1234:1111 (rev 02)
	Subsystem: Red Hat, Inc. Device 1100
	Kernel modules: bochs
00:05.0 USB controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller (rev 10)
	Subsystem: Red Hat, Inc. QEMU Virtual Machine
	Kernel driver in use: ehci-pci
	Kernel modules: ehci_pci
00:07.0 3D controller: NVIDIA Corporation GP107M [GeForce MX350] (rev a1)
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_drm, nvidia

If anyone would have any clues, please post. Thanks!

JulianeSchweizer · December 16, 2023, 11:28am

I was finally able to get it to work on my laptop. I had to blacklist nouveau, add the options below to my GRUB config, and restart the whole laptop several times:

swiotlb=65536 rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1 pcie_aspm=off rcutree.rcu_idle_gp_delay=5

But the problem is that if I set RAM to anything more than 2GB, I get:

[    6.155057] nvidia 0000:00:07.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=io+mem
[    6.155120] NVRM: The NVIDIA GPU 0000:00:07.0
               NVRM: (PCI ID: 10de:249d) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[    6.155967] nvidia: probe of 0000:00:07.0 failed with error -1
[    6.156023] NVRM: The NVIDIA probe routine failed for 1 device(s).
[    6.156038] NVRM: None of the NVIDIA devices were initialized.
[    6.156333] nvidia-nvlink: Unregistered Nvlink Core, major device number 238
[    7.532805] systemd-journald[623]: /var/log/journal/810e858e12024bb1a49e0b6fa0fba4fc/user-1000.journal: Monotonic clock jumped backwards relative to last journal entry, rotating.
[    8.053502] nouveau 0000:00:07.0: unknown chipset (ffffffff)
[    9.604408] nvidia-nvlink: Nvlink Core is being initialized, major device number 236

I have already patched stubdom-linux-rootfs.gz to be able to add more than 3.5GB RAM, and that is working, because without this patch the HVM itself wouldn’t start.

I’ve attached the output of dmesg and lspci -kvvv with 32GB memory passed to the HVM. If anyone knows any workaround, please share.

dmesg lspci

tempmail · December 16, 2023, 4:05pm

How exactly (with 3.5GB or 2GB max-ram-below-4g), and did you try to patch only xen.xml instead (again with 2GB, and not 3.5GB)?

applejack · January 21, 2024, 12:45am

Now, after the latest Qubes dom0 updates, my gpu_3n5G_LLM standalone boots to “No Bootable device” when the RAM is greater than 3000 MB. I re-applied the patch to the stubdom but I get the same error. Any thoughts will be appreciated. This is with a debian 11 template I migrated to debian 12. When I make a new GPU qube by cloning the up-to-date debian-12 template into a standalone, I get a “failed to start unknown pci header type 127” error message when I raise the RAM above 3000 MB.