Secure AI Inference with Qubes OS: A GPU Passthrough & Ollama Guide

This guide was made using AI. The model used was qwen3.5:27B.

This guide will go through the steps needed to set up a qube for running GPU accelerated AI inference, and how to integrate AI in multiple AppVMs using a single GPU.

Hide the GPU from dom0

PCI devices cannot be assigned to multiple VMs simultaneously. You will crash the system if you attempt to pass a GPU to any VM while it remains attached to dom0. To prevent this, you must hide the GPU from dom0 before assigning it.

In dom0, use lspci -nn | grep -E 'VGA|Audio' to find the ID of the GPU. If an audio device is connected to the same GPU (common with NVIDIA and AMD cards), you must hide both devices.

01:00.0 VGA compatible controller: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti 16GB] (rev a1)  
01:00.1 Audio device: NVIDIA Corporation Device 22bd (rev a1)

In dom0, edit /etc/default/grub and add the IDs to the GRUB_CMDLINE_LINUX variable:

GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX rd.qubes.hide_pci=01:00.0,01:00.1"
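The comma-separated value for rd.qubes.hide_pci can also be assembled from the lspci output instead of being typed by hand. A minimal sketch, assuming the GPU and its audio function share the slot prefix 01:00 as in the example output; hide_pci_value is a hypothetical helper name, not a Qubes tool:

```shell
# Hypothetical helper: read lspci lines on stdin and print every device
# address that starts with the given slot prefix, joined by commas.
hide_pci_value() {
    awk -v slot="$1" 'index($1, slot) == 1 { print $1 }' | paste -sd, -
}

# Usage in dom0 (assumes the GPU sits at slot 01:00):
#   lspci -nn | hide_pci_value 01:00
```

Filtering by slot rather than by the words VGA or Audio avoids picking up an unrelated onboard audio controller, which may also appear in the grep output.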

In dom0, rebuild grub.cfg, and reboot the system.

sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

After the reboot, use lspci -vvv to verify that the kernel driver in use for the GPU is pciback, which means the device is hidden from dom0.

Kernel driver in use: pciback
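The check can be run for both functions of the card at once. A minimal sketch, assuming the device addresses from the example above; uses_pciback is a hypothetical helper that only inspects the text lspci prints:

```shell
# Hypothetical helper: succeed only if the lspci output on stdin
# reports pciback as the kernel driver in use.
uses_pciback() {
    grep -q 'Kernel driver in use: pciback'
}

# Usage in dom0 (addresses taken from the example output):
#   for dev in 01:00.0 01:00.1; do
#       lspci -k -s "$dev" | uses_pciback && echo "$dev hidden" || echo "$dev NOT hidden"
#   done
```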

Don’t proceed unless the device is hidden. You will crash the system if you attempt to pass the device through while it’s still attached to dom0.

Create a minimal template

You can skip this step if you already have a template that can be used to create a standalone HVM.

In dom0, install debian-13-minimal.

qvm-template install debian-13-minimal

In dom0, clone the template.

qvm-clone debian-13-minimal debian-ai-hvm

In dom0, open a terminal with root privileges in the template.

qvm-run --user root debian-ai-hvm xterm

In the template, make the following changes.

# Update and upgrade system packages
apt update && apt upgrade -y 

# Install Qubes OS agent packages
apt install qubes-core-agent-passwordless-root \
qubes-core-agent-networking \
qubes-kernel-vm-support

# Install local kernel
apt install grub2 linux-image-amd64 linux-headers-amd64 

# Install system utilities
apt install vim curl pciutils 

# Configure locales
dpkg-reconfigure locales 

# Shut down the template
poweroff

Create an Ollama HVM

The Ollama HVM is the standalone VM running Ollama, and it will be where the GPU is attached. Making this VM standalone makes it easier to install the DKMS device drivers required by the GPU.

In dom0, create the AppVM

qvm-create --standalone --class AppVM --template debian-ai-hvm \
--label green \
--property maxmem=0 \
--property memory=16000 \
--property vcpus=8 \
ai-ollama

Adjust vcpus and memory as needed.

In dom0, open the terminal in the AppVM

qvm-run ai-ollama xterm

In the AppVM, make the following changes, then shut down the VM.

# Configure GRUB to allow the local kernel to boot
sudo grub-install /dev/xvda
sudo update-grub

# Shut down the AppVM
sudo poweroff

In dom0, change the AppVM to HVM with a local kernel.

qvm-prefs --set ai-ollama virt_mode HVM
qvm-prefs --set ai-ollama kernel ""

Resize disk space

You need to increase the size of the private volume to make room for models.

In the qube settings, change the max private storage space to 100 GB. Less will do, but keep in mind that many models are in the 20 GB range.

You need to increase the timeout value before you start the AppVM, or it’s very likely to crash when you boot it.

In dom0, increase the AppVM qrexec_timeout value

qvm-prefs --set ai-ollama qrexec_timeout 300

Boot the AppVM (expect it to take longer than normal) and verify that the private volume has grown.

Shut down the AppVM, and restore the original timeout value.

qvm-prefs --set ai-ollama qrexec_timeout 60
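Before the driver sections, the hidden GPU still has to be attached to ai-ollama. Replies in this thread mention doing that in the qube's Settings > Devices tab and attaching both the VGA and the audio function. A command-line sketch of the same step, assuming the Qubes 4.2 device syntax dom0:BB_DD.F (if qvm-pci list prints a different identifier on your release, use that instead); to_qubes_dev is a hypothetical helper:

```shell
# Hypothetical helper: turn an lspci address (01:00.0) into the
# dom0:01_00.0 form used by qvm-pci on Qubes 4.2.
to_qubes_dev() {
    printf 'dom0:%s\n' "$(printf '%s' "$1" | tr ':' '_')"
}

# Usage in dom0, with ai-ollama shut down:
#   qvm-pci attach --persistent ai-ollama "$(to_qubes_dev 01:00.0)"
#   qvm-pci attach --persistent ai-ollama "$(to_qubes_dev 01:00.1)"
#   qvm-pci list    # lists PCI devices and the qubes using them
```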

Install NVIDIA drivers

You can skip this section if you do not have an NVIDIA GPU.

In dom0, open the terminal in the ollama VM

qvm-run ai-ollama xterm

Use lspci to verify that the GPU has been passed through to the VM.

In the Ollama VM, you may need to enable legacy cryptographic policies to use the NVIDIA drivers. Create the sequoia.config file:

sudo mkdir -p /etc/crypto-policies/back-ends
sudo vim /etc/crypto-policies/back-ends/sequoia.config

Add the following content to sequoia.config

[hash_algorithms]
sha1 = "always"

[asymmetric_algorithms]
rsa1024 = "always"

In the Ollama VM, install the NVIDIA CUDA drivers and shut down the VM

# Download and install the nvidia repository keys
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/debian13/x86_64/cuda-keyring_1.1-1_all.deb -o cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update

# Install the nvidia drivers
sudo apt install nvidia-kernel-dkms cuda-drivers

# Shut down
sudo poweroff

Start the VM and verify that the GPU is working with nvidia-smi.

Install AMD drivers

You can skip this section if you do not have an AMD GPU.

In dom0, open the terminal in the ollama VM

qvm-run ai-ollama xterm

Use lspci to verify that the GPU has been passed through to the VM.

In the Ollama VM, install the amdgpu-install, amdgpu-dkms, and vulkan packages.

# Download and install the amdgpu-install package
curl -fsSL https://repo.radeon.com/amdgpu-install/7.2/ubuntu/noble/amdgpu-install_7.2.70200-1_all.deb -o amdgpu-install_7.2.70200-1_all.deb
sudo apt install ./amdgpu-install_7.2.70200-1_all.deb
sudo apt update

# Install amdgpu-dkms
sudo apt install amdgpu-dkms

# Download and install the libdisplay-info1 package from Ubuntu
curl -fsSL http://launchpadlibrarian.net/725825436/libdisplay-info1_0.1.1-2build1_amd64.deb -o libdisplay-info1_0.1.1-2build1_amd64.deb
sudo apt install ./libdisplay-info1_0.1.1-2build1_amd64.deb

# Install amd vulkan driver
sudo amdgpu-install --vulkan=amdvlk,pro

# Shut down
sudo poweroff

Note: The Vulkan driver requires libdisplay-info1, which is not available in the Debian 13 repository. Using the Ubuntu version (linked below) works around this issue:
https://launchpad.net/ubuntu/+source/libdisplay-info/0.1.1-2build1/+build/28132034

Install Ollama

In dom0, open the terminal in the ollama VM

qvm-run ai-ollama xterm

In the Ollama VM, run the ollama install script

curl -fsSL https://ollama.com/install.sh | sh

For AMD users
If you are using an AMD GPU, you must enable Vulkan support. Open /etc/systemd/system/ollama.service with a text editor and add this line inside the [Service] section:

Environment="OLLAMA_VULKAN=1"

Restart the Ollama service (or the VM) afterwards; until then Ollama keeps using the CPU instead of the GPU.
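The Environment="OLLAMA_VULKAN=1" setting can also be kept in a systemd drop-in instead of the unit file itself, so it survives if the unit file is ever regenerated. A minimal sketch; the file name vulkan.conf and the helper write_vulkan_dropin are arbitrary choices for illustration:

```shell
# Hypothetical helper: write a drop-in that sets OLLAMA_VULKAN=1.
# The target directory is an argument so the function can be tried
# on a scratch directory first.
write_vulkan_dropin() {
    dir="${1:-/etc/systemd/system/ollama.service.d}"
    mkdir -p "$dir" &&
        printf '[Service]\nEnvironment="OLLAMA_VULKAN=1"\n' > "$dir/vulkan.conf"
}

# Usage in the Ollama VM, from a root shell:
#   write_vulkan_dropin
#   systemctl daemon-reload
#   systemctl restart ollama
```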

You can verify that Ollama is running by executing the following command

ollama run qwen3.5:4b

You can open a second terminal in the Ollama VM and use the command ollama ps to verify Ollama is using the GPU.

Set up AppVM AI integration

In this example I’ll assume you have a VM called ai-browser with Brave installed, but it could be any VM with an application that can integrate with AI.

In dom0, create the file /etc/qubes/policy.d/30-user-networking.policy
with the following content

qubes.ConnectTCP +11434 ai-browser @default allow target=ai-ollama

This policy allows ai-browser to access ai-ollama. You will need to add a line like this for each additional VM that should be allowed to use the Ollama API.
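When several qubes need access, the policy lines can be generated rather than typed. A minimal sketch, assuming the same port and target as above; ollama_policy_lines is a hypothetical helper and ai-notes is a made-up second client qube:

```shell
# Hypothetical helper: print one qubes.ConnectTCP policy line per client
# qube named on the command line, all targeting ai-ollama on port 11434.
ollama_policy_lines() {
    for vm in "$@"; do
        printf 'qubes.ConnectTCP +11434 %s @default allow target=ai-ollama\n' "$vm"
    done
}

# Usage in dom0 (overwrites the policy file with the generated lines):
#   ollama_policy_lines ai-browser ai-notes | sudo tee /etc/qubes/policy.d/30-user-networking.policy
```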

In the VM ai-browser, configure qrexec to bind port 11434 from ai-ollama to localhost on ai-browser.

Create the file /rw/config/ollama.socket and add the following content:

[Unit]
Description=Ollama socket

[Socket]
ListenStream=127.0.0.1:11434
Accept=true

[Install]
WantedBy=sockets.target

Create the file /rw/config/ollama@.service and add the following content:

[Unit]
Description=Ollama service

[Service]
ExecStart=qrexec-client-vm '' qubes.ConnectTCP+11434
StandardInput=socket
StandardOutput=inherit
Restart=always
RestartSec=3

Edit the file /rw/config/rc.local and add the following content:

cp -r /rw/config/ollama* /lib/systemd/system/
systemctl daemon-reload
systemctl start ollama.socket

The changes in /rw/config need to be repeated in all VMs that are allowed to access the Ollama API.

Restart the VM ai-browser.
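Before configuring an application, the tunnel can be checked from a terminal in ai-browser. A minimal sketch, assuming curl is installed in that qube and relying on Ollama's /api/version endpoint; ollama_reachable is a hypothetical helper:

```shell
# Hypothetical helper: succeed if an Ollama API answers at the given
# base URL (defaults to the local end of the qrexec tunnel).
ollama_reachable() {
    curl -fsS --max-time 5 "${1:-http://127.0.0.1:11434}/api/version" >/dev/null 2>&1
}

# Usage in ai-browser:
#   ollama_reachable && echo "Ollama API reachable" || echo "no answer on port 11434"
```

If the check fails, the qrexec policy in dom0 and the state of ollama.socket are the first things to look at.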

Integrate the Ollama API into Brave Leo to verify everything is working.
In the VM ai-browser, open Brave settings, open the Leo settings, and change the settings to use Ollama

Label: qwen3.5:4b
Model: qwen3.5:4b
Endpoint: http://localhost:11434/v1/chat/completions
Default model: qwen3.5:4b

You should now be able to use Leo with the Ollama API as the AI back-end.

Thanks for the clear guide documenting an intrinsically elaborate process.

For extra security points, and a privacy bonus, it would be cool if ai-ollama and ai-browser could be constructed without direct (netvm) or indirect (updates proxy) WAN exposure, with package downloads done in templates or in a helper VM that funnels to the always-offline LLM VMs. I feel that’s ultimately the model we’re heading toward in every software stack with a complex supply chain: zero trust, with walls to protect against compromise and phone-home. “Secure, private, offline AI inference with Qubes OS”.

You can use qrexec with offline VMs. When I’ve set up everything, I remove the netVM from the Ollama VM.

You don’t need Ollama to download the models. You can download the models in GGUF format, copy the file to the Ollama VM, and manually add the model to Ollama.
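The manual import can be sketched as follows, assuming the GGUF file was downloaded in a networked qube and copied over with qvm-copy. The file name model.gguf, the model name my-model, and the helper write_modelfile are placeholders:

```shell
# Hypothetical helper: write a minimal Modelfile that points Ollama at a
# local GGUF file. The second argument is the Modelfile path.
write_modelfile() {
    printf 'FROM %s\n' "$1" > "${2:-Modelfile}"
}

# Usage:
#   (networked qube)  qvm-copy model.gguf     # choose ai-ollama in the dom0 prompt
#   (Ollama VM)       cd ~/QubesIncoming/<source-qube>
#                     write_modelfile ./model.gguf
#                     ollama create my-model -f Modelfile
#                     ollama run my-model
```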

If you really want to, you could probably set up everything offline. I’m fine with just running the AI offline.

After running this command and rebooting, the qube gets started (a notification is shown), but qvm-run ai-llama xterm never gives me a terminal in the qube, so it seems to hang.

Looking at the console it seems to boot and waits for log-in:

Setting up swapspace version 1, size = 1073737728 bytes
UUID=83ede16a-d133-4e55-9614-724a5a159452
/dev/xvda3: clean, 62376/1299984 files, 855031/5190656 blocks
[    1.758685] piix4_smbus 0000:00:01.3: SMBus Host Controller not enabled!
[    2.176549] nouveau 0000:00:06.0: unknown chipset (1b5000a1)
[    2.558544] 
[    4.309178] [drm:nv_drm_dev_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000006] Failed to allocate NvKmsKapiDevice

Debian GNU/Linux 13 ai-llama hvc0

ai-llama login: ^[[43;1R^[[43;74R^[[43;74R^[[43;74R

This is the Xen state of the machine.

sudo xl list | grep ai-llama
ai-llama                                    29 31984     8     r-----     593.0
ai-llama-dm                                 30   144     1     -b----      18.4

Any tips to get this working?

Kind regards, Bloged

Have you tried searching the internet for the error message Failed to allocate NvKmsKapiDevice?

I searched for it before but couldn’t get it to work, so I tried to get the console working again. After this comment I looked into it further: you should use nvidia-open instead of the install command from the guide. You will then be missing the cuda-toolkit, which you should also install.
sudo apt install nvidia-open cuda-toolkit

That fixes the non-responsive issue!

Regards, Bloged

Thank you very much for this detailed guide. I thought I would hit some big issues, but the guide was very clear and all the errors were mine. Once again, I am doing things in Qubes that I have no ability to, thanks to @renehoj :)

Everything is working, although there seems to be a big difference between using Leo and typing directly into the ai-llama terminal. Leo shows a spinning wheel icon for a long time (at first I thought it was broken), and there doesn’t seem to be an easy way to stop the search other than closing the Brave browser. The path to the results seems very different: the terminal produces a stream of consciousness, while Leo spins and spins until it thinks it has reached a good answer. If I ask it something difficult, it just never answers.

The ai-llama terminal is pretty fast and immediate. Lots of fun figuring out where its expertise lies.

A couple of notes on my experience using the guide.

  • Regenerating the GRUB configuration needed a different path (note the 2 in grub2):

sudo grub2-mkconfig -o /boot/grub2/grub.cfg

  • Was I supposed to attach the GPU to ai-llama in Settings > Devices (I did), or perhaps on the CLI? I didn’t see that step in the instructions.

  • I needed to restart ai-llama after adding the VULKAN environment line. Until then, ai-llama was using the CPU, not the GPU.

A couple of usage notes, perhaps helpful to Ollama newbies (I can remove them if you want):

I use the chat directly; Brave Leo is slower.

You can pipe text files to Ollama with cat (for summaries, etc.).

Ollama replies that it can only remember the current chat, but if you save the model it will remember your preferences, such as requested aliases and shortcuts.

You can add flags to the run command, for example --think=false.

It’s quite handy for CLI usage lookups or when you can’t remember an exact command. It’s actually better than a Google search because it remembers what OS you use and other customizations.

If you just want to use the LLM as an AI chatbot, then install open-webui on top of Ollama.

Alternatively, you can replace Ollama with LM Studio, it comes with a GUI out-of-the-box and it can function as an API server.

I have switched from Ollama to using LM Studio, and then I have a second VM running Hermes Agent with open-webui as the front-end.

Hi, I’ve been following the guide up to the installation of an LLM in Ollama. When I try to test model inference using the AMD GPU, Ollama gets stuck in a state where any letter I type is automatically deleted, and after a certain time the process aborts. In other words, it can’t perform model inference on the GPU. While it’s stuck in that state, I open another xterm and run ollama ps, which shows that the model is loaded on the GPU, but as I said, I can’t type anything (IMAGE 01).

I’ve done some testing and noticed that when I stop Ollama with sudo systemctl stop ollama, restart it with ollama serve, and run the model, it does let me type and it responds, but it only uses the CPU. This is what ollama ps shows (IMAGE 02). IMAGE 02 also shows Ollama indicating that I need to set the variable OLLAMA_VULKAN=1. Following the guide, I added Environment=OLLAMA_VULKAN=1 to the file /etc/systemd/system/ollama.service, but the issue is the same.

By the way, I had to pass the GPU’s VGA and audio devices to the qube, and that’s not mentioned in the guide (IMAGE 03). Finally, I’d like to thank you on behalf of all the AI users on Qubes.

The environment variable in ollama.service is only used when you start Ollama as a service. If you are running Ollama from the terminal, you need to run export OLLAMA_VULKAN=1 before starting Ollama.

It’s hard to say why it’s not working, but try exporting the variable and starting Ollama from the terminal; the output might give you a clue about what is happening. The system logs might also have some useful information.

I have an RX 6600 (same family as the 6800) and don’t have this issue. Have you tried querying Ollama in other ways? After you run ollama run (model), you can open another xterm and send a request with

ollama run (model) "search query"

Does that also not work?

I followed the instructions exactly (not the updated part about the Vulkan environment variable for running directly vs. as a service) and have no issues, so I don’t think that is the issue.

BTW, you can increase your xterm font size with a parameter in .Xresources or a direct command.

Thanks for these suggestions! What inference provider are you using for Hermes: a local model or one of the providers? I can already see how these other options are helpful for other avenues of usage. LM Studio allowed me to easily get the 4-bit quantized GGUF version of 9b qwen, and it offers much easier ways to attach files.

Lots of issues with running the LM Studio AppImage. EDIT: Never mind, I somehow missed the .deb download, and that worked easily.

I’m using Hermes with a local model, currently Qwen3.6-35B-A3B, served by LM Studio.

With the way agents work, I’m not sure if I would use it with a commercial AI. The agent has persistent memory, and it uses its memory to improve the quality of responses.

I was asking Hermes some questions about financial investing, and it used its memory of me to provide detailed information on the expected difference between buying stocks and using pension plans while taking my local tax laws into account. If you did the same with a commercial AI, you could be sending your real name, age, physical address, current income, etc. to the AI provider.

Exactly, I forgot to run "export OLLAMA_VULKAN=1" @corny

I’m still in the testing phase. It’ll take quite a while to “trust” AI with any personal info.

Thanks for sharing your use case and your updated setup. Perhaps I’m just unimaginative but coming up with personal use cases for AI is one of my flaws

It blows my mind the extent regular people (people in my life!) can so swiftly open their work and lives to these corporations, asking the kinds of confessional questions they might be unlikely to share even with their own family or doctors or therapists. Corporate AI is nearly the worst possible outlet you could choose to be open about your personal vulnerabilities. We are on a dark timeline and the intel you give away can and will be used against you, eventually. It may sound doomer, but if you look at this current world and think about how it’s likely to advance even in the short term, and perceive the obvious latent propagandistic value of what people share, trivially mined by these exact same LLMs, it’s so clear and so disappointing.

Sorry for the drive-by rant. Offline models ftw.

Thanks so much for this detailed guide!!

Could you share more insights into how you setup Hermes Agent? E.g. did you run the installer script in a template or appvm ($ curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash)? If you used a minimal Debian 13 template, what sort of customization did you need to do beforehand (apart from installing networking, curl, bash)?

The templates I use are custom GNOME minimal templates: a minimal template plus the Qubes OS packages needed for net access and the GNOME packages needed to use Nautilus and GNOME Terminal.

The VMs I use are standalone. You need to install so many packages that it’s not really practical to use templates. Hermes can also use tool calling to configure the system, which doesn’t work with templates either.

I have one VM running the model, and providing the API.
Currently, it’s using LM Studio, but I’ll switch it to running llama.cpp and using llama-server to provide the API. Ollama or LM Studio is easier to use, but llama.cpp has the best performance.

I have another VM running Hermes Agent, which is connected to the API using qrexec.

You can run everything in a single VM, but you probably want Hermes to have net access, and you might want to keep the API offline.

In the Hermes VM I used the installation script you linked. Once it’s done installing, you can use hermes setup to configure Hermes. In the provider menu, select custom, and point it to the API endpoint. If you are using qrexec to access the API, the endpoint should be http://127.0.0.1:11434/v1.

When the provider is configured, you can use hermes dashboard --tui to open the Hermes web interface in your default browser. From there you can continue to use and configure Hermes.

I’m curious, does tool usage work in such a setup?