Typically with enterprise cards, you can divide the GPU VRAM into fixed chunks while the compute load is still shared dynamically. So let's say you have a 16GB GPU and 4 VMs: you can split the 16GB into 4GB per VM, but the compute goes wherever it is needed at the moment.
Yes, you would need the enterprise card firmware and to flash it onto a consumer A770. The Level1Techs forum is a good place to start.
That is the whole point as I understand it. You can use it on all AMD cards now. But again, not sure.
I have Ollama running as an offline qube, and other qubes can access the API with qvm-connect-tcp.
It binds local port 11434 to the same port on the Ollama API qube; most applications just work out of the box because it looks like Ollama is running locally on the same machine.
Modifying the firewall rules to allow remote connections also works, but not all applications allow remote connections without HTTPS, and Ollama only supports HTTP.
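In case it helps, this is roughly what the setup looks like on my side. The qube names (ollama-vm and client-vm) are placeholders, so adjust them to your own:

```
# dom0: add a policy line to /etc/qubes/policy.d/30-user.policy
# allowing client-vm to open a TCP tunnel to port 11434 on ollama-vm
qubes.ConnectTCP +11434 client-vm @default allow target=ollama-vm

# client-vm: bind local port 11434 to port 11434 of the qube the policy points to
qvm-connect-tcp ::11434

# from now on, anything in client-vm that talks to http://127.0.0.1:11434
# behaves as if Ollama were installed locally
```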
Mesa 25 has not been released yet; it is expected by March. However, this is a very exciting development, and I'm wondering if/when Qubes OS is going to take advantage of it.
I just bought an Nvidia A2 to try and use the SR-IOV / vGPU feature with Qubes.
Will report back in a couple of weeks when I get around to testing.
Awesome. Looking forward to reading your report.
Thanks for the clarification on qvm-connect-tcp. I was not aware of that tool, although I am familiar with the general QubesOS architecture and knew that it works somewhat along those lines.
Please give us a detailed report on this. However, if I understand the new AMDGPU feature correctly, SR-IOV is no longer needed. So it comes down to quality of implementation and ease of use.
I am looking forward to the RTX 5090 with 32GB VRAM. This will be great for Ollama and Stable Diffusion/FLUX. With an average laptop GPU with 8GB VRAM, I would say Llama 3.1 8B or Llama 3.2 3B should work quite well in their standard Ollama quantization of Q4. How do you use Ollama and which models do you find most useful? Any uncensored models?
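For reference, checking whether those fit on an 8GB card is quick; the tags below are how the models are currently listed in the Ollama library, so they may change:

```
# pull the default Q4 builds and see how much memory they actually take
ollama pull llama3.1:8b
ollama pull llama3.2:3b
ollama run llama3.1:8b "hello"
ollama ps    # shows the loaded model, its size and the GPU/CPU split
```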
Can you document the exact steps please? I know how to use Ollama; I mean the Qubes OS part.
You basically have the API exposed and then you access it with qvm-connect-tcp from another qube, with a dom0 permission check? So you could run Ollama in one VM and connect to it with Open WebUI from another?
I don’t think the 5090 is worth $2000, not unless you are filthy rich and the price doesn’t matter. If you buy two 16 GB 4060s, you get 32 GB of VRAM for a third of the price. Passthrough with two GPUs works without any issues on my system, and Ollama will just use both GPUs out of the box.
Using two GPUs works on most motherboards. I use the full x16 slot for the GPU I also use for gaming, and the GPU I only use for AI sits in the x4 slot.
Another advantage of two GPUs over one: you can assign them to different VMs, an option you don’t get with a single GPU.
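For reference, attaching both cards from dom0 looks roughly like this on my machine; the qube name and the PCI addresses are examples, so check your own first:

```
# dom0: find the PCI addresses of both GPUs
qvm-pci | grep -i nvidia

# dom0: attach both cards to the AI qube (addresses are examples)
qvm-pci attach --persistent ollama-vm dom0:01_00.0
qvm-pci attach --persistent ollama-vm dom0:04_00.0

# some cards additionally need --option permissive=true or
# --option no-strict-reset=true to start reliably
```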
Before buying a 5090, I would wait and see if Intel makes a 24GB Arc B580 GPU. If they do, and they are reasonably priced, you could maybe get 48 GB of VRAM at half the price of the 5090.
For a general-purpose assistant, I’m currently using Qwen2.5 14B and Gemma 27B; I believe they are both Q4.
For coding, I use qwen2.5-coder 14B as the assistant and 3B for code completion.
I mostly use Page Assist for browser integration and Continue for VS Code integration; they can both switch between all models available in Ollama.
This explains how to use qvm-connect-tcp
https://www.qubes-os.org/doc/firewall/#opening-a-single-tcp-port-to-other-network-isolated-qube
Any application that supports AI integration will be able to use Ollama directly.
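Once the tunnel is up, you can verify it from the client qube, assuming the default port 11434:

```
# inside the client qube, after qvm-connect-tcp is running
curl http://127.0.0.1:11434/api/tags   # lists the models available in the Ollama qube
```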
I know that Ollama can handle/share multiple GPUs in terms of VRAM. I never tried it, so please explain the setup procedure. From my understanding, you glue them together somehow to get more VRAM for models (please tell me how), but inference performance does not scale that way. Also, for other things like image or video generation, VRAM does not pool across cards like that. Here a 5090 is still a good deal in my opinion.
I have already used an Intel A580 with Ollama, but the setup pain, and bumping into limitations as soon as the community does not provide easy Docker containers, is not worth the effort in my opinion. Same with ROCm and AMD cards. It is a great deal when it works, but super annoying when you want to use that one thing that is only NVIDIA-compatible.
You don’t need to glue anything together; you just add both GPUs to the VM and it works automatically. You are also wrong about the performance: you get twice the tokens/s with two GPUs compared to one.
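You can see it for yourself by watching the VRAM while a model is answering a prompt; the layers get spread across both cards:

```
# inside the qube, while a prompt is being processed
nvidia-smi   # both GPUs show part of the model in their memory
ollama ps    # shows how the loaded model is placed (GPU/CPU split)
```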
I know there have been efforts to make efficient use of multiple GPUs. Looks like the Ollama team have already added that, which is great news. Thanks for the update! Unfortunately, this does not apply to image/video generation.
Just wanted to give a quick update on Nvidia vGPU, even though no progress has been made. I have not been able to get a 90-day trial for NVIDIA vGPU so far… It took their sales department a month to get back to me, and after a quick call with their sales rep there’s radio silence again.
Waiting to get pricing info and the trial so I can download the vGPU drivers.
He did mention something about the base license for vGPU being a mid four-figure number; I hope I just misunderstood him, but probably not.
I’ll definitely still test whether it works in Qubes once I get the drivers, but if the cost is that high I’d rather sell the A2 and stuff 4 smaller GPUs in my workstation.
I’m new to the Qubes forum, and I think I can help you with the drivers. Nvidia does not release a driver specifically for Qubes, but there are drivers for Xen-based hypervisors like Citrix and XenServer. Is there a way to DM people here? If so, reach out to me, or we can continue in the thread. Let me know where we can connect on this.
Thanks for the offer man!
I was able to get a 30-day trial from Nvidia after all and downloaded the different variants of the drivers. They have KVM, RHEL, and XenServer versions.
I tried to install the drivers according to their documentation (Virtual GPU Software User Guide - NVIDIA Docs), but no dice. The kernel driver module gets compiled, and after compilation is done the installer tries to load the new module and returns an error.
The Nvidia driver needs the exact same version of gcc installed on the system that the kernel was built with. I haven’t had time to figure that part out yet.
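In case someone wants to try before I do: the compiler the running kernel was built with shows up in /proc/version, and the NVIDIA installer can be pointed at a matching gcc through the CC environment variable (the installer file name below is just an example):

```
# which gcc was the running kernel built with?
cat /proc/version

# point the installer at a matching compiler, e.g. gcc-13
sudo CC=/usr/bin/gcc-13 ./NVIDIA-Linux-x86_64-vgpu-kvm.run

# there is also a --no-cc-version-check switch, but mixing compiler
# versions for a kernel module is asking for trouble
```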
The Nvidia vGPU design is suboptimal for use with Qubes, since it requires a connection to an Nvidia License Server to unlock the full performance of the vGPUs.
If I understand the docs correctly, a locally installable License Server used to be available, but it has been deprecated and one would need to use the Nvidia Cloud License Server.
There is an open-source implementation of the Nvidia License Server available though.
Either way, I noticed this HCL report: the user was able to activate SR-IOV for the Intel iGPU and use it successfully. I hope we’ll get a how-to soon, since I have two devices with iGPUs that should be compatible.
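For anyone who wants to check their own hardware: whether the iGPU advertises SR-IOV shows up in lspci, and the virtual functions are controlled through sysfs. This assumes a graphics driver that actually supports it, which is the hard part; 00:02.0 is the usual iGPU address:

```
# does the iGPU expose an SR-IOV capability?
sudo lspci -vvv -s 00:02.0 | grep -i -A3 sr-iov

# how many virtual functions could the driver create?
cat /sys/bus/pci/devices/0000:00:02.0/sriov_totalvfs

# enable e.g. 2 virtual functions (only works if the driver supports SR-IOV)
echo 2 | sudo tee /sys/bus/pci/devices/0000:00:02.0/sriov_numvfs
```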
I will most likely sell the Nvidia A2 GPU again, since I don’t have the time to dig in deep enough to get it running. If somebody is itching to do the necessary work, I would be open to loaning the GPU out for a couple of months.
Hoping for the same, and that is why I had to pick Intel over the AMD CPUs. Honestly, if SR-IOV were enabled in the Intel Arc Alchemist series of GPUs, they would sell like hot cakes.
Hoping for the Intel Data Center Flex GPUs to hit the used market.
Someday perhaps…
I have a couple of Tesla P4s. I have not used them with Qubes, but I think I might know the missing part you need. It is from the XCP-ng forums; you might want to take a look before selling the card.
link - GPU support and Nvidia Grid vGPU | XCP-ng and XO forum
I had success when I was using XCP-ng. Take a look; I do not know what else you need to modify for Qubes.
edit - Take a look at the entire forum post to get the full context.