I’m passing through two GPUs to a `sys-ollama` appVM and have been doing so for a while now.
Both GPUs are connected to PCIe 4.0 slots (x16 electrical) and running at PCIe 4.0 x8. They’re attached via `qvm-pci attach --persistent -o no-strict-reset=True -o permissive=True` (I also tried without the `no-strict-reset=True` and had the same behavior).
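For reference, the full attach invocation looks like this (the BDFs below are placeholders, not my actual slots):

```
# List assignable PCI devices to find each GPU's BDF
qvm-pci
# Persistently attach both GPUs; dom0:0b_00.0 / dom0:0c_00.0 are
# placeholder BDFs (qvm-pci writes BDFs with an underscore)
qvm-pci attach --persistent -o no-strict-reset=True -o permissive=True sys-ollama dom0:0b_00.0
qvm-pci attach --persistent -o no-strict-reset=True -o permissive=True sys-ollama dom0:0c_00.0
```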
The behavior: when I start up Qubes and look in dom0, I see what I expect for each of the GPUs:
```
$ for bdf in $(sudo lspci | grep -E 'VGA.*NVIDIA' | cut -d ' ' -f 1); do
    sudo lspci -s $bdf -vvv | grep LnkSta: ; done
LnkSta: Speed 16GT/s, Width x8 (downgraded)
LnkSta: Speed 16GT/s, Width x8 (downgraded)
```
The “downgraded” bit is because of the x8 and is expected.
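For anyone checking along: comparing `LnkCap` (what the device can do) against `LnkSta` (what was actually negotiated) is how I confirm that, and sysfs reports the same values:

```
$ for bdf in $(sudo lspci | grep -E 'VGA.*NVIDIA' | cut -d ' ' -f 1); do
    # LnkCap = advertised capability, LnkSta = negotiated link
    sudo lspci -s $bdf -vvv | grep -E 'LnkCap:|LnkSta:'
    # Same information via sysfs
    cat /sys/bus/pci/devices/0000:$bdf/{max_link_speed,current_link_speed}
  done
```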
What is not expected: when I start up sys-ollama, the `LnkSta` changes to the following, in BOTH dom0 and the appVM:

```
LnkSta: Speed 2.5GT/s (downgraded), Width x8 (downgraded)
LnkSta: Speed 2.5GT/s (downgraded), Width x8 (downgraded)
```
When I detach it, the link returns to full speed in dom0’s `lspci -vvv` output.
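For completeness, the detach/re-check cycle looks like this (`0b_00.0` is again a placeholder BDF):

```
# Detach one GPU from sys-ollama, then re-check the link in dom0
qvm-pci detach sys-ollama dom0:0b_00.0
sudo lspci -s 0b:00.0 -vvv | grep LnkSta:
# -> back to: LnkSta: Speed 16GT/s, Width x8 (downgraded)
```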
I’m almost certain I didn’t encounter this previously, and loading a model in the appVM now takes a very long time, so I think the slow link speed is being reported accurately. Maybe I’m imagining it, but I don’t think so…
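The NVIDIA driver inside the appVM can report the link too, which is another way to corroborate this (assuming `nvidia-smi` is trustworthy here; a 2.5GT/s link should show up as generation 1):

```
# Negotiated vs. maximum PCIe link, as seen by the NVIDIA driver in the appVM
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv
```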
There are a few differences since the last time I looked closely at this, when it was working as expected:
- I’ve made physical changes to the hardware configuration (notably, moving the GPUs around - an obvious variable of interest)
- Newer kernels (currently on `6.6.63-1.qubes.fc37.x86_64` in dom0 and the Debian 12 kernel `6.1.0-29-amd64` in the appVM)
- Probably at least 1 or 2 Qubes updates (this is Qubes R4.2.3)
- A few updates of the appVM, which likely included nvidia-drivers. Currently on `570.86.15`
Anyone knowledgeable about the hardware or software/virtualization side of this able to tell me what might be going on here?
It seems peculiar to me that dom0 reports the expected link speed until I attach the device to the appVM, and reports it again once I detach it.
I will attach `journalctl -b` / `dmesg` output if necessary, if this isn’t some simple known issue or a silly oversight on my part.
EDIT: I have AER enabled, and I checked both dom0 and the appVM; I don’t see any AER messages indicating PCIe issues.
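(For reference, this is roughly how I checked, in both dom0 and the appVM:)

```
# Look for AER / PCIe error reports in the current boot's logs
sudo journalctl -b | grep -iE 'aer|pcie bus error'
sudo dmesg | grep -iE 'aer|pcie bus error'
```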