Issues with GPU passthrough / amdgpu

I managed to install Win10 using qvm-create-windows-qube - which is a great tool, BTW, thanks for suggesting it!

And it seems the GPU actually does work in the VM, the VM can be restarted without any PCI errors etc. I haven’t done any extensive testing - essentially just installed pyopencl, drivers from AMD and tried some simple opencl examples. And it seems to work - there are no errors, the AMD tool shows increased GPU utilization etc.

So there’s a chance this might work with a Linux VM, hopefully. This Windows experiment just reminded me why I left that world a decade ago and how used I’m to the Linux environment.

FWIW there actually was one hiccup - I tried to inrease the amount of RAM for the Windows VM, because 1GB is rather low. I ramped it up only to find the VM no longer boots, because the SeaBIOS fails with

Boot failed: could not read the boot disk

I initially thought this happened because I attached the PCI devices, but after some experimentation I figures it’s the amount of RAM configured for the VM. The exact value where it breaks is 1299MB (while 1298MB still boots). I have no idea why.

2 Likes

Great ! Now you know that it can work.

  • For the linux vm, can you show your boot command line ? ( “cat /proc/cmdline” inside the linux vm )
  • Also try to add “pci=nomsi” in the vm boot parameter.
  • Can try different kernel version to check if the issue is always the same ? (You can download pre compiled kernel version from fedora to check quickly )

/proc/cmdline looks like this:

BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.0.16-300.fc37.x86_64 root=UUID=${UUID} ro rootflags=subvol=root rhgb quiet

I tried adding pci=nomsi but that didn’t change anything.

As for the kernel versions, I already did those experiments a couple days ago - I tried a couple older kernels available from fedora (5.19 and 6.0 IIRC) and then I built a couple custom kernels including 5.10, 5.15 and a bunch of others.

And I think the issues were always the same - no support for the GPU up to 5.14, and then it fails pretty much exactly the same way when loading/initializing the module (just like in the dmesg in my first post).

can you post the full “sudo dmesg” ?
you could check the driver issues tracker Issues · drm / amd · GitLab .
(I don’t have any others idea left)

Sure, here’s the dmesg (64.7 KB)

Thanks for the link to the issues, I’ll take a look. And thanks for all the help!

Seems also interesting but didn’t do any search:

[Sat Jan  7 21:49:42 2023] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[Sat Jan  7 21:49:42 2023] PCI: Using E820 reservations for host bridge windows
[Sat Jan  7 21:49:42 2023] ACPI: Enabled 2 GPEs in block 00 to 0F
[Sat Jan  7 21:49:42 2023] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
[Sat Jan  7 21:49:42 2023] acpi PNP0A03:00: _OSC: OS supports [ASPM ClockPM Segments EDR HPX-Type3]
[Sat Jan  7 21:49:42 2023] acpi PNP0A03:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[Sat Jan  7 21:49:42 2023] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.

[Sat Jan 7 21:49:42 2023] pci 0000:00:03.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]

[Sat Jan  7 21:49:42 2023] pci 0000:00:06.0: can't claim BAR 0 [mem 0x60e0000000-0x60efffffff 64bit pref]: no compatible bridge window
[Sat Jan  7 21:49:42 2023] pci 0000:00:06.0: can't claim BAR 2 [mem 0x60f2000000-0x60f21fffff 64bit pref]: no compatible bridge window

Interesting. As it’s related to “extended PCI configuration” it’s quite plausible it’s related to the issue. I wonder why it’s happening, though. I found this ([Resolved] No extended PCIe config space access - CentOS) which suggests to add memmap=, but no matter what parameters I tried I always get

Physical KASLR disabled: no suitable memory region!

Other posts that I found suggest it happens in VMs with too little memory, but I used 2GB (and even bumped it up to 4GB).

Also, I searched through the gitlab issues and there are some with very similar symptoms. E.g. this one (SMU fails to be resumed from runtime suspend (#2173) · Issues · drm / amd · GitLab) has almost the same failure. All the stuff I found was related to suspend / resume - not sure how is that related to the VM startup, but I guess it’s similar to the initialization that needs to happen on resume.

This suggests adding

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nokaslr"
If inserting `nokaslr` doesn't work, then insert `kaslr` instead.

I tried the nokaslr parameter, and that changes the things a bit - the “no suitable memory region” message disappears, and instead it prints

KASLR disabled: "nokaslr" on cmdline

but then the VM just dies / shuts down.

FWIW the dmesg says this

[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007effefff] usable
[    0.000000] BIOS-e820: [mem 0x000000007efff000-0x000000007effffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000fc00afff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000fc00b000-0x00000000ffffffff] reserved
[    0.010926] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.010927] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.185653] PCI: Using E820 reservations for host bridge windows
[    0.259646] e820: reserve RAM buffer [mem 0x0009fc00-0x0009ffff]
[    0.259647] e820: reserve RAM buffer [mem 0x7efff000-0x7fffffff]

so there should be usable memory between 1M and ~2130M, so I tried memmap=64M$1M a number of similar values. No luck, and there does not seem to be anything on the console before the VM shuts down.

Do you have a way to test with a much more ram ? like 8go ?

I tried with 8GB for this VM. Sadly, it doesn’t work, because the startup immediately fails with

Boot failed: could not read the boot disk

so I don’t even get to grub :frowning:

It seems the highest value that works is 3600MB - if I go any higher, it fails. This is suspiciously close to the 3.5GB threshold mentioned in windows gaming HVM guide. But I do have the qemu-stubdom-linux-rootfs patch in place, so that shouldn’t be it. :thinking:

I was wondering how come the Windows VM stopped booting at 1299MB and the Linux VM at 3601MB. I mean, it’s almost exactly twice the amount of RAM, that’s a bit suspicious. And I realized I increased the number of cpus for the Linux VM to 4, while Windows used the default 2. Which matches the 2x difference … so I bumped the number of cpus to 10, and the Linux VM booted with 8GB of RAM. I have no idea what’s causing this, though.

Anyway, no change on the GPU front - the MMCONFIG still fails, and the GPU does not work.

And if I try memmap=64M$0x100000000 (which should be usable region per dmesg), the boot fails (even with nokaslr).

3 Likes

since this patch qemu: fix TOLUD for PCI passthrough by mati7337 · Pull Request #44 · QubesOS/qubes-vmm-xen-stubdom-linux · GitHub , manually modifying rootfs should not be required. But this ram amount issue is interesting/strange

For my own issues and tests, I recompiled vmm-xen-stubdom-linux to remove this patch, then modified manually the rootfs. I got different result (if my hardware is not trolling me), could be worth trying.

1 Like

Yeah, the memory issue is pretty bizarre. It’s a fresh Qubes install so I’m mostly sure haven’t broken anything by messing with it. I’ll try applying the patch, worth a try.

I didn’t have time to try applying the patch, but I noticed a couple interesting thing about the memory limit issue while experimenting with the Windows VM - it only happens when the PCI devices (for the GPU) are attached.

The VM is created with 2GB of RAM, but as soon as I attach the PCI devices, it refuses to boot with more than 1198MB. If I detach the GPU devices, I can start the VM with 8GB of RAM just fine, there’s no issue with it. But as soon as I attach the PCI devices, I start getting

Boot failed: could not read the boot disk

The second interesting thing is the number of CPUs doesn’t seem to make a difference. On the Linux VM increasing the number of CPUs allowed me to use more memory, but on Windows it doesn’t seem to make any difference.

I think I have see this

But I also think I’ve seen some other solution to this here, which I can’t find at the moment.

I tried rebuilding the stubdom, but that didn’t make a significant difference - the amount of RAM that I can assign to the VM increased, but only a little bit, from 1198MB to 1298MB. It’s rather suspicious the difference is exactly 100MB …

I’ve also tried downgrading the package to 1.2.3-1 (which per VM with USB controller unable to start with more than 2200MB memory · Issue #7624 · QubesOS/qubes-issues · GitHub worked as a workaround)

qubes-dom0-update --action=downgrade xen-hvm-stubdom-linux

but then the VM doesn’t start at all. I guess something else changed, requiring the new stubdom version. So I’ve upgraded back to 1.2.5-1 again.

Anyway, I’ve never built this stuff for qubes, so let me explain what I did - maybe I missed something else:

  1. clone GitHub - QubesOS/qubes-vmm-xen-stubdom-linux
  2. FETCH_CMD=“wget -O” make get-sources
  3. cp dl/* ./
  4. install packages in the README, and also ninja-build and elfutils-libelf-devel
  5. make -f Makefile.stubdom
  6. copy build/rootfs/stubdom-linux-rootfs to dom0:/usr/libexec/xen/boot/qemu-stubdom-linux-rootfs

To compile packages/iso/template for qubes, usually you can use that: GitHub - QubesOS/qubes-builder: Qubes Builder or the new version that work well GitHub - QubesOS/qubes-builderv2: Next generation of Qubes OS builder , you configure the builder.conf or builder.yml to build what you want then follow the github instruction.

And for the patch things, something like that GitHub - neowutran/qubes-vmm-xen-stubdom-linux at tolud , the important bit is Comparing QubesOS:main...neowutran:tolud · QubesOS/qubes-vmm-xen-stubdom-linux · GitHub to remove the tolud patch

( I don’t know if it will solve nor help for your issue, but I am having some similar issue on my new hardware, and removing this helped me and changed the behavior )