AMD iGPU passthrough attempt

FWIW, kernel 5.14.15 gave me problems in dom0, resulting in a bootloop.

Seemed to be amdgpu related. The issue was present in 5.14.15 and 5.14.16, but not in 5.14.17.

Ref: dom0 boot loop with kernel-latest-5.14.15 · Issue #7089 · QubesOS/qubes-issues · GitHub

That must be an ASIC-specific issue then: I see no such issue with the RENOIR. However, I still have my NAVI14 dGPU (RX 5500 M) disabled because of a boot loop too.

Since the kernel panic (which induces a qemu crash and forces me to power down) is linked to VCN, let’s check what happens when we disable this non-essential IP block (and the equally non-essential jpeg one while I’m at it), with amdgpu.ip_block_mask=0xff. More IP blocks get finalized, and we then hit a new warning:

[2021-11-28 13:54:36] <4>[    7.604916] amdgpu: probe of 0000:00:05.0 failed with error -22
[2021-11-28 13:54:36] <6>[    7.605226] [drm] sw_fini of IP block <dm>...
[2021-11-28 13:54:36] <6>[    7.605252] [drm] sw_fini of IP block <sdma_v4_0>...
[2021-11-28 13:54:36] <6>[    7.605275] [drm] sw_fini of IP block <gfx_v9_0>...
[2021-11-28 13:54:36] <4>[    7.605426] ------------[ cut here ]------------
[2021-11-28 13:54:36] <4>[    7.605437] WARNING: CPU: 1 PID: 278 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0x2d1/0x300 [ttm]
[2021-11-28 13:54:36] <4>[    7.605465] Modules linked in: intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul crc32c_intel joydev ghash_clmulni_intel serio_raw pcspkr amdgpu(+) ip6table_filter ip6table_mangle ip6table_raw ip6_tables ipt_REJECT nf_reject_ipv4 iommu_v2 gpu_sched xt_state xt_conntrack i2c_algo_bit drm_ttm_helper ttm iptable_filter iptable_mangle drm_kms_helper iptable_raw xt_MASQUERADE cec ehci_pci ata_generic ehci_hcd i2c_piix4 iptable_nat nf_nat nf_conntrack pata_acpi nf_defrag_ipv6 nf_defrag_ipv4 xen_scsiback target_core_mod xen_netback uinput xen_privcmd xen_gntdev xen_gntalloc drm xen_blkback fuse xen_evtchn bpf_preload ip_tables overlay xen_blkfront
[2021-11-28 13:54:36] <4>[    7.605611] CPU: 1 PID: 278 Comm: systemd-udevd Not tainted 5.15.4-1.fc32.qubes.x86_64 #1
[2021-11-28 13:54:36] <4>[    7.605629] Hardware name: Xen HVM domU, BIOS 4.14.3 11/25/2021
[2021-11-28 13:54:36] <4>[    7.605644] RIP: 0010:ttm_bo_release+0x2d1/0x300 [ttm]
[2021-11-28 13:54:36] <4>[    7.605658] Code: 35 25 00 00 e9 83 fd ff ff e8 7b ae 33 f5 e9 bc fd ff ff 49 8b 7e 98 b9 30 75 00 00 31 d2 be 01 00 00 00 e8 f1 d2 33 f5 eb a2 <0f> 0b e9 50 fd ff ff e8 33 b4 33 f5 e9 fd fe ff ff be 03 00 00 00
[2021-11-28 13:54:36] <4>[    7.605693] RSP: 0018:ffffbd34002dbbe0 EFLAGS: 00010202
[2021-11-28 13:54:36] <4>[    7.605705] RAX: 0000000000000001 RBX: ffff9761144e92e0 RCX: 000000000000000f
[2021-11-28 13:54:36] <4>[    7.605720] RDX: 0000000000000001 RSI: ffffe8b5c0313200 RDI: ffff97610c4e79b8
[2021-11-28 13:54:36] <4>[    7.605736] RBP: ffff9761144e5270 R08: ffff97610c4e79b8 R09: ffffe8b5c0312d00
[2021-11-28 13:54:36] <4>[    7.605751] R10: 0000000000000000 R11: 0000000000000004 R12: ffff9761144f5a70
[2021-11-28 13:54:36] <4>[    7.605766] R13: ffff97610c4e7858 R14: ffff97610c4e79b8 R15: ffff976101bbf37c
[2021-11-28 13:54:36] <4>[    7.605782] FS:  00007203fc664380(0000) GS:ffff97613cd00000(0000) knlGS:0000000000000000
[2021-11-28 13:54:36] <4>[    7.605799] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2021-11-28 13:54:36] <4>[    7.605811] CR2: 00007ca9fc42a000 CR3: 0000000008a2e000 CR4: 0000000000350ee0
[2021-11-28 13:54:36] <4>[    7.605828] Call Trace:
[2021-11-28 13:54:36] <4>[    7.605835]  <TASK>
[2021-11-28 13:54:36] <4>[    7.605843]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[2021-11-28 13:54:36] <4>[    7.606024]  gfx_v9_0_sw_fini+0xca/0x1a0 [amdgpu]
[2021-11-28 13:54:36] <4>[    7.606180]  amdgpu_device_ip_fini.isra.0.cold+0x27/0x55 [amdgpu]
[2021-11-28 13:54:36] <4>[    7.606369]  amdgpu_device_fini_sw+0x16/0x100 [amdgpu]
[2021-11-28 13:54:36] <4>[    7.606514]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[2021-11-28 13:54:36] <4>[    7.606657]  devm_drm_dev_init_release+0x3d/0x60 [drm]
[2021-11-28 13:54:36] <4>[    7.606686]  devres_release_all+0xb8/0x100
[2021-11-28 13:54:36] <4>[    7.606700]  really_probe+0x100/0x310
[2021-11-28 13:54:36] <4>[    7.606710]  __driver_probe_device+0xfe/0x180
[2021-11-28 13:54:36] <4>[    7.606722]  driver_probe_device+0x1e/0x90
[2021-11-28 13:54:36] <4>[    7.606732]  __driver_attach+0xc0/0x1c0
[2021-11-28 13:54:36] <4>[    7.606741]  ? __device_attach_driver+0xe0/0xe0
[2021-11-28 13:54:36] <4>[    7.606753]  ? __device_attach_driver+0xe0/0xe0
[2021-11-28 13:54:36] <4>[    7.606763]  bus_for_each_dev+0x89/0xd0
[2021-11-28 13:54:36] <4>[    7.606773]  bus_add_driver+0x12b/0x1e0
[2021-11-28 13:54:36] <4>[    7.606782]  driver_register+0x8f/0xe0
[2021-11-28 13:54:36] <4>[    7.606791]  ? 0xffffffffc0db9000
[2021-11-28 13:54:36] <4>[    7.606800]  do_one_initcall+0x57/0x200
[2021-11-28 13:54:36] <4>[    7.606811]  do_init_module+0x5c/0x260
[2021-11-28 13:54:36] <4>[    7.606821]  __do_sys_finit_module+0xae/0x110
[2021-11-28 13:54:36] <4>[    7.802026]  do_syscall_64+0x3b/0x90
[2021-11-28 13:54:36] <4>[    7.802038]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[2021-11-28 13:54:36] <4>[    7.802051] RIP: 0033:0x7203fd605edd
[2021-11-28 13:54:36] <4>[    7.802061] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6b 7f 0c 00 f7 d8 64 89 01 48
[2021-11-28 13:54:36] <4>[    7.802099] RSP: 002b:00007fffb0573118 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[2021-11-28 13:54:36] <4>[    7.802117] RAX: ffffffffffffffda RBX: 00006397ad8e6370 RCX: 00007203fd605edd
[2021-11-28 13:54:36] <4>[    7.802133] RDX: 0000000000000000 RSI: 00006397ad8e6f80 RDI: 0000000000000014
[2021-11-28 13:54:36] <4>[    7.802149] RBP: 0000000000020000 R08: 0000000000000000 R09: 00006397ad8e6fe0
[2021-11-28 13:54:36] <4>[    7.802166] R10: 0000000000000014 R11: 0000000000000246 R12: 00006397ad8e6f80
[2021-11-28 13:54:36] <4>[    7.802182] R13: 00006397ad8e0710 R14: 0000000000000000 R15: 00006397ad8e74f0
[2021-11-28 13:54:36] <4>[    7.802199]  </TASK>
[2021-11-28 13:54:36] <4>[    7.802206] ---[ end trace b49c9edf581387d3 ]---
[2021-11-28 13:54:36] <6>[    7.802286] [drm] sw_fini of IP block <smu>...
[2021-11-28 13:54:36] <6>[    7.802302] [drm] sw_fini of IP block <psp>...
[2021-11-28 13:54:36] <6>[    7.802332] [drm] sw_fini of IP block <vega10_ih>...
[2021-11-28 13:54:36] <6>[    7.802496] [drm] sw_fini of IP block <gmc_v9_0>...
[2021-11-28 13:54:36] <4>[    7.802519] ------------[ cut here ]------------
[2021-11-28 13:54:36] <4>[    7.802530] Memory manager not clean during takedown.

This one seems to point to a GPU memory management issue. I guess I’ll stop chasing those downstream crashes here. At least this one doesn’t crash qemu, which spares me some reboots.
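For readers wondering how the ip_block_mask value above is put together: as far as I understand the driver, each bit of the mask enables the IP block with the same index, and the indices are the ones amdgpu prints at load time ("add ip block number N <name>"). A minimal sketch, assuming an ASIC with 10 IP blocks where vcn and jpeg sit at indices 8 and 9 (an assumption inferred from 0xff being enough to disable them here; check your own dmesg before reusing it). The helper name ip_block_mask is made up for illustration:

```shell
#!/bin/sh
# Build an amdgpu.ip_block_mask value: start with all blocks enabled,
# then clear the bit of every block index we want disabled.
# Usage: ip_block_mask TOTAL_BLOCKS INDEX_TO_DISABLE...
ip_block_mask() {
    total=$1
    shift
    mask=$(( (1 << total) - 1 ))          # one bit set per existing block
    for idx in "$@"; do
        mask=$(( mask & ~(1 << idx) ))    # clear the bit of a disabled block
    done
    printf '0x%x\n' "$mask"
}

# Assumed layout: 10 blocks, vcn at index 8, jpeg at index 9
ip_block_mask 10 8 9    # prints 0xff
```

The result goes on the guest kernel command line as amdgpu.ip_block_mask=0xff, as in the test above.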

Progress has been slow, and has happened mostly on an amd-gfx thread. Only today did I see the guest amdgpu driver start up for the first time. This is a big step, but there are still a couple of glitches getting in the way of video output.
With a bit of luck, Santa may be only slightly late with this Christmas present :wink:


Damn this post and the linked/related ones are a great way to understand how things work under the hood ! ^^

Just a noob remark: have you tried blacklisting amdgpu in dom0 and assigning the device to xen-pciback? I haven’t read anywhere that you tried it.
This would prevent dom0 and/or the driver from doing nasty things with your GPU before passing it through!

Below is my working method for a RX580, maybe that works for you too ?

Some notes before
  • I know the RX580 is not an iGPU, and I’m using it with a Ryzen desktop CPU (Ryzen 1700X), and there are many things I don’t know, but this method may be of help to others
  • the RX580 card has no FLR, is on the primary x16 PCI slot, so it’s used for displaying BIOS POST and early kernel messages, then xen-pciback seizes it, and the display switches to my other GPU, fortunately an Nvidia (so no driver conflict).
  • these instructions are for a Debian-based dom0, please adapt them carefully. I just started with Qubes, so I don’t know the correct paths and don’t wanna say 5h!t ! ^^
  • the RX580 must NEVER leave the pci-assignable pool, or hell will fall on you.

Steps

1. Modules config

  • First ensure that /etc/modules or modprobe.d/ contains this
    (PS: it’s already done on Qubes, in /etc/sysconfig/modules/qubes-dom0.modules)
xen-pciback
  • In /etc/modprobe.d/atigpu-blacklist.conf (for Qubes /etc/sysconfig/modules/atigpu-blacklist.conf seems the right place)
blacklist amdgpu

As you also have an AMD dGPU, I think you need an extra step to reload the driver once the domU containing the iGPU is started, but I haven’t tested it: my setup uses an Nvidia GPU for dom0, so it’s easier.
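After a reboot with this config, one way to check that the card ended up with the right owner is to look at the driver symlink in sysfs. A minimal sketch, assuming the card sits at 0000:25:00.0 as in my setup; the function name and the optional sysfs-root argument are made up here (the latter only to make the helper testable). As far as I know, xen-pciback shows up in sysfs under the driver name "pciback":

```shell
#!/bin/sh
# Print the kernel driver currently bound to a PCI device, or "none".
# $1: full PCI address, e.g. 0000:25:00.0
# $2: sysfs root (hypothetical override for testing, defaults to /sys)
bound_driver() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        # The symlink points at .../bus/pci/drivers/<name>
        basename "$(readlink "$link")"
    else
        echo none
    fi
}

# Example address from this post; adapt to your own device
bound_driver 0000:25:00.0
```

If this prints amdgpu in dom0, the blacklist did not take effect; pciback is what you want to see before starting the domU.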

2. initramfs config

  • Create a new script like /usr/share/initramfs-tools/scripts/init-top/zload_xen-pciback, and don’t forget to chmod +x zload_xen-pciback, it’s a sh script.
    PS: no idea where this script should be in Qubes !
#!/bin/sh
modprobe xen-pciback hide=\(0000:25:00.0\)\(0000:25:00.1\)
  • In /usr/share/initramfs-tools/scripts/init-top/udev
    PS: no idea where this script should be in Qubes !
# change
PREREQS=""
# to
PREREQS="zload_xen-pciback"
  • Last thing, don’t forget to regenerate your initramfs (this too I dunno how to do on Qubes/Fedora).
  • To correctly adapt the paths to Qubes, read the “credit link” below. In short, in Debian, initramfs scripts in /usr take precedence over initramfs scripts in /etc.
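On regenerating the initramfs: Debian's initramfs-tools ships update-initramfs, while Fedora-based systems (which is what Qubes dom0 is built on) use dracut. A small sketch that only prints the command to run, depending on which tool is installed; the function name is made up for illustration. Note that the init-top scripts above are specific to initramfs-tools and would not carry over to a dracut-based system as is:

```shell
#!/bin/sh
# Print the command that would regenerate the initramfs for the running
# kernel on this system. Nothing is executed here; run the printed
# command as root yourself once the scripts are in place.
initramfs_regen_cmd() {
    if command -v update-initramfs >/dev/null 2>&1; then
        echo "update-initramfs -u"     # Debian/Ubuntu (initramfs-tools)
    elif command -v dracut >/dev/null 2>&1; then
        echo "dracut --force"          # Fedora-based systems
    else
        echo "unknown"
    fi
}

initramfs_regen_cmd
```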

3. End credits ^^

Voilà, I hope it works for you !
For more detailed explanations of how and why it works, and the credit for inspiration, check this link.

Sure, but I’m using pci-stub for this rather than xen-pciback.
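For reference, pci-stub is usually pointed at a device by vendor:device ID on the kernel command line (pci-stub.ids=...), rather than by bus address as with xen-pciback's hide= option. A minimal sketch of how to derive that ID from sysfs, not necessarily the exact setup used here; the function name and the optional sysfs-root argument are made up for illustration:

```shell
#!/bin/sh
# Print "vendor:device" for a PCI address, in the form pci-stub.ids= expects.
# $1: full PCI address, e.g. 0000:00:05.0
# $2: sysfs root (hypothetical override for testing, defaults to /sys)
pci_stub_id() {
    dev="${2:-/sys}/bus/pci/devices/$1"
    vendor=$(cat "$dev/vendor") || return 1
    device=$(cat "$dev/device") || return 1
    # sysfs stores the IDs with a 0x prefix; pci-stub.ids wants them bare
    echo "${vendor#0x}:${device#0x}"
}

# Typical use (address is an example, adapt to your device):
#   echo "pci-stub.ids=$(pci_stub_id 0000:00:05.0)"
```

Several IDs can be given as a comma-separated list.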

Just noticed an interesting patch on amd-gfx, will have to get back to testing this soon: [PATCH] drm/amdgpu/gmc: use PCI BARs for APUs in passthrough

I couldn’t find a lot of info about pci-stub, do you have docs/pointers please ? The Xen wiki recommends using pciback as it works for PV and HVM, so I’ve always thought it was more up-to-date.

And sorry to insist, but have you tried the initramfs+udev method above ? I’ve found it working better for my (dedicated) AMD GPU (and other devices loaded early like additional SATA controllers).
From what I understand, usual blacklisting/modprobing happens too late for some devices, as xen-pciback/pci-stub is only loaded after the device module is loaded.
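A crude way to see who got there first is to look at which driver shows up first in the kernel log. A minimal sketch, assuming plain text matching on the driver names is good enough (it is not a real parser, and the function name is made up); it skips the line echoing the boot command line, since that one may mention both names:

```shell
#!/bin/sh
# Read a kernel log on stdin and print which driver is mentioned first:
# "pciback", "amdgpu", or "neither".
first_seen() {
    awk '
        /[Cc]ommand line/ { next }    # skip the echo of the boot command line
        /pciback/ { print "pciback"; found = 1; exit }
        /amdgpu/  { print "amdgpu";  found = 1; exit }
        END { if (!found) print "neither" }
    '
}

# Typical use (needs root on most systems):
#   dmesg | first_seen
```

If amdgpu comes out first in dom0, the GPU driver touched the device before it was hidden, which is exactly what the initramfs method tries to avoid.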

Btw, thanks for sharing your tests, don’t stop, I learn a lot even if not understanding most parts ^^

@yann - did you get any further with this by any chance?

I opened a feature request on the main AMD driver repo: Enable the AMDGPU driver to work for a Ryzen APU (E.g. 5700G) passed through to a Linux (or Windows) VM (#2046) · Issues · drm / amd · GitLab

I feel as though we were nearly there, but I’ve not heard anything further from the main contributor.