AMD iGPU passthrough attempt

So here I am with a small PoC commit doing precisely this. And indeed there is some good news: the driver does load my VBIOS ROM and appears to like it, but things soon turn out not to be so fine (at first sight unrelated to the VBIOS), with…

  • a strange-looking MTRR write failure
  • some trouble with the PSP firmware failing to load, which makes the amdgpu driver bail out
  • … and then a bad pointer dereference (bug in the error path?) sends the kernel into a panic, possibly inducing a qemu segfault
  • … which results in an unresponsive Qubes and requires a hard poweroff
[2021-11-23 21:05:52] [    4.297684] amdgpu 0000:00:05.0: amdgpu: Fetched VBIOS from firmware file
[2021-11-23 21:05:52] [    4.297709] amdgpu: ATOM BIOS: 113-RENOIR-025
[2021-11-23 21:05:52] [    4.302046] [drm] VCN decode is enabled in VM mode
[2021-11-23 21:05:52] [    4.302066] [drm] VCN encode is enabled in VM mode
[2021-11-23 21:05:52] [    4.302078] [drm] JPEG decode is enabled in VM mode
[2021-11-23 21:05:52] [    4.302144] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[2021-11-23 21:05:52] [    4.302181] amdgpu 0000:00:05.0: amdgpu: VRAM: 512M 0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
[2021-11-23 21:05:52] [    4.302217] amdgpu 0000:00:05.0: amdgpu: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
[2021-11-23 21:05:52] [    4.302246] amdgpu 0000:00:05.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[2021-11-23 21:05:52] [    4.302268] mtrr: base(0x430000000) is not aligned on a size(0x20000000) boundary
[2021-11-23 21:05:52] [    4.302289] Failed to add WC MTRR for [000000000998bb55-00000000eb9e681e]; performance may suffer.
[2021-11-23 21:05:52] [    4.302295] [drm] Detected VRAM RAM=512M, BAR=512M
[2021-11-23 21:05:52] [    4.302341] [drm] RAM width 128bits DDR4
[2021-11-23 21:05:52] [    4.302401] [drm] amdgpu: 512M of VRAM memory ready
[2021-11-23 21:05:52] [    4.302412] [drm] amdgpu: 691M of GTT memory ready.
[2021-11-23 21:05:52] [    4.302437] [drm] GART: num cpu pages 262144, num gpu pages 262144
[2021-11-23 21:05:52] [    4.302565] [drm] PCIE GART of 1024M enabled.
[2021-11-23 21:05:52] [    4.302575] [drm] PTB located at 0x000000F400900000
[2021-11-23 21:05:52] [    4.312921] amdgpu 0000:00:05.0: amdgpu: PSP runtime database doesn't exist
[2021-11-23 21:05:52] [    4.342353] [drm] Loading DMUB firmware via PSP: version=0x01010019
[2021-11-23 21:05:52] [    4.346679] [drm] Found VCN firmware Version ENC: 1.14 DEC: 5 VEP: 0 Revision: 20
[2021-11-23 21:05:52] [    4.346723] amdgpu 0000:00:05.0: amdgpu: Will use PSP to load VCN firmware
[2021-11-23 21:05:52] [    4.978736] [drm] reserve 0x400000 from 0xf41f800000 for PSP TMR
[2021-11-23 21:05:52] 
[2021-11-23 21:05:52] Fedora 33 (Thirty Three)
[2021-11-23 21:05:52] Kernel 5.14.15-1.fc32.qubes.x86_64 on an x86_64 (hvc0)
[2021-11-23 21:05:52] 
[2021-11-23 21:05:52] sys-gui-gpu login: [    5.136770] input: dom0: AT Translated Set 2 keyboard as /devices/virtual/input/input7
...
[2021-11-23 21:05:55] [    7.675982] [drm] psp command (0xFFFFFFFF) failed and response status is (0xFFFFFFFF)
[2021-11-23 21:05:55] [    7.676007] [drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!
[2021-11-23 21:05:55] [    7.676213] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[2021-11-23 21:05:55] [    7.676371] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[2021-11-23 21:05:55] [    7.676530] amdgpu 0000:00:05.0: amdgpu: amdgpu_device_ip_init failed
[2021-11-23 21:05:55] [    7.676563] amdgpu 0000:00:05.0: amdgpu: Fatal error during GPU init
[2021-11-23 21:05:55] [    7.676578] amdgpu 0000:00:05.0: amdgpu: amdgpu: finishing device.
[2021-11-23 21:05:55] [    7.679044] amdgpu: probe of 0000:00:05.0 failed with error -22
[2021-11-23 21:05:55] [    7.679102] BUG: unable to handle page fault for address: ffffb1f120cdf000
[2021-11-23 21:05:55] [    7.679117] #PF: supervisor write access in kernel mode
[2021-11-23 21:05:55] [    7.679129] #PF: error_code(0x0002) - not-present page
[2021-11-23 21:05:55] [    7.679140] PGD 1000067 P4D 1000067 PUD 11dc067 PMD 0 
[2021-11-23 21:05:55] [    7.679154] Oops: 0002 [#1] SMP NOPTI
[2021-11-23 21:05:55] [    7.679163] CPU: 0 PID: 276 Comm: systemd-udevd Not tainted 5.14.15-1.fc32.qubes.x86_64 #1
[2021-11-23 21:05:55] [    7.679180] Hardware name: Xen HVM domU, BIOS 4.14.3 11/14/2021
[2021-11-23 21:05:55] [    7.679194] RIP: 0010:vcn_v2_0_sw_fini+0x10/0x40 [amdgpu]
[2021-11-23 21:05:55] [    7.679367] Code: 66 f0 83 c2 81 c6 ea 05 00 00 31 c9 4c 89 cf e9 b6 4d ee ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 8b 87 38 17 01 00 48 89 fd <c7> 00 00 00 00 00 e8 d5 d5 f1 ff 48 89 ef e8 2d 20 ff ff 85 c0 74
[2021-11-23 21:05:55] [    7.679402] RSP: 0018:ffffb1f1002cfc30 EFLAGS: 00010206
[2021-11-23 21:05:55] [    7.679414] RAX: ffffb1f120cdf000 RBX: ffff8b4d9a675620 RCX: 0000000000000000
[2021-11-23 21:05:55] [    7.679429] RDX: 000000000000000e RSI: 0000000000000003 RDI: ffff8b4d9a660000
[2021-11-23 21:05:55] [    7.679444] RBP: ffff8b4d9a660000 R08: 000000000000000f R09: 000000008010000f
[2021-11-23 21:05:55] [    7.679459] R10: 0000000040000000 R11: 000000001b99d000 R12: ffff8b4d9a675590
[2021-11-23 21:05:55] [    7.679474] R13: ffff8b4d9a676400 R14: 000000000000000c R15: ffff8b4d813ef36c
[2021-11-23 21:05:55] [    7.679490] FS:  000073bc16d48380(0000) GS:ffff8b4dbcc00000(0000) knlGS:0000000000000000
[2021-11-23 21:05:55] [    7.679507] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2021-11-23 21:05:55] [    7.679520] CR2: ffffb1f120cdf000 CR3: 0000000004160000 CR4: 0000000000350ef0
[2021-11-23 21:05:55] [    7.679536] Call Trace:
[2021-11-23 21:05:55] [    7.679545]  amdgpu_device_ip_fini.isra.0+0xb6/0x1e0 [amdgpu]
[2021-11-23 21:05:55] [    7.679691]  amdgpu_device_fini_sw+0xe/0x100 [amdgpu]
[2021-11-23 21:05:55] [    7.679835]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[2021-11-23 21:05:55] [    7.679978]  devm_drm_dev_init_release+0x3d/0x60 [drm]
[2021-11-23 21:05:55] [    7.680008]  devres_release_all+0xb8/0x100
[2021-11-23 21:05:55] [    7.680019]  really_probe+0x100/0x310
[2021-11-23 21:05:55] [    7.680029]  __driver_probe_device+0xfe/0x180
[2021-11-23 21:05:55] [    7.680040]  driver_probe_device+0x1e/0x90
[2021-11-23 21:05:55] [    7.680050]  __driver_attach+0xc0/0x1c0
[2021-11-23 21:05:55] [    7.680059]  ? __device_attach_driver+0xe0/0xe0
[2021-11-23 21:05:55] [    7.680070]  ? __device_attach_driver+0xe0/0xe0
[2021-11-23 21:05:55] [    7.680081]  bus_for_each_dev+0x89/0xd0
[2021-11-23 21:05:55] [    7.680090]  bus_add_driver+0x12b/0x1e0
[2021-11-23 21:05:55] [    7.680099]  driver_register+0x8f/0xe0
[2021-11-23 21:05:55] [    7.680109]  ? 0xffffffffc0e7b000
[2021-11-23 21:05:55] [    7.680117]  do_one_initcall+0x57/0x200
[2021-11-23 21:05:55] [    7.680128]  do_init_module+0x5c/0x260
[2021-11-23 21:05:55] [    7.680137]  __do_sys_finit_module+0xae/0x110
[2021-11-23 21:05:55] [    7.680149]  do_syscall_64+0x3b/0x90
[2021-11-23 21:05:55] [    7.680158]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[2021-11-23 21:05:55] [    7.680170] RIP: 0033:0x73bc17ce9edd
[2021-11-23 21:05:55] [    7.680180] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6b 7f 0c 00 f7 d8 64 89 01 48
[2021-11-23 21:05:55] [    7.680215] RSP: 002b:00007fffa9b51688 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[2021-11-23 21:05:55] [    7.680231] RAX: ffffffffffffffda RBX: 0000602da93e3120 RCX: 000073bc17ce9edd
[2021-11-23 21:05:55] [    7.680246] RDX: 0000000000000000 RSI: 000073bc17e2732c RDI: 0000000000000014
[2021-11-23 21:05:55] [    7.680260] RBP: 0000000000020000 R08: 0000000000000000 R09: 0000602da93e3bb0
[2021-11-23 21:05:55] [    7.680275] R10: 0000000000000014 R11: 0000000000000246 R12: 000073bc17e2732c
[2021-11-23 21:05:55] [    7.680290] R13: 0000602da9338960 R14: 0000000000000007 R15: 0000602da93e4000
[2021-11-23 21:05:55] [    7.680306] Modules linked in: joydev intel_rapl_msr amdgpu(+) intel_rapl_common crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ip6table_filter ip6table_mangle ip6table_raw ip6_tables iommu_v2 gpu_sched ipt_REJECT i2c_algo_bit nf_reject_ipv4 drm_ttm_helper ttm xt_state xt_conntrack iptable_filter iptable_mangle iptable_raw drm_kms_helper ehci_pci xt_MASQUERADE iptable_nat nf_nat nf_conntrack ehci_hcd cec nf_defrag_ipv6 serio_raw nf_defrag_ipv4 i2c_piix4 ata_generic pata_acpi pcspkr xen_scsiback target_core_mod xen_netback uinput xen_privcmd xen_gntdev drm xen_gntalloc xen_blkback fuse xen_evtchn bpf_preload ip_tables overlay xen_blkfront
[2021-11-23 21:05:55] [    7.876218] CR2: ffffb1f120cdf000
[2021-11-23 21:05:55] [    7.876227] ---[ end trace 36c4552e098fcc4e ]---
[2021-11-23 21:05:55] [    7.876239] RIP: 0010:vcn_v2_0_sw_fini+0x10/0x40 [amdgpu]
[2021-11-23 21:05:55] [    7.876400] Code: 66 f0 83 c2 81 c6 ea 05 00 00 31 c9 4c 89 cf e9 b6 4d ee ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 8b 87 38 17 01 00 48 89 fd <c7> 00 00 00 00 00 e8 d5 d5 f1 ff 48 89 ef e8 2d 20 ff ff 85 c0 74
[2021-11-23 21:05:55] [    7.876439] RSP: 0018:ffffb1f1002cfc30 EFLAGS: 00010206
[2021-11-23 21:05:55] [    7.876451] RAX: ffffb1f120cdf000 RBX: ffff8b4d9a675620 RCX: 0000000000000000
[2021-11-23 21:05:55] [    7.876467] RDX: 000000000000000e RSI: 0000000000000003 RDI: ffff8b4d9a660000
[2021-11-23 21:05:55] [    7.876483] RBP: ffff8b4d9a660000 R08: 000000000000000f R09: 000000008010000f
[2021-11-23 21:05:55] [    7.876500] R10: 0000000040000000 R11: 000000001b99d000 R12: ffff8b4d9a675590
[2021-11-23 21:05:55] [    7.876515] R13: ffff8b4d9a676400 R14: 000000000000000c R15: ffff8b4d813ef36c
[2021-11-23 21:05:55] [    7.876533] FS:  000073bc16d48380(0000) GS:ffff8b4dbcc00000(0000) knlGS:0000000000000000
[2021-11-23 21:05:55] [    7.876551] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2021-11-23 21:05:55] [    7.876565] CR2: ffffb1f120cdf000 CR3: 0000000004160000 CR4: 0000000000350ef0
[2021-11-23 21:05:55] [    7.876582] Kernel panic - not syncing: Fatal exception
[2021-11-23 21:05:55] [    7.877654] Kernel Offset: 0x1000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

And in the stubdom:

[2021-11-23 21:05:55] qemu[195]: segfault at 0 ip 00005caaf4d1a060 sp 00007fffa06b82b8 error 4 in qemu[5caaf4a9f000+3e9000]
[2021-11-23 21:05:55] Code: 48 8b 4c 24 20 e8 e0 3b 0f 00 48 83 c4 20 e9 a4 fe ff ff 0f 1f 80 00 00 00 00 48 8b 07 48 8b 00 48 8b 00 c3 66 0f 1f 44 00 00 <48> 8b 07 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 48 8b 07 0f b6 40
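
Coming back to the MTRR message in the guest log: the complaint itself is plain arithmetic, and a quick check shows why the kernel refuses the range. This is only a sketch using the two numbers printed in the "mtrr: base(...)" line; the helper name is mine, not something from the kernel.

```python
def mtrr_alignment_remainder(base: int, size: int) -> int:
    """Return base modulo size; 0 means base sits on a size boundary."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    return base & (size - 1)


base = 0x430000000  # base reported in the "mtrr: base(...)" log line
size = 0x20000000   # 512M, the size reported in the same line

remainder = mtrr_alignment_remainder(base, size)
print(hex(remainder))  # 0x10000000
```

So the base reported in the guest is only 256M-aligned while the range is 512M, hence the refusal. According to the message itself this only costs performance ("performance may suffer"), so it looks unlikely to be related to the PSP failure.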

The kernel crash does not seem too deep: the IP blocks are initialized in order, and it is the PSP init failure that causes the driver to stop and clean up. The crash occurs in vcn_v2_0_sw_fini, which dereferences the fw_shared_cpu_addr pointer initialized during VCN init. When the fault occurs the pointer is non-NULL, so could this be a use-after-free?
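
As a sanity check of that reading, the bytes on the oops "Code:" line can be decoded by hand. The sketch below does not use a real disassembler: it only checks the two x86-64 encodings I believe are involved (a load `48 8b 87 disp32`, i.e. `mov disp32(%rdi),%rax`, followed by the faulting `c7 00 imm32`, i.e. `movl $imm32,(%rax)`), using the tail of the Code line and the register values from the dump.

```python
# Tail of the "Code:" line from the oops; <..> marks the byte at RIP.
code = "55 48 8b 87 38 17 01 00 48 89 fd <c7> 00 00 00 00 00 e8 d5 d5 f1 ff"
rax = 0xffffb1f120cdf000  # RAX from the register dump
cr2 = 0xffffb1f120cdf000  # CR2, the faulting address

tokens = code.split()
fault_index = next(i for i, t in enumerate(tokens) if t.startswith("<"))
raw = bytes(int(t.strip("<>"), 16) for t in tokens)

# 48 8b 87 disp32: mov disp32(%rdi), %rax (load of the pointer)
load = raw[1:8]
is_load_from_rdi = load[:3] == bytes([0x48, 0x8B, 0x87])
displacement = int.from_bytes(load[3:7], "little")

# c7 00 imm32: movl $imm32, (%rax) (the store that faults)
store = raw[fault_index:fault_index + 6]
is_store_via_rax = store[:2] == bytes([0xC7, 0x00])
immediate = int.from_bytes(store[2:6], "little")

print(is_load_from_rdi, hex(displacement))     # True 0x11738
print(is_store_via_rax, immediate, rax == cr2)  # True 0 True
```

So the faulting instruction writes a 32-bit zero through a pointer just loaded from offset 0x11738 of the structure in RDI, and that pointer is exactly the CR2 address. This is consistent with a store through fw_shared_cpu_addr hitting a mapping that is no longer present, but it does not tell whether the memory was freed, unmapped, or never properly set up.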

Quite a few things to investigate and try next:

  • check whether that bug still happens in 5.15/5.16rc; if it is still there, use this occasion to play with KASAN – but it may not be that much of a blocker if I can…
  • … avoid using the PSP (move away the _ta and _asd firmwares, or use the module params ip_block_mask or fw_load_type)
  • check whether the suspect-looking points in the previous post have an impact here
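
For the ip_block_mask idea, here is a small sketch of how the value could be computed. It assumes (to be verified) that bit i of the mask enables the IP block with index i and that the default is all bits set. The PSP index used here is a made-up placeholder: the real one has to be read from the guest's dmesg, where the driver normally logs the block numbers as it adds them. Whether the driver can get anywhere on this APU with the PSP block disabled is exactly what remains to be tested.

```python
def mask_without_blocks(disabled_indices, width=32):
    """Return an all-ones mask of `width` bits with the given bit indices cleared."""
    mask = (1 << width) - 1
    for index in disabled_indices:
        if not 0 <= index < width:
            raise ValueError(f"block index {index} out of range")
        mask &= ~(1 << index)
    return mask


PSP_BLOCK_INDEX = 3  # hypothetical placeholder, NOT taken from the log above

mask = mask_without_blocks([PSP_BLOCK_INDEX])
print(f"amdgpu.ip_block_mask={mask:#x}")  # amdgpu.ip_block_mask=0xfffffff7
```

The printed string is in the form expected on the kernel command line; the same value can be given as a module option instead.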