Hunting for Qubes compatibility issues on MSI Bravo 17

My original HCL for reference: [qubes-users] HCL - MSI Bravo 17

I can boot this laptop from a Debian 11 live (xfce+nonfree) USB, and I can see that Debian’s Linux 5.10.46:

  • properly supports VFIO
  • properly suspends and resumes

… while Qubes 4.1 with 5.10.47 (and 5.12) does not.

Some collected information follows. I’ll dig from there, but if anyone can suggest experiments and/or tell me that some of those differences are normal and harmless, it would help me focus on the right stuff.

At first sight I suspect a link between the memory differences and the suspend issue, with differences in MTRR and e820 handling being one suspect. I’m not too familiar with how the hypervisor plays with those.

On the VFIO side I’m planning to enable or add some traces to understand why it does not see the IOMMU.

dmesg

The dmesg diff notably shows:

  • a memory region that Qubes shows as a single reserved range, but that Debian splits into a reserved range followed by a “type 20” range:

    -[    0.000000] BIOS-e820: [mem 0x00000000ab98e000-0x00000000ad579fff] reserved
    -[    0.000000] BIOS-e820: [mem 0x00000000ad57a000-0x00000000ad5fefff] type 20
    +[    0.000000] Xen: [mem 0x00000000ab98e000-0x00000000ad5fefff] reserved
    
  • Debian shows more info from e820 about memory ranges, with an impact on hibernation:

    -[    0.000438] e820: update [mem 0xb0000000-0xffffffff] usable ==> reserved
    -[    0.003572] esrt: Reserving ESRT space from 0x00000000a8c45c18 to 0x00000000a8c45c50.
    -[    0.003580] e820: update [mem 0xa8c45000-0xa8c45fff] usable ==> reserved
    -[    0.003592] e820: update [mem 0xa5bd0000-0xa5bd2fff] usable ==> reserved
    -[    0.003624] Using GB pages for direct mapping
    -[    0.010519] e820: update [mem 0xa6334000-0xa6427fff] usable ==> reserved
    -[    0.010526] smpboot: Allowing 16 CPUs, 0 hotplug CPUs
    -[    0.010543] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
    -[    0.010544] PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000fffff]
    -[    0.010545] PM: hibernation: Registered nosave memory: [mem 0x09bff000-0x09ffffff]
    -[    0.010546] PM: hibernation: Registered nosave memory: [mem 0x0a200000-0x0a20cfff]
    -[    0.010548] PM: hibernation: Registered nosave memory: [mem 0xa5bd0000-0xa5bd2fff]
    -[    0.010549] PM: hibernation: Registered nosave memory: [mem 0xa6334000-0xa6427fff]
    -[    0.010550] PM: hibernation: Registered nosave memory: [mem 0xa8c45000-0xa8c45fff]
    -[    0.010551] PM: hibernation: Registered nosave memory: [mem 0xaa26b000-0xab788fff]
    -[    0.010551] PM: hibernation: Registered nosave memory: [mem 0xab789000-0xab7d9fff]
    -[    0.010551] PM: hibernation: Registered nosave memory: [mem 0xab7da000-0xab98dfff]
    -[    0.010552] PM: hibernation: Registered nosave memory: [mem 0xab98e000-0xad579fff]
    -[    0.010552] PM: hibernation: Registered nosave memory: [mem 0xad57a000-0xad5fefff]
    -[    0.010553] PM: hibernation: Registered nosave memory: [mem 0xae000000-0xafffffff]
    -[    0.010554] PM: hibernation: Registered nosave memory: [mem 0xb0000000-0xefffffff]
    -[    0.010554] PM: hibernation: Registered nosave memory: [mem 0xf0000000-0xf7ffffff]
    -[    0.010554] PM: hibernation: Registered nosave memory: [mem 0xf8000000-0xfcffffff]
    -[    0.010555] PM: hibernation: Registered nosave memory: [mem 0xfd000000-0xffffffff]
    
  • some different values in an EFI report:

    -[    0.000000] efi: ACPI=0xab977000 ACPI 2.0=0xab977014 TPMFinalLog=0xab946000 SMBIOS=0xad429000 SMBIOS 3.0=0xad428000 MEMATTR=0xa67cc118 ESRT=0xa8c45c18 MOKvar=0xa5bd0000 
    +[    0.000000] efi: ACPI=0xab977000 ACPI 2.0=0xab977014 TPMFinalLog=0xab946000 SMBIOS=0xad429000 SMBIOS 3.0=0xad428000 MEMATTR=0xa6429698 ESRT=0xa8c2d018 
    
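  • Secure Boot state is reported differently: Debian cannot determine it but still loads its Secure Boot certificates, while Qubes reports it as disabled: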

    -[    0.000000] secureboot: Secure boot could not be determined (mode 0)
    +[    0.251353] Secure boot disabled
    -[    1.098378] Loaded X.509 cert 'Debian Secure Boot CA: 6ccece7e4c6c0d1f6149f3dd27dfcc5cbb419ea1'
    -[    1.098397] Loaded X.509 cert 'Debian Secure Boot Signer 2021 - linux: 4b6ef5abca669825178e052c84667ccbc0531f8c'
    
  • Qubes has MTRR disabled, impacting PAT configuration:

    -[    0.000136] MTRR default type: uncachable
    -[    0.000136] MTRR fixed ranges enabled:
    -[    0.000137]   00000-9FFFF write-back
    -[    0.000138]   A0000-DFFFF uncachable
    -[    0.000138]   E0000-FFFFF write-protect
    -[    0.000139] MTRR variable ranges enabled:
    -[    0.000140]   0 base 000000000000 mask FFFF80000000 write-back
    -[    0.000140]   1 base 000080000000 mask FFFFE0000000 write-back
    -[    0.000141]   2 base 0000A0000000 mask FFFFF0000000 write-back
    -[    0.000141]   3 disabled
    -[    0.000142]   4 disabled
    -[    0.000142]   5 disabled
    -[    0.000142]   6 disabled
    -[    0.000143]   7 disabled
    -[    0.000143] TOM2: 0000000450000000 aka 17664M
    -[    0.000337] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
    +[    0.025910] x86/PAT: MTRRs disabled, skipping PAT initialization too.
    +[    0.025913] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC  
    
  • PSP shows different initialization issues on the two platforms, but still appears to be used for firmware loading in both cases:

    -[    1.423015] [drm] add ip block number 3 <psp>
    -[    1.440204] [drm] PSP loading VCN firmware
    -[    2.372998] [drm] Loading DMUB firmware via PSP: version=0x00000000
    -[    2.373090] [drm] PSP loading VCN firmware
    +[    3.342535] ccp 0000:07:00.2: tee: ring init command failed (0x00000005)
    +[    3.343355] ccp 0000:07:00.2: tee: failed to init ring buffer
    +[    3.344155] ccp 0000:07:00.2: tee initialization failed
    +[    3.345388] ccp 0000:07:00.2: psp initialization failed
    +[    3.464296] [drm] add ip block number 3 <psp>
    +[    3.500352] [drm] Loading DMUB firmware via PSP: version=0x00000000
    +[    3.500456] [drm] PSP loading VCN firmware
    -[    6.399534] ccp 0000:07:00.2: enabling device (0000 -> 0002)
    -[    6.399659] ccp 0000:07:00.2: ccp: unable to access the device: you might be running a broken BIOS.
    -[    6.409802] ccp 0000:07:00.2: tee enabled
    -[    6.409805] ccp 0000:07:00.2: psp enabled
    
  • Debian shows direct-loading of many firmware blobs, while Qubes shows virtually none:

    -[    1.437206] amdgpu 0000:03:00.0: firmware: direct-loading firmware amdgpu/navi14_sos.bin
    -[    1.437279] amdgpu 0000:03:00.0: firmware: direct-loading firmware amdgpu/navi14_asd.bin
    -[    1.437305] amdgpu 0000:03:00.0: firmware: direct-loading firmware amdgpu/navi14_ta.bin
    -[    1.437395] amdgpu 0000:03:00.0: firmware: direct-loading firmware amdgpu/navi14_smc.bin
    -[    1.437623] amdgpu 0000:03:00.0: firmware: direct-loading firmware amdgpu/navi14_pfp.bin
    -[    1.437742] amdgpu 0000:03:00.0: firmware: direct-loading firmware amdgpu/navi14_me.bin
    -[    1.437839] amdgpu 0000:03:00.0: firmware: direct-loading firmware amdgpu/navi14_ce.bin
    -[    1.437869] amdgpu 0000:03:00.0: firmware: direct-loading firmware amdgpu/navi14_rlc.bin
    -[    1.437963] amdgpu 0000:03:00.0: firmware: direct-loading firmware   amdgpu/navi14_mec.bin
    -[    1.438065] amdgpu 0000:03:00.0: firmware: direct-loading firmware amdgpu/navi14_mec2.bin
    -[    1.439979] amdgpu 0000:03:00.0: firmware: direct-loading firmware amdgpu/navi14_sdma.bin
    -[    1.440007] amdgpu 0000:03:00.0: firmware: direct-loading firmware amdgpu/navi14_sdma1.bin
    -[    1.440198] amdgpu 0000:03:00.0: firmware: direct-loading firmware amdgpu/navi14_vcn.bin
    ...
    -[    2.370891] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/renoir_sdma.bin
    ...
    -[    2.371209] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/renoir_asd.bin
    -[    2.371223] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/renoir_ta.bin
    -[    2.371239] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/renoir_pfp.bin
    -[    2.371247] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/renoir_me.bin
    -[    2.371255] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/renoir_ce.bin
    -[    2.371267] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/renoir_rlc.bin
    -[    2.371320] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/renoir_mec.bin
    -[    2.371370] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/renoir_mec2.bin
    -[    2.372995] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/renoir_dmcub.bin
    -[    2.372998] [drm] Loading DMUB firmware via PSP: version=0x00000000
    -[    2.373086] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/renoir_vcn.bin
    ...
    -[    6.538034] platform regulatory.0: firmware: direct-loading firmware regulatory.db
    -[    6.565324] platform regulatory.0: firmware: direct-loading firmware regulatory.db.p7s
    +[   16.699469] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
    
  • IOMMU: the Qubes kernel simply believes there is no such feature available:

    -[    1.046733] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
    -[    1.046883] pci 0000:00:00.2: can't derive routing for PCI INT A
    -[    1.046884] pci 0000:00:00.2: PCI INT A: not connected
    -[    1.046919] pci 0000:00:01.0: Adding to iommu group 0
    ...
    -[    1.047326] pci 0000:08:00.1: Adding to iommu group 6
    -[    1.048767] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
    -[    1.048769] pci 0000:00:00.2: AMD-Vi: Extended features (0x206d73ef22254ade):
    -[    1.048770]  PPR X2APIC NX GT IA GA PC GA_vAPIC
    -[    1.048772] AMD-Vi: Interrupt remapping enabled
    -[    1.048772] AMD-Vi: Virtual APIC enabled
    -[    1.048772] AMD-Vi: X2APIC enabled
    -[    1.049006] AMD-Vi: Lazy IO/TLB flushing enabled
    ...
    -[    1.052192] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
    ...
    -[    1.064867] AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel <jroedel@suse.de>
    +[    3.347333] AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel <jroedel@suse.de>
    +[    3.348102] AMD-Vi: AMD IOMMUv2 functionality not available on this system
    

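To make the MTRR bullet above easier to read, here is a minimal Python sketch that decodes the "MTRR variable ranges" lines into address ranges. It assumes the timestamp has already been removed, that each mask is contiguous, and that the number of hex digits printed in the mask gives the physical address width; `decode_mtrr` is a name made up for this example.

```python
import re

# Matches e.g. "  0 base 000000000000 mask FFFF80000000 write-back"
LINE = re.compile(r"^\s*(\d+) base ([0-9A-Fa-f]+) mask ([0-9A-Fa-f]+) (\S+)\s*$")


def decode_mtrr(line):
    """Return (index, first_address, last_address, type), or None for
    lines that do not describe an enabled range (e.g. "3 disabled")."""
    m = LINE.match(line)
    if m is None:
        return None
    index, base_hex, mask_hex, mem_type = m.groups()
    width = 4 * len(mask_hex)          # assumption: printed digits = address width
    base = int(base_hex, 16)
    mask = int(mask_hex, 16)
    # Assumption: contiguous mask, so the range size is the inverted mask + 1.
    size = (~mask & ((1 << width) - 1)) + 1
    return int(index), base, base + size - 1, mem_type


if __name__ == "__main__":
    for text in (
        "  0 base 000000000000 mask FFFF80000000 write-back",
        "  1 base 000080000000 mask FFFFE0000000 write-back",
        "  2 base 0000A0000000 mask FFFFF0000000 write-back",
        "  3 disabled",
    ):
        decoded = decode_mtrr(text)
        if decoded:
            idx, first, last, kind = decoded
            print(f"{idx}: {first:#x}-{last:#x} {kind}")
```

Under those assumptions, the three enabled ranges from the Debian log together cover 0x0-0xafffffff as write-back, which lines up with the e820 update that marks 0xb0000000-0xffffffff as reserved.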
There are many more diffs, but those probably give quite some food for thought already.
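For anyone wanting to repeat this kind of comparison, here is a minimal Python sketch of one way to cut down the noise: strip the kernel timestamps before diffing, so that identical messages logged at different times do not show up as changes. This is not necessarily how the diff above was produced; `strip_timestamps` and `dmesg_diff` are names made up for this example, and the inputs are assumed to be lists of lines saved from `dmesg` on each system.

```python
import difflib
import re

# Leading "[    0.000136] " style timestamp printed by the kernel
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_timestamps(lines):
    """Remove the leading timestamp (and trailing newline) from each line."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def dmesg_diff(debian_lines, qubes_lines):
    """Unified diff of two dmesg captures, ignoring timestamps.
    '-' lines come from Debian, '+' lines from Qubes."""
    a = strip_timestamps(debian_lines)
    b = strip_timestamps(qubes_lines)
    return list(difflib.unified_diff(a, b, "debian", "qubes", lineterm=""))


if __name__ == "__main__":
    debian = ["[    0.000136] MTRR default type: uncachable\n",
              "[    1.064867] AMD-Vi: AMD IOMMUv2 driver\n"]
    qubes = ["[    3.347333] AMD-Vi: AMD IOMMUv2 driver\n",
             "[    3.348102] AMD-Vi: AMD IOMMUv2 functionality not available on this system\n"]
    print("\n".join(dmesg_diff(debian, qubes)))
```

Messages that are merely reordered between the two boots will still appear as differences, so the output needs the same manual reading as above.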

cpuinfo

  • the “power management” field is empty on Qubes, and on Debian has ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
  • in cpu flags, Qubes gets hypervisor tsc_known_freq, likely from Xen, but loses vme pse sep mtrr pge pse36 pdpe1gb aperfmperf monitor svm extapic cr8_legacy osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme mba sev ibrs stibp sev_es smep cqm rdt_a smap xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local irperf rdpru wbnoinvd npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov succor smca

There is probably a link between some of the behaviours observed in the log and some of the missing flags. E.g., is it normal that we don’t see the mtrr flag here?
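The flag comparison above can be reproduced with a few lines of Python. This is only a sketch: it assumes the contents of `/proc/cpuinfo` from each system are available as strings, it only looks at the first CPU entry, and `cpu_flags` and `compare_flags` are names made up for this example.

```python
def cpu_flags(cpuinfo_text, field="flags"):
    """Return the set of words in the first occurrence of `field`
    (e.g. "flags" or "power management") in a /proc/cpuinfo dump."""
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == field:
            return set(value.split())
    return set()


def compare_flags(debian_text, qubes_text, field="flags"):
    """Return (only_on_debian, only_on_qubes) as sorted lists."""
    debian = cpu_flags(debian_text, field)
    qubes = cpu_flags(qubes_text, field)
    return sorted(debian - qubes), sorted(qubes - debian)


if __name__ == "__main__":
    # Shortened, made-up inputs just to show the call; real dumps are much longer.
    debian = "processor\t: 0\nflags\t\t: fpu vme mtrr svm\n"
    qubes = "processor\t: 0\nflags\t\t: fpu hypervisor tsc_known_freq\n"
    lost, gained = compare_flags(debian, qubes)
    print("lost on Qubes:", " ".join(lost))
    print("gained on Qubes:", " ".join(gained))
```

Passing `field="power management"` compares that line in the same way.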

Thinking a bit about that… it would seem to be the hypervisor’s job to handle IOMMU, and in fact, /var/log/xen/console/hypervisor.log does show:

[2021-09-11 17:04:58] (XEN) AMD-Vi: IOMMU 0 Enabled.
[2021-09-11 17:04:58] (XEN) I/O virtualisation enabled
[2021-09-11 17:04:58] (XEN)  - Dom0 mode: Relaxed

So if the IOMMU is already managed by the hypervisor, what’s taking place in dom0 that needs to have IOMMU drivers there too, @marmarek?

It is Xen’s task to manage the IOMMU, not dom0’s, so it is correct that the dom0 kernel reports it as unavailable.

Great, but then do we need the IOMMU drivers in the dom0 kernel?

Not necessarily, but it also doesn’t hurt. And not having them would harm those working (or wanting to work) on a KVM port.