I have a question: what is the recommended way to prevent FLRs from being sent during early boot to USB controllers? (Emphasis because USB controllers are treated differently at boot when using sys-usb; details below.)
I should add: this is Qubes 4.2, with a functional USB keyboard and USB boot on a different controller, on a different bus and IOMMU group. They’re not the cause, but their presence complicates some solutions. 3 of 5 USB controllers work fine in sys-usb. And spoiler: I have a workaround, inspired by this proxmox post. Also, as far as I could tell, the issue with the controllers is not caused by IOMMU groups; I checked into that already.
The Problem
I have two misbehaving onboard PCI USB controllers that are not amenable to FLR, despite advertising the capability
Once FLR is received, the controller goes into a bad state that can’t be recovered from without a host reboot. Unbind, reset, assigning to pci-stub, pciback, etc. didn’t seem to help get it back into a working state
Maybe a bus-level reset would work, but I’m not sure of that, and would like to avoid it. There’s no need for a reset in my case: I’m not worried about the window of time between boot and assignment to pciback/xhci_pci and startup of sys-usb.
Ditto for a reset via the bridge using setpci, as mentioned here.
Sending FLR to try to restore functionality obviously doesn’t help, and though I can bind it to the pciback driver when in the bad state, lspci shows output suggestive of a corrupt state. More importantly, starting a VM with it attached causes libvirt to choke when it sees the data that I see plainly in lspci.
I can add the exact error from libvirt and the output from lspci later, though I don’t know that I want to go back down the rabbit-hole of finding and solving the low-level quirks at the root of the issue.
Edit: The error is the same as mentioned here, invalid PCI header ‘127’
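For anyone wanting to check for the same bad state without starting a VM, here is a minimal sketch. It assumes the ‘127’ is the header type byte (config offset 0x0e) read back as all-ones (0xff) with the multi-function bit masked off, which is what a function that has stopped answering config reads typically looks like. The function names and the SYSFS_PCI variable are made up for illustration.

```shell
# Sketch: report the PCI header type of a function by reading byte 0x0e
# of its config space through sysfs.
# SYSFS_PCI is a hypothetical override so the functions can be pointed at
# a test tree; by default the real sysfs path is used.
pci_header_type() {
    bdf=$1
    cfg="${SYSFS_PCI:-/sys/bus/pci/devices}/$bdf/config"
    [ -r "$cfg" ] || return 2
    # Offset 14 (0x0e) holds the header type; bit 7 is the multi-function
    # flag, so it is masked off before printing.
    raw=$(od -An -tu1 -j14 -N1 "$cfg" | tr -d ' \n')
    [ -n "$raw" ] || return 2
    echo $((raw & 127))
}

# Succeeds when the header type is 127, i.e. what is left of an all-ones
# read after masking. This is a heuristic, not a definitive diagnosis.
pci_looks_dead() {
    [ "$(pci_header_type "$1")" = "127" ]
}
```

Usage would be something like pci_looks_dead 0000:$BDF1 && echo "controller is in the bad state", run as root in dom0.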
I spent many, many hours reading Qubes and Xen documentation the past 2 weeks and can’t do it anymore
Not The Solution
The solution is not no_strict_reset, because the problem manifests in the initramfs stage, during the time that Qubes is assigning USB controllers to the pciback driver. As far as I can tell, the sysfs unbind operation (or maybe the operation to bind it to a different driver) implicitly causes the FLRs to be sent. And that is, as Joanna would say, “game over” for that controller in this case.
The Solution / Workaround
If there’s one good thing to come of this frustrating experience, it’s that I learned exactly how the Qubes PCI hiding works
It boils down to a simple shell script interacting with sysfs (in initramfs, as implied by the “rd” in “rd.qubes.hide”).
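For readers who haven’t seen this kind of script, here is a generic sketch of the unbind / new_slot / bind sequence normally used to hand a PCI function to pciback through sysfs. It is not a copy of the Qubes script; the function name and the SYSFS_BUS variable are made up for illustration, and the real script may differ in details.

```shell
# Sketch of a generic "hide a PCI function behind pciback" sequence.
# SYSFS_BUS is a hypothetical override for dry runs against a test tree;
# by default the real /sys/bus/pci is used.
hide_pci_function() {
    bdf=$1
    bus="${SYSFS_BUS:-/sys/bus/pci}"
    # 1. Detach whatever driver currently owns the function, if any.
    if [ -e "$bus/devices/$bdf/driver" ]; then
        printf '%s' "$bdf" > "$bus/devices/$bdf/driver/unbind"
    fi
    # 2. Tell pciback it may claim this slot, then bind the function to it.
    printf '%s' "$bdf" > "$bus/drivers/pciback/new_slot"
    printf '%s' "$bdf" > "$bus/drivers/pciback/bind"
}
```

Somewhere in that sequence the kernel ends up resetting the function, which is the step the misbehaving controllers don’t survive.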
With the following modification to that script, the kernel is led to believe that the function doesn’t have FLR (or any) reset mechanisms
# Fool the kernel into thinking there are
# no reset mechanisms available for
# the specified BDFs, to prevent implicit
# FLR requests from being sent during
# bind/unbind to/from a driver via sysfs
echo "" > /sys/bus/pci/devices/0000:$BDF1/reset_method
echo "" > /sys/bus/pci/devices/0000:$BDF2/reset_method
With those two lines placed before any sysfs operations occur, the problematic devices are successfully given to the pciback driver via the sysfs operations on each BDF without causing an FLR. The boot completes as normal, and the devices can be handed over to sys-usb and the peasants rejoice.
Edit: you also need a dracut -f after the changes, to rebuild the initramfs with the modifications made.
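The same workaround can be written as a small helper that takes any number of BDFs and complains when the attribute is missing. This is only a sketch: the function name and the SYSFS_PCI variable are made up, and it assumes a kernel new enough to expose reset_method in sysfs, as the one I’m running does.

```shell
# Sketch: disable all kernel reset methods for a list of PCI functions.
# SYSFS_PCI is a hypothetical override used for dry runs; on a real system
# it defaults to /sys/bus/pci/devices.
disable_pci_resets() {
    base="${SYSFS_PCI:-/sys/bus/pci/devices}"
    rc=0
    for bdf in "$@"; do
        attr="$base/$bdf/reset_method"
        if [ -w "$attr" ]; then
            # An empty write clears the list of reset methods the kernel
            # is allowed to use for this function.
            echo "" > "$attr" || rc=1
        else
            echo "no writable reset_method for $bdf" >&2
            rc=1
        fi
    done
    return $rc
}
```

It would be called as disable_pci_resets 0000:$BDF1 0000:$BDF2, before the first unbind or bind. Reading reset_method back afterwards should show an empty list.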
It would be nice if there was a rd.qubes.no-_flr=bdf1[,bdf2] option that made this cleaner out of the box, and by its existence documented this problem as “a thing”. I’m not going to send a PR with that until I’m sure there’s not some simple solution that I simply failed to find or use correctly.
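To make the idea concrete, here is a rough sketch of how such an option could be pulled out of the kernel command line. Everything here is hypothetical: Qubes does not ship this option, it is spelled rd.qubes.no_flr purely for illustration, and the function name is made up.

```shell
# Sketch: extract BDFs from a hypothetical rd.qubes.no_flr=bdf1[,bdf2]
# kernel command line option. The option does not exist in Qubes.
no_flr_bdfs() {
    # $1: kernel command line text, for example the contents of /proc/cmdline.
    # Left unquoted on purpose so the shell splits it into words.
    for arg in $1; do
        case "$arg" in
            rd.qubes.no_flr=*)
                # Strip the key, then print one BDF per line.
                echo "${arg#rd.qubes.no_flr=}" | tr ',' '\n'
                ;;
        esac
    done
}
```

The hiding script could then loop over no_flr_bdfs "$(cat /proc/cmdline)" and clear reset_method for each entry before touching any driver bindings.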
As I mentioned, I spent a significant amount of time trying to find “proper” solutions, that didn’t involve modifying the pciback script - mostly in the form of Xen or kernel command line options. I didn’t have any success with any of them
I have a few other general ideas about how to solve this, but I suspect someone here can immediately give me the best way to do so without much thought to it
Forcing these controllers to pciback or pci-stub by BDF, before the referenced pciback script runs, may be what I want?
I can’t simply blacklist the driver that claims these (xhci_pci) because one of my USB controllers needs to be claimed by xhci_pci to operate properly. Normally I use udev for things related to driver timing and conflicts, but udev is too late for this case.
I considered adding the problematic controllers to rd.qubes.dom0_usb, but I’m not sure that will actually help. I am burned out and need to read the script again.
tl;dr: as initially stated, what’s the best way to “protect” buggy USB controllers from FLRs caused by the Qubes pciback initramfs script? There should be a clean solution offered by Qubes, in my opinion. The workaround is good for now; otherwise, I give up.
EDIT: For those curious as to what controllers these may be, to work towards the true root cause (the hardware issue itself) - they’re AMD controllers on WRX90 chipset. I suspect the issue has something to do with the onboard IPMI/BMC. I’ve tried hardware toggling and software toggling (via UEFI) both the BMC functionality in its entirety and the onboard VGA device associated with it but it hasn’t helped the controllers to survive FLRs. I’m happy to do specific things suggested by users but I don’t have time to research further, especially as reboots are expensive time-wise, and toggling via hardware or double-checking IOMMU groups is also expensive