Preventing implicit FLR when using sys-usb + USB keyboard

What is the recommended way to prevent FLRs from being sent to USB controllers during early boot? (I stress early boot because USB controllers are treated differently at boot when using sys-usb; details below)

I should add: this is Qubes 4.2, with a functional USB keyboard and USB boot on a different controller, on a different bus and IOMMU group. They’re not the cause, but their presence complicates some solutions. 3 of 5 USB controllers work fine in sys-usb. And spoiler: I have a workaround, inspired by this Proxmox post. Also, as far as I could tell, the issue with the controllers is not caused by IOMMU groups; I already checked into that

The Problem

I have two misbehaving onboard PCI USB controllers that are not amenable to FLR, despite advertising the capability

Once FLR is received, the controller goes into a bad state that can’t be recovered from without a host reboot. Unbind, reset, assigning to pci-stub, pciback, etc. didn’t seem to help get it back into a working state

Maybe a bus-level reset would work, but I’m not sure of that, and would like to avoid it. There’s no need for a reset in my case; I’m not worried about the window of time between boot and assignment to pciback/xhci_pci and startup of sys-usb

Ditto with reset via the bridge using setpci, as mentioned here

Sending FLR to try to restore functionality obviously doesn’t help, and though I can bind it to the pciback driver when in the bad state, lspci shows output suggestive of a corrupt state. More importantly, starting a VM with it attached causes libvirt to choke when it sees the data that I see plainly in lspci

I can add the exact error later from libvirt and the output from lspci, though I don’t know that I want to go back down the rabbit-hole of finding and solving the low-level quirks at the root of the issue

Edit: The error is the same as mentioned here, invalid PCI header ‘127’
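A header type of 127 is what you would get if the header-type byte reads back as 0xFF (0xFF masked with 0x7F), which usually suggests the function has stopped answering config-space reads. That interpretation is an assumption about this hardware, not something confirmed in this thread. A minimal sketch for checking it from dom0, where `pci_header_type` is a hypothetical helper name:

```shell
# Hypothetical helper: print the PCI header type of a device, given the path
# to its sysfs "config" file. The header type lives at config-space offset
# 0x0e; bit 7 is the multi-function flag, so only the low 7 bits are kept.
pci_header_type() {
    cfg="$1"
    # od: skip 14 bytes, read 1 byte, print it as an unsigned decimal
    raw=$(od -An -tu1 -j14 -N1 "$cfg" 2>/dev/null | tr -d ' \n')
    [ -n "$raw" ] || return 1
    echo $(( raw & 127 ))
}
```

Usage would be something like `pci_header_type /sys/bus/pci/devices/0000:$BDF1/config`. A healthy endpoint normally reports 0 here; 127 would match the libvirt complaint.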

I spent many, many hours reading Qubes and Xen documentation the past 2 weeks and can’t do it anymore

Not The Solution

The solution is not no-strict-reset, because the problem manifests in the initramfs stage, while Qubes is assigning USB controllers to the pciback driver. As far as I can tell, the sysfs unbind operation (or maybe the operation to bind it to a different driver) implicitly causes the FLRs to be sent. And that is, as Joanna would say, “game over” for that controller in this case

The Solution / Workaround

If there’s one good thing to come of this frustrating experience, it’s that I learned exactly how the Qubes PCI hiding works

It boils down to a simple shell script interacting with sysfs (in initramfs, as implied by the “rd” in “rd.qubes.hide”):

With the following modification to that script, the kernel is led to believe that the function doesn’t have FLR (or any) reset mechanisms

# Fool the kernel into thinking there are
# no reset mechanisms available for
# the specified BDFs, to prevent implicit
# FLR requests from being sent during
# bind/unbind to/from a driver via sysfs
echo "" > /sys/bus/pci/devices/0000:$BDF1/reset_method
echo "" > /sys/bus/pci/devices/0000:$BDF2/reset_method

With those two lines placed before any sysfs operations occur, the problematic devices are successfully given to the pciback driver via the sysfs operations on each BDF without causing an FLR. The boot completes as normal, and the devices can be handed over to sys-usb and the peasants rejoice
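The same idea can be wrapped in a small function so that a missing or read-only `reset_method` attribute is reported instead of silently ignored. This is only a sketch: `disable_pci_reset` and the `SYSFS_PCI` override are hypothetical names, not part of the Qubes script, and it assumes the kernel exposes `reset_method` for the device.

```shell
# Hypothetical helper, not part of the Qubes pciback script.
# Clears reset_method for each BDF given as an argument (e.g. 03:00.3).
# SYSFS_PCI is an assumed override so the function can be tried against
# a scratch directory instead of the real sysfs tree.
disable_pci_reset() {
    root="${SYSFS_PCI:-/sys/bus/pci/devices}"
    rc=0
    for bdf in "$@"; do
        attr="$root/0000:$bdf/reset_method"
        if [ -w "$attr" ]; then
            # An empty write tells the kernel to use no reset mechanism
            echo "" > "$attr"
        else
            echo "no writable reset_method for $bdf" >&2
            rc=1
        fi
    done
    return $rc
}
```

Calling `disable_pci_reset "$BDF1" "$BDF2"` before the first bind/unbind would be equivalent to the two echo lines above. Running `cat` on the same attribute afterwards should show an empty line if the write took effect.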

Edit: you also need a dracut -f after the changes, to rebuild the initramfs with the modifications made

It would be nice if there was an rd.qubes.no_flr=bdf1[,bdf2]… option that made this cleaner out of the box, and by its existence documented this problem as “a thing”. I’m not going to send a PR with that until I’m sure there’s not some simple solution that I simply failed to find or use correctly

As I mentioned, I spent a significant amount of time trying to find “proper” solutions, that didn’t involve modifying the pciback script - mostly in the form of Xen or kernel command line options. I didn’t have any success with any of them

I have a few other general ideas about how to solve this, but I suspect someone here can immediately give me the best way to do so without much thought to it

Forcing these controllers to pciback or pci-stub by BDF, before the referenced pciback script runs, may be what I want?

I can’t simply blacklist the driver that claims these (xhci_pci) because one of my USB controllers needs to be claimed by xhci_pci to operate properly. Normally I use udev for things related to driver timing and conflicts, but udev is too late for this case

I considered adding the problematic controllers to rd.qubes.dom0_usb, but I’m not sure that will actually help. I am burned out and need to read the script again

tl;dr: as initially stated, what’s the best way to “protect” buggy USB controllers from FLRs caused by the Qubes pciback initramfs script? There should be a clean solution offered by Qubes in my opinion. The workaround is good for now, otherwise, I give up :disappointed:

EDIT: For those curious as to what controllers these may be, to work towards the true root cause (the hardware issue itself) - they’re AMD controllers on WRX90 chipset. I suspect the issue has something to do with the onboard IPMI/BMC. I’ve tried hardware toggling and software toggling (via UEFI) both the BMC functionality in its entirety and the onboard VGA device associated with it but it hasn’t helped the controllers to survive FLRs. I’m happy to do specific things suggested by users but I don’t have time to research further, especially as reboots are expensive time-wise, and toggling via hardware or double-checking IOMMU groups is also expensive

To elaborate on why I believe it might be reasonable for Qubes to offer something to accommodate this…

I understand that FLR should not break a controller, especially if it advertises FLR (these controllers do)

However, we have no-strict-reset for qvm-pci which, while technically mapping to existing Xen features, is deliberately exposed via qvm-pci and documented by Qubes. I consider it a Qubes feature offered to users to work around situations similar to this one

It seems that there should be an additional command-line parameter, rd.qubes.pci_no_flr=bdf1[,bdf2]…, that could be handled either in the same pciback script I modified or in a separate script invoked prior to that script
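To make the proposal concrete, here is a rough sketch of how such a parameter could be pulled out of the kernel command line. `rd.qubes.pci_no_flr` does not exist in Qubes today; it is the hypothetical name proposed above, and `parse_no_flr` is likewise made up for illustration.

```shell
# Hypothetical parser for the proposed (non-existent) rd.qubes.pci_no_flr
# option. Takes the kernel command line as one string and prints one BDF
# per line. The body runs in a subshell so that "set -f" (no globbing
# while splitting the command line into words) does not leak to the caller.
parse_no_flr() (
    set -f
    for word in $1; do
        case "$word" in
            rd.qubes.pci_no_flr=*)
                echo "${word#rd.qubes.pci_no_flr=}" | tr ',' '\n'
                ;;
        esac
    done
)
```

A hook could then feed `parse_no_flr "$(cat /proc/cmdline)"` into whatever clears reset_method for each BDF. Inside a real dracut hook, the getarg helper from dracut-lib.sh would probably be the more idiomatic way to read the option.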

I don’t have a GitHub account so I won’t be creating an issue. Regardless, I would like to wait and see what other fixes may be available as an alternative to changes to Qubes. If there’s nothing better than the “solution” I used, maybe some kind soul could create an issue and a PR

Maybe you can use softdep like this:

Blacklist xhci_hcd with modprobe.blacklist=xhci_hcd and add /etc/modprobe.d/01-pciback.conf:

softdep xhci_hcd pre: pciback
options pciback ids=VID:PID

Where VID:PID is a VID:PID of your USB controller that you want to hide.

But I’m not sure if it’ll work for pciback or it’s specific to vfio-pci.
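For anyone trying this, the VID:PID pair can be read straight from sysfs for a given BDF. A small sketch, where `pci_ids_for` and the `SYSFS_PCI` override are hypothetical names used only for illustration:

```shell
# Hypothetical helper: print "vendor:device" (hex, without the 0x prefix)
# for a BDF such as 03:00.3, in the form an ids= module option expects.
# SYSFS_PCI is an assumed override for trying this outside real sysfs.
pci_ids_for() {
    dev="${SYSFS_PCI:-/sys/bus/pci/devices}/0000:$1"
    [ -r "$dev/vendor" ] && [ -r "$dev/device" ] || return 1
    vendor=$(cat "$dev/vendor")
    device=$(cat "$dev/device")
    # sysfs reports values like 0x1022; strip the leading 0x
    echo "${vendor#0x}:${device#0x}"
}
```

One caveat worth keeping in mind: matching by ID would presumably claim every controller with the same vendor and device ID, not just the misbehaving ones, so it only helps if the problematic controllers have IDs distinct from the ones that must stay on xhci_pci.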

I will give it a shot, thank you for reading my lengthy post!

I knew there were a lot more options/directives/parameters supported in the modprobe configuration, I ought to grok through the docs (or source) at some point, it seems

EDIT: I’m not sure this will actually prevent Qubes pciback script from resetting the device (because it uses lspci) but it’s something I was interested in figuring out how to do with modprobe configs, so it’s a win either way

Looks like the pciback module only supports a single option, which is “permissive” (and not too useful, it seems)

However, I think you’re on the right track with investigating lesser used modprobe features

There goes my afternoon!

Thanks again

It looks like pci-stub, however, does support the ids parameter

I know Qubes has pci-stub but I’m not certain if it’s in module form. I have used it via sysfs but not via modprobe

Thinking about it now, I’m wondering…

If a device is claimed by pciback (or pci-stub), what happens for an unbind/remove?

I think only the kernel source knows this for sure, but if those two modules don’t cause FLRs when claiming or releasing a device, then I think what you suggested (with pci-stub in place of pciback) may do the trick, even if Qubes insists on unbinding devices that are already seized by pciback
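When running that experiment, it helps to confirm which driver actually holds each controller before and after the Qubes script runs. A minimal sketch, with `pci_current_driver` and the `SYSFS_PCI` override being hypothetical names for illustration:

```shell
# Hypothetical helper: print the name of the driver currently bound to a
# BDF (e.g. pciback, pci-stub, xhci_hcd), or "none" if it is unbound.
# SYSFS_PCI is an assumed override for trying this outside real sysfs.
pci_current_driver() {
    link="${SYSFS_PCI:-/sys/bus/pci/devices}/0000:$1/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink to the bound driver's directory
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

Checking `pci_current_driver "$BDF1"` together with the device's reset_method attribute and lspci output after each step should show whether a given bind or unbind left the controller in the bad state. It does not by itself prove whether an FLR was issued.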

Only one way to find out