Preventing implicit FLR when using sys-usb + USB keyboard

My question: what is the recommended way to prevent FLRs (Function Level Resets) from being sent to USB controllers during early boot? (I stress early boot because USB controllers are treated differently at boot when using sys-usb; details below.)

Some context: this is Qubes 4.2, with a functional USB keyboard and USB boot on a different controller, on a different bus and in a different IOMMU group. They’re not the cause, but their presence complicates some solutions. 3 of 5 USB controllers work fine in sys-usb. Spoiler: I have a workaround, inspired by this proxmox post. Also, as far as I could tell, the issue with the controllers is not caused by IOMMU grouping; I checked into that already.

The Problem

I have two misbehaving onboard PCI USB controllers that are not amenable to FLR, despite advertising the capability
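For anyone wanting to check the same thing on their own hardware, here is a minimal sketch for seeing which reset mechanisms the kernel intends to use for a PCI function. It assumes a kernel new enough to expose the reset_method sysfs attribute (the same one used in the workaround further down). The function name and the SYSFS_PCI override are made up for illustration, and 03:00.3 is a placeholder address.

```shell
# Sketch: report which reset mechanisms the kernel will use for a PCI function.
# SYSFS_PCI is an illustrative override so the function can be pointed at a
# fake tree; on a real system leave it unset.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

reset_methods() {
    # $1 is a full address such as 0000:03:00.3 (domain:bus:device.function)
    f="$SYSFS_PCI/$1/reset_method"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "no reset_method attribute for $1" >&2
        return 1
    fi
}

# Example (placeholder address; run in dom0):
#   reset_methods 0000:03:00.3
#   sudo lspci -vv -s 03:00.3 | grep -o 'FLReset[+-]'
```

If the attribute lists flr and lspci reports FLReset+, the function advertises FLR, which by itself says nothing about whether it survives one.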

Once FLR is received, the controller goes into a bad state that can’t be recovered from without a host reboot. Unbind, reset, assigning to pci-stub, pciback, etc. didn’t seem to help get it back into a working state

Maybe a bus-level reset would work, but I’m not sure of that, and would like to avoid it. There’s no need for a reset in my case: I’m not worried about the window of time between boot, assignment to pciback/xhci_pci, and startup of sys-usb

Ditto with reset via the bridge using setpci, as mentioned here

Sending FLR to try to restore functionality obviously doesn’t help, and though I can bind it to the pciback driver when in the bad state, lspci shows output suggestive of a corrupt state. More importantly, starting a VM with it attached causes libvirt to choke when it sees the data that I see plainly in lspci

I can add the exact error later from libvirt and the output from lspci, though I don’t know that I want to go back down the rabbit-hole of finding and solving the low-level quirks at the root of the issue

Edit: The error is the same as mentioned here, invalid PCI header ‘127’

I spent many, many hours reading Qubes and Xen documentation the past 2 weeks and can’t do it anymore

Not The Solution

The solution is not no_strict_reset, because the problem manifests in the initramfs stage, during the time that Qubes is assigning USB controllers to the pciback driver. As far as I can tell, the sysfs unbind operation (or maybe the operation to bind it to a different driver) implicitly causes the FLRs to be sent. And that is, as Joanna would say, “game over” for that controller in this case

The Solution / Workaround

If there’s one good thing to come of this frustrating experience, it’s that I learned exactly how the Qubes PCI hiding works

It boils down to a simple shell script interacting with sysfs (in the initramfs, as implied by the “rd” in “rd.qubes.hide_pci”):
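As a rough illustration of what "interacting with sysfs" means here, this is a hedged sketch of the generic unbind/new_slot/bind pattern for handing one function to pciback. It is not the verbatim Qubes script; the function name and the SYSFS override are hypothetical and exist only so the sketch can be dry-run against a fake directory tree.

```shell
# Sketch of the generic sysfs pattern, NOT the actual Qubes script.
# SYSFS is an illustrative override for dry runs; leave it unset on a real system.
SYSFS="${SYSFS:-/sys}"

hide_device() {
    bdf="0000:$1"                       # e.g. 03:00.3 -> 0000:03:00.3
    dev="$SYSFS/bus/pci/devices/$bdf"
    drv="$SYSFS/bus/pci/drivers/pciback"
    # Detach whatever driver currently owns the function, if any.
    if [ -e "$dev/driver" ]; then
        printf '%s' "$bdf" > "$dev/driver/unbind"
    fi
    # Tell pciback it may claim this slot, then bind it.
    printf '%s' "$bdf" > "$drv/new_slot"
    printf '%s' "$bdf" > "$drv/bind"
}
```

Nothing in these writes asks for a reset explicitly, which is why any FLR that happens at this stage is implicit.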

With the following modification to that script, the kernel is led to believe that the function doesn’t have FLR (or any) reset mechanisms

# Fool the kernel into thinking there are
# no reset mechanisms available for
# the specified BDFs, to prevent implicit
# FLR requests from being sent during
# bind/unbind to/from a driver via sysfs
echo "" > /sys/bus/pci/devices/0000:$BDF1/reset_method
echo "" > /sys/bus/pci/devices/0000:$BDF2/reset_method

With those two lines placed before any sysfs operations occur, the problematic devices are successfully given to the pciback driver via the sysfs operations on each BDF without causing an FLR. The boot completes as normal, and the devices can be handed over to sys-usb and the peasants rejoice

Edit: you also need a dracut -f after the changes, to rebuild the initramfs with the modifications made

It would be nice if there were an rd.qubes.no_flr=bdf1[,bdf2]… option that made this cleaner out of the box and, by its existence, documented this problem as “a thing”. I’m not going to send a PR with that until I’m sure there isn’t some simple solution that I failed to find or use correctly

As I mentioned, I spent a significant amount of time trying to find “proper” solutions, that didn’t involve modifying the pciback script - mostly in the form of Xen or kernel command line options. I didn’t have any success with any of them

I have a few other general ideas about how to solve this, but I suspect someone here can immediately give me the best way to do so without much thought to it

Forcing these controllers to pciback or pci-stub by BDF, before the referenced pciback script runs, may be what I want?

I can’t simply blacklist the driver that claims these (xhci_pci) because one of my USB controllers needs to be claimed by xhci_pci to operate properly. Normally I use udev for things related to driver timing and conflicts, but udev is too late for this case

I considered adding the problematic controllers to rd.qubes.dom0_usb, but I’m not sure that will actually help. I am burned out and need to read the script again

tl;dr: as initially stated, what’s the best way to “protect” buggy USB controllers from FLRs caused by the Qubes pciback initramfs script? In my opinion there should be a clean solution offered by Qubes. The workaround is good for now; otherwise, I give up :disappointed:

EDIT: For those curious as to what controllers these may be, to work towards the true root cause (the hardware issue itself) - they’re AMD controllers on WRX90 chipset. I suspect the issue has something to do with the onboard IPMI/BMC. I’ve tried hardware toggling and software toggling (via UEFI) both the BMC functionality in its entirety and the onboard VGA device associated with it but it hasn’t helped the controllers to survive FLRs. I’m happy to do specific things suggested by users but I don’t have time to research further, especially as reboots are expensive time-wise, and toggling via hardware or double-checking IOMMU groups is also expensive

To elaborate on why I believe it might be reasonable for Qubes to offer something to accommodate this…

I understand that FLR should not break a controller, especially if it advertises FLR (these controllers do)

However, we have no-strict-reset for qvm-pci which, while technically mapping to existing Xen features, is deliberately exposed via qvm-pci and documented by Qubes. I consider it a Qubes feature offered to users to work around situations similar to this one

It seems that there should be an additional command-line parameter, rd.qubes.pci_no_flr=bdf1[,bdf2]…, that could be handled either in the same pciback script I modified or in a separate script invoked prior to that script

I don’t have a GitHub account so I won’t be creating an issue. Regardless, I would like to wait and see what other fixes may be available as an alternative to changes to Qubes. If there’s nothing better than the “solution” I used, maybe some kind soul could create an issue and a PR

Maybe you can use softdep like this:

Blacklist xhci_hcd with modprobe.blacklist=xhci_hcd and add /etc/modprobe.d/01-pciback.conf:

softdep xhci_hcd pre: pciback
options pciback ids=VID:PID

Where VID:PID is a VID:PID of your USB controller that you want to hide.

But I’m not sure if it’ll work for pciback or it’s specific to vfio-pci.

I will give it a shot, thank you for reading my lengthy post!

I knew there were a lot more options/directives/parameters supported in the modprobe configuration, I ought to grok through the docs (or source) at some point, it seems

EDIT: I’m not sure this will actually prevent Qubes pciback script from resetting the device (because it uses lspci) but it’s something I was interested in figuring out how to do with modprobe configs, so it’s a win either way

Looks like the pciback module only supports a single option, which is “permissive” (and not too useful, it seems)

However, I think you’re on the right track with investigating lesser used modprobe features

There goes my afternoon!

Thanks again

It looks like pci-stub, however, does support the ids parameter

I know Qubes has pci-stub but I’m not certain if it’s in module form. I have used it via sysfs but not via modprobe

Thinking about it now, I’m wondering…

If a device is claimed by pciback (or pci-stub), what happens for an unbind/remove?

I think only the kernel source knows this for sure, but if those two modules don’t cause FLRs when claiming or releasing a device, then I think what you suggested (with pci-stub in place of pciback) may do the trick, even if Qubes insists on unbinding devices that are already seized by pciback

Only one way to find out

Forgot to update here

No luck doing this with pciback parameters. Sadly, unlike vfio, it isn’t as configurable at load time

What I ended up doing was modifying the aforementioned initramfs script to add a “proper” option that more cleanly facilitates what I hacked in

It uses the same syntax as the other qubes initramfs options that accept BDFs, so I didn’t have to deal with parsing

Using it looks like this

rd.qubes.pci_noflr=bdf1[,bdf2,…]

I’ll post the diff here in case anyone has a use for it, or would like to send a PR upstream. Having it upstream would be really great, I wouldn’t need to ensure that the changes aren’t overwritten by a dom0 update (happened once already, I need to look into how to add a dnf post-update hook, I’ve only done this with apt in Debian before)
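This is not a dnf hook, but as a stopgap a small check can at least reveal when an update has replaced the script. A minimal sketch, assuming the patched file is the dracut module script at the path given later in this thread; the function name and the PCIBACK_SCRIPT override are hypothetical.

```shell
# Path taken from later in this thread; adjust if your install differs.
PCIBACK_SCRIPT="${PCIBACK_SCRIPT:-/usr/lib/dracut/modules.d/90qubes-pciback/qubes-pciback.sh}"

noflr_patch_present() {
    # Succeeds if the given file (default: $PCIBACK_SCRIPT) still mentions
    # the custom option; fails if the file is stock or missing.
    grep -qF 'rd.qubes.pci_noflr' "${1:-$PCIBACK_SCRIPT}" 2>/dev/null
}

# Example, run by hand after a dom0 update:
#   noflr_patch_present || echo "pci_noflr patch missing: re-apply it and run dracut -f" >&2
```

Running this before rebooting after an update gives a chance to re-apply the patch and rebuild the initramfs while the controllers are still working.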

Thanks again for the help and discussion

… and to anyone who may come along with a less invasive way to do this, please post it here. I would love to be rid of this hack, even though it’s more cleanly implemented

It seems that, for USB, a patch to the pciback shell script may not have been necessary after all, though I haven’t yet tested this:


usbcore.quirks=
			[USB] A list of quirk entries to augment the built-in
			usb core quirk list. List entries are separated by
			commas. Each entry has the form
			VendorID:ProductID:Flags. The IDs are 4-digit hex
			numbers and Flags is a set of letters. Each letter
			will change the built-in quirk; setting it if it is
			clear and clearing it if it is set. The letters have
			the following meanings:
				a = USB_QUIRK_STRING_FETCH_255 (string
					descriptors must not be fetched using
					a 255-byte read);
				b = USB_QUIRK_RESET_RESUME (device can't resume
					correctly so reset it instead);
				c = USB_QUIRK_NO_SET_INTF (device can't handle
					Set-Interface requests);
				d = USB_QUIRK_CONFIG_INTF_STRINGS (device can't
					handle its Configuration or Interface
					strings);
				e = USB_QUIRK_RESET (device can't be reset
					(e.g morph devices), don't use reset);

The USB_QUIRK_RESET looks useful …

This is limited to USB devices, of course, so you can’t use it to prevent resets from being sent to other PCI devices. But it might have been good enough for my issue

Maybe I did work for nothing, but at least something was learned?

(posting from different account)

This is the patch I had hacked in. A few caveats:

  1. It doesn’t do any sanity checking on the values
  2. It doesn’t save and then restore the reset_method before/after qubes does its reset
  3. It’s not tested by anyone other than me, on my machine
  4. It will get blown away by some dom0 updates eventually
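On the second caveat, saving and restoring could look roughly like the sketch below. It assumes that reset_method accepts a previously read method list being written back, and that writing "default" re-enables the kernel's default ordering. The function name and the SYSFS_PCI override are hypothetical, and none of this has been tested on the affected hardware. Whether restoring is even desirable for a controller that cannot survive an FLR is a separate question.

```shell
# SYSFS_PCI is an illustrative override for dry runs; leave unset on a real system.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

with_reset_disabled() {
    # $1 = full address (e.g. 0000:03:00.3); remaining args = command to run
    # while resets are disabled for that function.
    attr="$SYSFS_PCI/$1/reset_method"
    shift
    [ -w "$attr" ] || return 1
    saved=$(cat "$attr")            # remember the current method list
    echo "" > "$attr"               # empty string disables all reset methods
    "$@"
    rc=$?
    echo "${saved:-default}" > "$attr"   # put the old list (or the default) back
    return $rc
}

# Example (placeholder address and command):
#   with_reset_disabled 0000:03:00.3 some_bind_command
```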

The file is at /usr/lib/dracut/modules.d/90qubes-pciback/qubes-pciback.sh

--- qubes-pciback.sh	2024-08-30 18:22:04.117005953 -0400
+++ qubes-pciback.sh.latest	2024-10-08 22:01:23.715828082 -0400
@@ -20,6 +20,20 @@
 HIDE_PCI=$(set -o pipefail; { lspci -mm -n | awk "/^[^ ]* \"$re/ {print \$1}";}) ||
     die 'Cannot obtain list of PCI devices to unbind.'
 
+# --- pci_noflr hack ---
+noflr_devs=$(getarg rd.qubes.pci_noflr)
+NOFLR_PCI="${noflr_devs//,/ }"
+for dev in $NOFLR_PCI; do
+  BDF=0000:$dev
+  if [ -f "/sys/bus/pci/devices/$BDF/reset_method" ]; then
+    warn "Disabling reset for non-conformant device @ $BDF ..."
+    echo "" > "/sys/bus/pci/devices/$BDF/reset_method"
+  else
+    warn "Unable to disable reset for non-conformant device @ $BDF, no reset_method file present ..."
+  fi
+done
+# --- end pci_noflr hack ---
+
 manual_pcidevs=$(getarg rd.qubes.hide_pci)
 case $manual_pcidevs in
 (*[!0-9a-f.:,]*) warn 'Bogus rd.qubes.hide_pci option - fix your kernel command line!';;

To use it, just add rd.qubes.pci_noflr=00:00.0,00:01.0,..., same syntax as rd.qubes.hide_pci
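To make the syntax concrete, and to cover the missing sanity check from the caveats, here is a small sketch of how a value for this option maps to the sysfs files that get written. The character whitelist is the one the stock script applies to rd.qubes.hide_pci; both function names are hypothetical, and the addresses in the example are placeholders.

```shell
valid_bdf_list() {
    # Reject an empty value or anything outside the character set the stock
    # script accepts for rd.qubes.hide_pci.
    case $1 in
        ''|*[!0-9a-f.:,]*) return 1 ;;
    esac
    return 0
}

noflr_paths() {
    # Print the reset_method path for each comma-separated BDF in $1.
    valid_bdf_list "$1" || { echo "bogus BDF list: $1" >&2; return 1; }
    for dev in $(printf '%s' "$1" | tr ',' ' '); do
        printf '/sys/bus/pci/devices/0000:%s/reset_method\n' "$dev"
    done
}

# Example:
#   noflr_paths 03:00.3,04:00.0
```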

I don’t necessarily suggest anyone uses this, though. The USB quirks option seems the more appropriate solution

I think usbcore.quirks is for USB devices and not for PCI devices (e.g. not for PCI USB controllers).

Ahhhhh you are 100% correct. Should have known this as I used it recently with a USB mass storage device