AMD-Vi IO_PAGE_FAULT, crashes

Hello,

I am having many random crashes on my ThinkPad L14 G2 with a Ryzen 7 PRO 5850U. They are most probably caused by my wifi chip and were not present on Qubes 4.0.

Here is my forum thread that led to the realisation that it is the wifi chip, an Intel 8265NGW. An exact description of the behaviour is there too.

There are a great many IO page faults in /var/log/xen/console/hypervisor.log, and they all look exactly like this:

[2022-04-10 11:52:16] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:03:00.0 d1 addr fffffffdf8000000 flags 0x8 I
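
To see at a glance which devices and domains produce the faults, the log can be summarised. This is only a minimal sketch, assuming every fault line has exactly the format shown above; the function name count_io_page_faults is made up for illustration.

```shell
# Hypothetical helper: count AMD-Vi IO_PAGE_FAULT lines per device and domain.
# Prints "<count> <device> <domain>", most frequent first.
count_io_page_faults() {
    # $1: log file; defaults to the hypervisor log path mentioned above
    log="${1:-/var/log/xen/console/hypervisor.log}"
    grep 'AMD-Vi: IO_PAGE_FAULT:' "$log" \
        | sed -E 's/.*IO_PAGE_FAULT: ([0-9a-f:.]+) (d[0-9]+) .*/\1 \2/' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2, $3 }'
}
```

In dom0 it would be called without arguments, from a root shell if the log is not readable otherwise.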

0000:03:00.0 is my wifi chip:
lspci -v -s 0:03:00.0:

	Subsystem: Intel Corporation Dual Band Wireless-AC 8265
	Flags: bus master, fast devsel, latency 0, IRQ 40
	Memory at fd600000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: [c8] Power Management version 3
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [40] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number <scrubbed>
	Capabilities: [14c] Latency Tolerance Reporting
	Capabilities: [154] L1 PM Substates
	Kernel driver in use: pciback
	Kernel modules: iwlwifi

Actions taken:

  1. Installing the patch for XSA-399, XSA-400
  2. Tried with UEFI activated/deactivated: Other PCI devices, UEFI network stack, security chip, virtualization, memory protection: Same behaviour, or more errors (when disabling virtualization)
  3. Tried with another wifi chip of the same kind: Same behaviour
  4. Tried with newer kernel (5.16.13.2): Same behaviour
  5. Tried with iommu=soft, more memory for sys-net: Same behaviour
  6. Tried with whole qubes iommu=soft option: Same behaviour
  7. Tried with a Realtek Wifi nvme chip: No page faults, but no network either.
  8. Tried with UEFI disabled nvme wifi + USB wifi: No page faults, but unacceptable performance (throughput good, ping good, page loading takes 30 seconds or fails; I have not troubleshot the problems with the USB wifi adapter further)

How do I troubleshoot the IO page faults further?
What can I try to mitigate this?
Is this a Xen or Linux kernel bug?

Assuming this is a bug that will not be fixed for a while and there is no mitigation or hotfix: how likely is it that I can buy another wifi chip and it will work? If so, which chip would be best™/working?


Problem persists.
I ordered an Intel AX210 wifi chip to test with and reinstalled 4.1.

Things worthy of note:

  1. One has to disable it while installing, otherwise the installer will hang on “configuring network”
  2. One has to apply this hack to make it work.

I think I am getting fewer IO_PAGE_FAULTs, but it is hard to say for certain. Also, I have yet to see a hard crash with this network card.

So I tried multiple other things with the new wifi card installed:

  1. Set kernelopts of sys-net to “iommu=ps” = no network at all
  2. Set kernelopts of sys-net to “iommu=verbose” = no effect at all. It does not get more verbose (or where should the verbosity manifest?)
  3. Set kernelopts of sys-net to “iommu=soft virtiotlb=around 16k memory” = no effect.
  4. Enabling bluetooth in the BIOS = more bugs[1]

[1]: The AX210 is a combined wifi + Bluetooth card. When Bluetooth is enabled in the BIOS, all USB ports cease to function.
Additionally, I get this behaviour permanently: “Device : is available, Device : is removed”.
It is very annoying, and it also throws IO_PAGE_FAULTs. As I usually have Bluetooth disabled, this is not a problem for me, but it may be related.

I have seen a lot of page faults now, monitoring these log files in real time:

  • dom0 journalctl -xef
  • dom0 dmesg -w
  • dom0 tail -f /var/log/xen/console/hypervisor.log
  • sys-net dmesg -w
  • sys-net journalctl -xef
  • sys-net tail -f /var/log/xen/xend.log

When the page faults happen in the /var/log/xen/console/hypervisor.log, there is nothing indicating any error in any other logs, nor is there any correlating event, service, whatever.

The page faults seem to be more likely when the wifi is in use.
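
Whether the faults really cluster around wifi use could be checked by counting them per hour and comparing the result against the times the connection was busy. A rough sketch, assuming each log line starts with a "[YYYY-MM-DD HH:MM:SS]" timestamp as in the excerpts in this thread; faults_per_hour is a made-up name.

```shell
# Hypothetical helper: count IO_PAGE_FAULT lines per hour.
# Prints "<date> <hour>:00 <count>" in chronological order.
faults_per_hour() {
    # $1: log file; defaults to the hypervisor log in dom0
    log="${1:-/var/log/xen/console/hypervisor.log}"
    grep 'IO_PAGE_FAULT' "$log" \
        | sed -E 's/^\[([0-9-]+ [0-9]{2}):.*/\1/' \
        | sort | uniq -c \
        | awk '{ print $2, $3 ":00", $1 }'
}
```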

I once observed another IO_PAGE_FAULT from another domain (d0), from a device that is likely to be my SSD. Also, at around the same time, there was another IO_PAGE_FAULT from d20 for a device I was unable to pinpoint.

Does this indicate a hardware fault of my device?

I went through the journals of the boot process and noticed a few things that do not seem to be critical, but maybe somebody can tell me how to fix them or whether they are related:

dom0 kernel: ACPI BIOS Warning (bug): Incorrect checksum in table [BGRT] - 0x1B, should be 0x36 (20200925/tbprint-173)

After the many crashes, I updated my BIOS as one of the first things I tried. Maybe this is related? Should I try to reflash the BIOS?

dom0 kernel: cpu 0 spinlock event irq 57
dom0 kernel: VPMU disabled by hypervisor.
dom0 kernel: Performance Events: PMU not available due to virtualization, using software events only.
dom0 kernel: rcu: Hierarchical SRCU implementation.
dom0 kernel: NMI watchdog: Perf NMI watchdog permanently disabled
dom0 kernel: smp: Bringing up secondary CPUs ...
dom0 kernel: installing Xen timer for CPU 1
dom0 kernel: cpu 1 spinlock event irq 67
dom0 kernel: installing Xen timer for CPU 2
dom0 kernel: cpu 2 spinlock event irq 73
dom0 kernel: installing Xen timer for CPU 3
dom0 kernel: cpu 3 spinlock event irq 79
dom0 kernel: installing Xen timer for CPU 4
dom0 kernel: cpu 4 spinlock event irq 85
dom0 kernel: installing Xen timer for CPU 5
dom0 kernel: cpu 5 spinlock event irq 91
dom0 kernel: installing Xen timer for CPU 6
dom0 kernel: cpu 6 spinlock event irq 97
dom0 kernel: installing Xen timer for CPU 7
dom0 kernel: cpu 7 spinlock event irq 103
dom0 kernel: smp: Brought up 1 node, 8 CPUs

Does not look critical to me, but it is yellow. Is this a problem?

dom0 kernel: xen:grant_table: Grant tables using version 1 layout
dom0 kernel: Grant table initialized
dom0 kernel: DMA: preallocated 512 KiB GFP_KERNEL pool for atomic allocations
dom0 kernel: DMA: preallocated 512 KiB GFP_KERNEL|GFP_DMA pool for atomic allocations
dom0 kernel: DMA: preallocated 512 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations

The second line is yellow as well.

dom0 kernel: ACPI: Power Button [PWRF]
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: ACPI: \_SB_.PLTF.C000: Found 3 idle states
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: ACPI: \_SB_.PLTF.C001: Found 3 idle states
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: ACPI: \_SB_.PLTF.C002: Found 3 idle states
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: ACPI: \_SB_.PLTF.C003: Found 3 idle states
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: ACPI: \_SB_.PLTF.C004: Found 3 idle states
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: ACPI: \_SB_.PLTF.C005: Found 3 idle states
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: ACPI: \_SB_.PLTF.C006: Found 3 idle states
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)

The “Firmware Bug” lines are yellow. I think this is related to the missing power states, non-working S3, or something else power-related. Maybe this is important, I don’t know.

dom0 kernel: xen: registering gsi 36 triggering 0 polarity 1
dom0 kernel: xen: --> pirq=36 -> irq=36 (gsi=36)
dom0 kernel: ccp 0000:07:00.2: ccp: unable to access the device: you might be running a broken BIOS.
dom0 kernel: ccp 0000:07:00.2: tee: ring init command failed (0x00000005)
dom0 kernel: ccp 0000:07:00.2: tee: failed to init ring buffer
dom0 kernel: ccp 0000:07:00.2: tee initialization failed
dom0 kernel: ccp 0000:07:00.2: psp initialization failed

Lines 4 and 5 are RED; this does not look good. Any ideas on how to resolve this?

dom0 kernel: [drm] DP Alt mode state on HPD: 1

This line is yellow.

dom0 systemd[1]: Configuration file /etc/systemd/system/qubes-vm@sys-net.service.d/50_autostart.conf is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.

Another yellow line. It is related to sys-net, but does not look all too critical to me.

dom0 kernel: input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:14/LNXVIDEO:00/input/input8
dom0 kernel: acpi PNP0C14:01: duplicate WMI GUID <scrubbed> (first instance was on PNP0C14:00)
dom0 kernel: acpi PNP0C14:02: duplicate WMI GUID <scrubbed> (first instance was on PNP0C14:00)
dom0 kernel: acpi PNP0C14:03: duplicate WMI GUID <scrubbed> (first instance was on PNP0C14:00)
dom0 kernel: acpi PNP0C14:04: duplicate WMI GUID <scrubbed> (first instance was on PNP0C14:00)
dom0 kernel: acpi PNP0C14:05: duplicate WMI GUID <scrubbed> (first instance was on PNP0C14:00)

The last 5 lines are yellow. Maybe something is broken, but I doubt it.

dom0 kernel: piix4_smbus 0000:00:14.0: SMBus Host Controller at 0xff00, revision 15
dom0 kernel: piix4_smbus 0000:00:14.0: Using register 0x02 for SMBus port selection
dom0 kernel: platform regulatory.0: Direct firmware load for regulatory.db failed with error -2

I don’t know what it is, the last line is yellow.

dom0 systemd[1]: systemd-rfkill.socket: Socket service systemd-rfkill.service not loaded, refusing.
dom0 systemd[1]: Failed to listen on Load/Save RF Kill Switch Status /dev/rfkill Watch.

Both lines are RED and clearly wifi-related. However, I doubt this is the problem. But honestly: I have no idea…

Those are all the maybe important parts of my journal.

Maybe somebody with more experience than me can spot something and help out. Would certainly appreciate it.

So what’s next to try?

I could fire up Kali live and watch to see if the wifi does something weird.

I have never used Xen outside of Qubes, so I am hesitant to try this out, but if it is valuable I certainly will.

Does anybody have more ideas on what to try/Where to look for troubleshooting information like logs or something?

PS: I have accumulated around 40 hours of uptime with the new card now but have not seen a crash yet! Maybe… Just maybe it is stable with the IO_PAGE_FAULTS… Time will tell.

Correction:
I am still using the machine with bluetooth enabled, and can see different IO_PAGE_FAULTS!

[2022-04-15 09:58:08] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8020000 flags 0x8 I
[2022-04-15 10:08:39] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8040000 flags 0x8 I
[2022-04-15 10:11:51] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8010000 flags 0x8 I
[2022-04-15 10:12:29] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8030000 flags 0x8 I
[2022-04-15 10:12:42] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8040000 flags 0x8 I
[2022-04-15 10:14:28] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8070000 flags 0x8 I
[2022-04-15 10:16:32] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8050000 flags 0x8 I
[2022-04-15 10:24:15] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8020000 flags 0x8 I

Note that those seem to originate from dom0.

lspci -v returns

00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
	Flags: fast devsel

for this device.
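
When matching a fault line against lspci output, it can help to split the address into its parts: Xen prints it as domain:bus:device.function, while lspci leaves out the domain by default, so 0000:01:00.0 would be listed there as 01:00.0. A small sketch; parse_bdf is a made-up helper name, and input that does not match the expected pattern is printed back unchanged.

```shell
# Hypothetical helper: split a PCI address "DDDD:BB:dd.f" into its fields.
parse_bdf() {
    # $1: address as printed in the Xen fault line, e.g. 0000:01:00.0
    echo "$1" | sed -E \
        's/^([0-9a-f]{4}):([0-9a-f]{2}):([0-9a-f]{2})\.([0-7])$/domain=\1 bus=\2 device=\3 function=\4/'
}
```

For example, parse_bdf 0000:01:00.0 reports bus 01, device 00, function 0, which corresponds to lspci -v -s 01:00.0.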

Again, there is no indication of any other error in any other log in dom0.

Update:

I observed a crash! It took down sys-net, but not dom0.

What I observed:
The Tor Browser in whonix-ws was unable to load a site. After looking at the bar, I noticed that the nm-applet was not present.

I was unable to open a terminal over the qubes-applet.
qvm-run -p sys-net xterm returned: Some error with X-server (sorry, error message was flushed into oblivion by the next command).

However, I was able to get an echo with qvm-run -p sys-net "echo hi" and to extract the journal with
qvm-run -p sys-net "sudo journalctl". I repeated this command with | cat > sysnetlog appended to save it.

One can see that the Bluetooth module is not working correctly and is causing the “Device removed/Device available” notifications.

Here is the dom0 log of the page faults:

[2022-04-15 11:22:28] (XEN) Freed 580kB init memory
[2022-04-15 11:58:08] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8020000 flags 0x8 I
[2022-04-15 12:08:39] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8040000 flags 0x8 I
[2022-04-15 12:11:51] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8010000 flags 0x8 I
[2022-04-15 12:12:29] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8030000 flags 0x8 I
[2022-04-15 12:12:42] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8040000 flags 0x8 I
[2022-04-15 12:14:28] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8070000 flags 0x8 I
[2022-04-15 12:16:32] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8050000 flags 0x8 I
[2022-04-15 12:24:15] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8020000 flags 0x8 I
[2022-04-15 12:28:55] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8030000 flags 0x8 I
[2022-04-15 12:30:52] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8020000 flags 0x8 I
[2022-04-15 12:34:10] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8050000 flags 0x8 I
[2022-04-15 12:37:25] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8010000 flags 0x8 I
[2022-04-15 12:42:26] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:03:00.0 d1 addr fffffffdf8000000 flags 0x8 I
[2022-04-15 12:42:28] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8080000 flags 0x8 I
[2022-04-15 12:43:36] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8030000 flags 0x8 I
[2022-04-15 12:43:38] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8070000 flags 0x8 I
[2022-04-15 12:44:17] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8050000 flags 0x8 I
[2022-04-15 12:44:20] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.0 d0 addr fffffffdf8070000 flags 0x8 I

Attached is a snippet of the journal around the time of the crashes.

I cannot make sense of it, but maybe somebody else can. Should I send this to the Xen mailing list? Is this something for the Linux kernel or something for Intel?

Please note: This happened with Bluetooth enabled. Normally (with it deactivated in the BIOS) I only get page faults for this exact device and address, and not the 0000:01:00.0 ones:

[2022-04-15 12:42:26] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:03:00.0 d1 addr fffffffdf8000000 flags 0x8 I
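
To check which device and address combinations actually occur, the log can be reduced to distinct pairs with a count each. This is a sketch under the assumption that the fault lines keep the "IO_PAGE_FAULT: <device> ... addr <address>" layout seen above; fault_addresses is a made-up name.

```shell
# Hypothetical helper: list each (device, address) pair with its fault count.
# Prints "<device> <address> <count>", sorted by device and address.
fault_addresses() {
    # $1: log file; defaults to the hypervisor log in dom0
    log="${1:-/var/log/xen/console/hypervisor.log}"
    awk '/IO_PAGE_FAULT:/ {
        dev = ""; addr = ""
        # locate the fields by keyword instead of by fixed position
        for (i = 1; i < NF; i++) {
            if ($i == "IO_PAGE_FAULT:") dev = $(i + 1)
            if ($i == "addr")           addr = $(i + 1)
        }
        seen[dev " " addr]++
    }
    END { for (k in seen) print k, seen[k] }' "$log" | sort
}
```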

With the other wifi card, this sometimes caused dom0 to crash entirely. With the new one I have not observed this over a couple of days, but maybe it will happen eventually too.

However, sys-net crashed at exactly the time of this error, so maybe there is information that can help with troubleshooting.
sysnet.log (345.5 KB)

The most recent edition of linux-firmware broke my iwlwifi for AX210 chip AND my r8169 for my RealTek 2.5GbE ethernet port, and the previous workarounds of using kernel 5.10 and removing some firmware files no longer seem to work.

I have a hunch that it’s not playing nice with Xen.

If you absolutely MUST have networking, and don’t care about speed, literally any cheap USB dongle, 802.11ac or lower, should be fine.

Oh dear god, don’t do that :joy:
The Bluetooth controller presents itself as a USB device, and if you give it to the BIOS, the BIOS will hold onto it for dear life (and apparently take the whole USB bus with it!).

No. Depending on what’s in the BIOS you flash, it may end up showing more of these.

It means that the Linux kernel devs haven’t reverse-engineered the ACPI code that your hardware manufacturer and Microsoft procured behind closed doors, and didn’t share with them…yet. But they’ll work it out eventually.

That appears to be what AMD CPUs do when running on Xen, when they resume. My 4800U does something similar.

This is more of your hardware manufacturer giving Linux support the middle finger, unfortunately…

Metaphorically speaking, of course… :grin:

This is because ccp and tee are initialised by Xen instead of Linux. I get these too.

This is fine.

This is how qubes autostarts VMs on boot. It’s fine.

More middle finger to Linux support from OEMs.

Your BIOS wouldn’t let Linux load any firmware, so it had to dummy it.

I have a feeling that has something to do with you giving the BIOS control of your Bluetooth and Wi-fi hardware… :wink:

Deep breaths. :upside_down_face:

AMD are actually trying to fix all of these, but it will take time.
If you want to keep investigating and trying different things, go for it, by all means; but it looks like a lot of this stuff requires OEMs to fix their non-free code. So you will probably have to play the waiting game… :frowning:

The wifi will likely behave completely normally. Most drivers and firmware are written under the assumption that they’re not virtualised (and don’t really like to be virtualised, either…).


Just out of curiosity, does your machine resume from S3 sleep properly?

My guess is no. Don’t try it if you want to keep your uptime streak going, because amdgpu will likely get a ring gfx timeout and force a hard reboot.

But hey, I could be (and actually WANT to be) wrong on this one :slight_smile: .

Hello alzer,

thank you so much for your help :slight_smile:

The highlighted messages make much more sense to me now.

I’m really sorry to hear that both broke for you… I am on 5.10.104-3.fc32.qubes.x86_64 and the deletion trick allowed me to use the AX210, but with the described problems…

Unfortunately not. It is quite sad, but I can live with that.

You mentioned that AMD is working on this. Should I send the log of my crash to them (or somebody else?), so they have more information on how to fix the issues?

I had random freezes and crashes on 4.0 around 1-2 times a month, and I used to live with it.
Now with 4.1 I got 4-12 crashes a day with the 8265NGW and cannot live with that.
After installing the AX210 I still get the page faults, but no complete system crash so far (a couple of days with that hardware configuration).
While troubleshooting, I went a few days with a USB dongle without the page faults, but I observed weird connection losses on Tor and extraordinary load times of 15+ seconds for internal web services that usually load instantly, so that isn’t a real option unless I troubleshoot that too.

So if this problem will get solved eventually, I think I will try to rock my AX210 until I observe the first severe crash/freeze. If those occur only 1-2 times a month, I am fine with that. Certainly better than 4-12 times a day with the NGW card…

Again, thank you very much for your help!
After days and days of troubleshooting and trying things, the suspicion grew that my hardware is faulty.
Glad to hear that it is just the usual FOSS + OEM things.

The amazing Qubes devs regularly check the “hot topics” on this forum, and do their best to figure things out.

It gets harder when an issue involves developers outside the Qubes project. Sometimes bug reports get pushed upstream, but if you feel the need to, by all means, report it to AMD, Intel (the wifi), FreeDesktop (Xorg, amdgpu), kernel.org’s Bugzilla, the Xen Project.

They’re only just starting to tackle the bulk of them now.

4.1 threw in a LOT more hardware virtualisation. Yes, it’s much better and safer, IF your hardware will play nice with it. Most hardware seems to play nice, we just got unlucky. That’s all. But we’ll get there.

If you do encounter any crash, the first thing I would recommend you do in dom0 after you reboot is:

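MONTH=Apr      # example value: the 3-letter abbreviation of the month the crash took place
DAY=15         # example value: the date of the crash
VM_NAME=work   # example value: whatever VM you use this forum in (underscore, as "VM-NAME" is not a valid bash variable name)
HOUR=12        # example value: the hour the crash took place
sudo journalctl | grep "$MONTH $DAY $HOUR:" > crash.log
qvm-copy-to-vm "$VM_NAME" crash.log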

(I’m assuming you can understand bash scripting :wink: I’m sure you can…)

That will give you a snapshot into what happened, and the log won’t be massive.
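
If the exact time of a fault is known, a narrower slice can also be cut with journalctl's --since and --until options instead of grep. A minimal sketch, assuming GNU date (as found in a Fedora-based dom0); shift_time and journal_window are made-up names, and the 60-second default is arbitrary.

```shell
# Hypothetical helper: shift a "YYYY-MM-DD HH:MM:SS" timestamp by N seconds.
shift_time() {
    # $1: timestamp, $2: offset in seconds (may be negative)
    epoch=$(date -d "$1" +%s) || return 1
    date -d "@$((epoch + $2))" '+%Y-%m-%d %H:%M:%S'
}

# Hypothetical helper: dump the journal for a window around a timestamp.
journal_window() {
    ts="$1"           # e.g. "2022-04-15 12:42:26"
    pad="${2:-60}"    # seconds before and after the timestamp
    sudo journalctl --since "$(shift_time "$ts" "-$pad")" \
                    --until "$(shift_time "$ts" "$pad")"
}
```

Something like journal_window "2022-04-15 12:42:26" > crash.log would then keep two minutes of journal around that moment. Timestamps in hypervisor.log are not necessarily in the same timezone as the journal, so the value may need adjusting first.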

OEMs don’t seem to realise that Linux and BSD Chads are willing to drop some serious cash for hardware that they can fully control. In some cases, they’re even higher-yield than hardcore Windows gamers!