AMD-Vi IO_PAGE_FAULT, crashes

baflya · April 15, 2022, 2:24pm

Problem persists.
I ordered an Intel AX210 wifi card to test with, and reinstalled 4.1.

Things worthy of note:

One has to disable it while installing, otherwise the installer will hang on “configuring network”
One has to apply this hack to make it work.

I think I am getting fewer IO_PAGE_FAULTs, but it is hard to say for certain. Also, I have yet to see a hard crash with this network card.

So I tried multiple other things with the new wifi card installed:

Set kernelopts of sys-net to “iommu=ps” = no network at all
Set kernelopts of sys-net to “iommu=verbose” = no effect at all. It does not get more verbose (or where should the verbosity manifest?)
Set kernelopts of sys-net to “iommu=soft virtiotlb=around 16k memory” = no effect.
Enabling bluetooth in the BIOS = more bugs[1]
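
To rule out that a kernelopts change silently never reached the sys-net kernel, one can check /proc/cmdline inside the qube after restarting it. This is only a minimal sketch: the helper name has_kernel_opt is made up for illustration, and it only tests whether the exact option string is present, not whether the kernel honours it.

```shell
# Hypothetical helper: succeeds if the exact option $1 appears on the kernel
# command line. $2 is the file to read (defaults to /proc/cmdline).
has_kernel_opt() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Example, run inside sys-net after restarting it:
if has_kernel_opt "iommu=soft"; then
    echo "iommu=soft is active"
else
    echo "iommu=soft is NOT on the kernel command line"
fi
```

If the option is missing here, the "no effect" results above would say nothing about the option itself.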

[1]: The AX210 is a dual wifi + bluetooth card. When bluetooth is enabled in the BIOS, all USB ports stop working.
Additionally, I constantly get these notifications: “Device : is available, Device : is removed”.
It is very annoying, and it also throws IO_PAGE_FAULTs. As I usually have Bluetooth disabled, this is not a problem for me, but it may be related.

I have seen a lot of page faults now, monitoring these log files in real time:

dom0 journalctl -xef
dom0 dmesg -w
dom0 tail -f /var/log/xen/console/hypervisor.log
sys-net dmesg -w
sys-net journalctl -xef
sys-net tail -f /var/log/xen/xend.log

When the page faults show up in /var/log/xen/console/hypervisor.log, nothing in any of the other logs indicates an error, and I cannot find any correlating event or service.
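
To get a rough picture of which device the faults come from, the hypervisor log can be tallied per PCI address. This is a sketch under assumptions: the helper name faults_by_device is made up, and it assumes each fault line contains the device in the usual dddd:bb:dd.f form (for example 0000:03:00.0), which may differ between Xen versions.

```shell
# Hypothetical helper: count IO_PAGE_FAULT lines per PCI address in a log file.
# Assumes each fault line contains the device as dddd:bb:dd.f (e.g. 0000:03:00.0);
# fault lines without such an address are not counted.
faults_by_device() {
    grep 'IO_PAGE_FAULT' "$1" \
        | grep -oE '[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]' \
        | sort | uniq -c | sort -rn
}

# Example, run in dom0 (path taken from the list above):
# faults_by_device /var/log/xen/console/hypervisor.log
```

The output is one line per device, most frequent first, and the addresses can then be compared against lspci in dom0.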

The page faults seem to be more likely when the wifi is in use.

I once observed an IO_PAGE_FAULT from another domain (d0), which is likely to be my SSD. At around the same time there was also an IO_PAGE_FAULT from d20, for a device I was unable to pinpoint.

Does this indicate a hardware fault of my device?

I went through the journals of the boot process and noticed some things that do not seem to be critical, but maybe somebody can tell me how to fix them or whether they are related:

dom0 kernel: ACPI BIOS Warning (bug): Incorrect checksum in table [BGRT] - 0x1B, should be 0x36 (20200925/tbprint-173)

After the many crashes, updating my BIOS was one of the first things I tried. Maybe this is related? Should I try to reflash the BIOS?

dom0 kernel: cpu 0 spinlock event irq 57
dom0 kernel: VPMU disabled by hypervisor.
dom0 kernel: Performance Events: PMU not available due to virtualization, using software events only.
dom0 kernel: rcu: Hierarchical SRCU implementation.
dom0 kernel: NMI watchdog: Perf NMI watchdog permanently disabled
dom0 kernel: smp: Bringing up secondary CPUs ...
dom0 kernel: installing Xen timer for CPU 1
dom0 kernel: cpu 1 spinlock event irq 67
dom0 kernel: installing Xen timer for CPU 2
dom0 kernel: cpu 2 spinlock event irq 73
dom0 kernel: installing Xen timer for CPU 3
dom0 kernel: cpu 3 spinlock event irq 79
dom0 kernel: installing Xen timer for CPU 4
dom0 kernel: cpu 4 spinlock event irq 85
dom0 kernel: installing Xen timer for CPU 5
dom0 kernel: cpu 5 spinlock event irq 91
dom0 kernel: installing Xen timer for CPU 6
dom0 kernel: cpu 6 spinlock event irq 97
dom0 kernel: installing Xen timer for CPU 7
dom0 kernel: cpu 7 spinlock event irq 103
dom0 kernel: smp: Brought up 1 node, 8 CPUs

Does not look critical to me, but it is yellow. Is this a problem?

dom0 kernel: xen:grant_table: Grant tables using version 1 layout
dom0 kernel: Grant table initialized
dom0 kernel: DMA: preallocated 512 KiB GFP_KERNEL pool for atomic allocations
dom0 kernel: DMA: preallocated 512 KiB GFP_KERNEL|GFP_DMA pool for atomic allocations
dom0 kernel: DMA: preallocated 512 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations

The second line is yellow as well.

dom0 kernel: ACPI: Power Button [PWRF]
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: ACPI: \_SB_.PLTF.C000: Found 3 idle states
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: ACPI: \_SB_.PLTF.C001: Found 3 idle states
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: ACPI: \_SB_.PLTF.C002: Found 3 idle states
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: ACPI: \_SB_.PLTF.C003: Found 3 idle states
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: ACPI: \_SB_.PLTF.C004: Found 3 idle states
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: ACPI: \_SB_.PLTF.C005: Found 3 idle states
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: ACPI: \_SB_.PLTF.C006: Found 3 idle states
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)

The “Firmware Bug” lines are yellow. I think this is related to the missing power states, S3 not working, or something else power-related. Maybe this is important, I don’t know.

dom0 kernel: xen: registering gsi 36 triggering 0 polarity 1
dom0 kernel: xen: --> pirq=36 -> irq=36 (gsi=36)
dom0 kernel: ccp 0000:07:00.2: ccp: unable to access the device: you might be running a broken BIOS.
dom0 kernel: ccp 0000:07:00.2: tee: ring init command failed (0x00000005)
dom0 kernel: ccp 0000:07:00.2: tee: failed to init ring buffer
dom0 kernel: ccp 0000:07:00.2: tee initialization failed
dom0 kernel: ccp 0000:07:00.2: psp initialization failed

Lines 4 and 5 are red; this does not look good. Any ideas on how to resolve this?

dom0 kernel: [drm] DP Alt mode state on HPD: 1

This line is yellow.

dom0 systemd[1]: Configuration file /etc/systemd/system/qubes-vm@sys-net.service.d/50_autostart.conf is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.

Another yellow line. It is related to sys-net, but does not look all too critical to me.

dom0 kernel: input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:14/LNXVIDEO:00/input/input8
dom0 kernel: acpi PNP0C14:01: duplicate WMI GUID <scrubbed> (first instance was on PNP0C14:00)
dom0 kernel: acpi PNP0C14:02: duplicate WMI GUID <scrubbed> (first instance was on PNP0C14:00)
dom0 kernel: acpi PNP0C14:03: duplicate WMI GUID <scrubbed> (first instance was on PNP0C14:00)
dom0 kernel: acpi PNP0C14:04: duplicate WMI GUID <scrubbed> (first instance was on PNP0C14:00)
dom0 kernel: acpi PNP0C14:05: duplicate WMI GUID <scrubbed> (first instance was on PNP0C14:00)

The last 5 lines are yellow. Maybe something is broken, but I doubt it.

dom0 kernel: piix4_smbus 0000:00:14.0: SMBus Host Controller at 0xff00, revision 15
dom0 kernel: piix4_smbus 0000:00:14.0: Using register 0x02 for SMBus port selection
dom0 kernel: platform regulatory.0: Direct firmware load for regulatory.db failed with error -2

I don’t know what this is; the last line is yellow.

dom0 systemd[1]: systemd-rfkill.socket: Socket service systemd-rfkill.service not loaded, refusing.
dom0 systemd[1]: Failed to listen on Load/Save RF Kill Switch Status /dev/rfkill Watch.

Both lines are red and clearly wifi related. However, I doubt this is the problem. But honestly: I have no idea…

Those are all the possibly important parts of my journal.

Maybe somebody with more experience than me can spot something and help out. Would certainly appreciate it.

So what’s next to try?

I could fire up Kali live and watch to see if the wifi does something weird.

I have never used Xen outside of Qubes, so I am hesitant to try this out, but if it would be valuable I certainly will.

Does anybody have more ideas on what to try, or where to look for troubleshooting information such as logs?

PS: I have accumulated around 40 hours of uptime with the new card now but have not seen a crash yet! Maybe… Just maybe it is stable with the IO_PAGE_FAULTS… Time will tell.