Crash on dom0 kernel 5.15.57

After installing the latest dom0 kernel update, 5.15.57-1.fc32, the whole system crashes after a couple of hours. This did not happen with 5.15.52. Xen is at 4.14.5 for both. My coreboot build is from a branch dated 2021-08-20, so I will update it to see whether there have been any fixes upstream.

Hardware is a System76 lemp9, which is otherwise fairly stable (some minor problems with the Intel AX201 wifi card, which are known issues).

Is there a kernel/Xen cmdline option to at least keep the kernel crash log on the console instead of having the system restart?

Update 1: I am now testing 5.18.16 with some i915 kernel parameters

Do I understand correctly: you think the system wouldn’t restart if the kernel crash log persisted to the console?

No - I mean the system restarts back to the bootloader upon crash, and there are no obvious errors in the usual places on disk.

Oh, thanks. Why don’t you try with kernel-latest, 5.18.16-1?

That was a good suggestion, but unfortunately it didn’t help. I:

  • Updated BIOS to 2022_08_12_2680d93 (which ships coreboot 4.17 I believe)
  • Got a rebooting crash on the 5.15.52 kernel after 8 hours of use (I started suspecting the i915 display driver, because I had started using a monitor with an HDMI cable of unknown quality and have been seeing occasional screen flickering, including just before the crash)
  • Got a display freeze just after logging into xfce on kernel 5.18.16

I am not using sys-gui yet so I wonder if it will help.

Update: I’ve booted again with 5.18.16 and added the following to the dom0 cmdline: i915.enable_fbc=1 i915.enable_dc=0 intel_idle.max_cstate=1 ahci.mobile_lpm_policy=1, based on the Arch Wiki i915 troubleshooting instructions. Will report back.
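
For anyone repeating this, it may be worth confirming after the reboot that the parameters really ended up on the running kernel’s command line. Below is a minimal sketch, assuming a POSIX shell in dom0. The helper name check_cmdline is made up for illustration, and it only checks that each token appears in /proc/cmdline, not that the driver actually honoured it.

```shell
# Sketch: report which of the expected parameters appear on a kernel cmdline.
# check_cmdline is a hypothetical helper name. The optional first argument
# points it at a file other than /proc/cmdline (handy for a dry run).
check_cmdline() {
    cmdline_file="${1:-/proc/cmdline}"
    # Pad with spaces so every parameter can be matched as a whole word.
    cmdline=" $(tr '\n' ' ' < "$cmdline_file") "
    for param in i915.enable_fbc=1 i915.enable_dc=0 \
                 intel_idle.max_cstate=1 ahci.mobile_lpm_policy=1; do
        case "$cmdline" in
            *" $param "*) echo "present: $param" ;;
            *)            echo "missing: $param" ;;
        esac
    done
}
```

Running check_cmdline with no argument in a dom0 terminal prints one line per parameter. A "missing" line would suggest the bootloader configuration was not regenerated, or that a different boot entry was used.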

I’d be surprised if that would help, to be honest…

Is it possible that your built-in nVIDIA card is…well…being an nVIDIA card? :stuck_out_tongue:

They should all be in the system journal. As long as your machine’s date and time are accurate, you should be able to filter by date and time.

For example, to show everything on 5th April between 10:00 and 10:59, type:

sudo journalctl | grep "^Apr 05 10:"

Well, you can get creative with your piping, filtering by keywords, date, time, kernel module name, you name it :wink:
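
As one example of that kind of piping, here is a minimal sketch that keeps only the lines that usually mark a kernel fault. The function name kernel_faults and the pattern list are illustrative assumptions, not an exhaustive set. The journalctl invocations in the comments use standard options (-k for kernel messages, -b -1 for the previous boot, --since/--until for a time window); the year in the date is only a placeholder.

```shell
# Sketch: filter log lines on stdin down to likely kernel fault markers.
# kernel_faults is a hypothetical helper name; extend the patterns as needed.
kernel_faults() {
    grep -E 'BUG:|\*ERROR\*|Call Trace:|Oops|panic'
}

# Intended use in dom0 (shown as comments, not executed here):
#   kernel messages from the boot before the current one, i.e. the one
#   that crashed (requires a persistent journal):
#     sudo journalctl -k -b -1 --no-pager | kernel_faults
#   everything in a one-hour window (placeholder date):
#     sudo journalctl --since "2022-04-05 10:00" --until "2022-04-05 11:00"
```

Looking at the previous boot with -b -1 is the relevant case after a crash that reboots the machine, because the current boot’s log starts only after the restart.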

Thanks - it’s an Intel UHD 620 graphics chip. I’ve enabled persistent journal storage but have not seen anything there so far.

I now have some interesting kernel log entries (but no crash!). This appears to be https://github.com/QubesOS/qubes-issues/issues/7664

dom0 kernel: BUG: Bad page map in process Xorg  pte:80000007eff0f365 pmd:103f15067
dom0 kernel: page:000000004b2cd5b3 refcount:1025 mapcount:-1 mapping:0000000000000000 index:0x0 pfn:0xf44d0
dom0 kernel: flags: 0x27ffffc0003408(dirty|owner_priv_1|reserved|private|node=0|zone=4|lastcpupid=0x1fffff)
dom0 kernel: raw: 0027ffffc0003408 ffff888142668840 ffffea0003d13440 0000000000000000
dom0 kernel: raw: 0000000000000000 00001f940000000c 00000401fffffffe 0000000000000000
dom0 kernel: page dumped because: bad pte
dom0 kernel: addr:00007d45ac41a000 vm_flags:1c0600f9 anon_vma:0000000000000000 mapping:ffff888107d8f588 index:798
dom0 kernel: file:gntdev fault:0x0 mmap:gntdev_mmap [xen_gntdev] readpage:0x0
dom0 kernel: CPU: 1 PID: 2258 Comm: Xorg Tainted: G     U            5.18.16-1.fc32.qubes.x86_64 #1
dom0 kernel: Hardware name: System76 Lemur Pro/Lemur Pro, BIOS 2022-08-12_2680d93 08/09/2022
dom0 kernel: Call Trace:
dom0 kernel:  <TASK>
dom0 kernel:  dump_stack_lvl+0x45/0x5e
dom0 kernel:  print_bad_pte.cold+0x6a/0xc5
dom0 kernel:  zap_pte_range+0x430/0x8b0
dom0 kernel:  ? __raw_callee_save_xen_pmd_val+0x11/0x22
dom0 kernel:  zap_pmd_range.isra.0+0x1b8/0x2f0
dom0 kernel:  zap_pud_range.isra.0+0xa9/0x1e0
dom0 kernel:  unmap_page_range+0x16c/0x200
dom0 kernel:  unmap_vmas+0x83/0x100
dom0 kernel:  unmap_region+0xbd/0x120
dom0 kernel:  __do_munmap+0x177/0x350
dom0 kernel:  __vm_munmap+0x75/0x120
dom0 kernel:  __x64_sys_munmap+0x17/0x20
dom0 kernel:  do_syscall_64+0x59/0x90
dom0 kernel:  ? do_syscall_64+0x69/0x90
dom0 kernel:  ? syscall_exit_to_user_mode+0x17/0x40
dom0 kernel:  ? do_syscall_64+0x69/0x90
dom0 kernel:  ? syscall_exit_to_user_mode+0x17/0x40
dom0 kernel:  ? do_syscall_64+0x69/0x90
dom0 kernel:  ? syscall_exit_to_user_mode+0x17/0x40
dom0 kernel:  ? do_syscall_64+0x69/0x90
dom0 kernel:  ? syscall_exit_to_user_mode+0x17/0x40
dom0 kernel:  ? do_syscall_64+0x69/0x90
dom0 kernel:  ? do_syscall_64+0x69/0x90
dom0 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
dom0 kernel: RIP: 0033:0x7d45d88ac37b
dom0 kernel: Code: 8b 15 21 6b 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb 89 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ed 6a 0c 00 f7 d8 64 89 01 48
dom0 kernel: RSP: 002b:00007ffc66b2a4e8 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
dom0 kernel: RAX: ffffffffffffffda RBX: 00000000000004bf RCX: 00007d45d88ac37b
dom0 kernel: RDX: 00007ffc66b2a500 RSI: 00000000004bf000 RDI: 00007d45ac41a000
dom0 kernel: RBP: 00007d45ac41a000 R08: 0000000000000008 R09: 0000000000000000
dom0 kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000009
dom0 kernel: R13: 00005ff47bb30550 R14: 0000000000000063 R15: 00005ff479a27e00
dom0 kernel:  </TASK>
dom0 kernel: Disabling lock debugging due to kernel taint
dom0 kernel: BUG: Bad page map in process Xorg  pte:80000007f5768365 pmd:103f15067
dom0 kernel: page:00000000e2c49915 refcount:1025 mapcount:-1 mapping:0000000000000000 index:0x0 pfn:0xf44d1
dom0 kernel: flags: 0x27ffffc0003408(dirty|owner_priv_1|reserved|private|node=0|zone=4|lastcpupid=0x1fffff)
dom0 kernel: raw: 0027ffffc0003408 ffff888142668840 ffffea0003d13480 0000000000000000
dom0 kernel: raw: 0000000000000000 00001f930000000c 00000401fffffffe 0000000000000000
dom0 kernel: page dumped because: bad pte

Right. The Lemur Pro. Very nice hardware, actually. You said System76, and I thought Oryx. My bad :laughing:

Very nice find.

It seems to think your kernel is tainted. I’ve seen this before on my Tiger Lake machines. Older versions of Xorg don’t seem to cope well with Tiger Lake and above. Not sure why yet…

Not sure why the kernel is tainted. But even with the updated dom0 kernel params I still get hangs sometimes on 5.18.16. I am going to try and downgrade linux-firmware (because of comments in https://github.com/QubesOS/qubes-issues/issues/7648) but this could also be related to https://github.com/QubesOS/qubes-issues/issues/7513 (on the 5.15 kernel I got a crash but now it’s all silent hangs without even a black screen).

Updated to 5.19.6-1.fc32 and got a crash a few minutes in (reproduced 3 times in a row), so I reverted to the 5.18 kernel.

I have been running the 5.15.64-1.fc32 kernel stably for the past couple of days (with xen-4.14.5-7.fc32).

Now updated to 5.15.68, though .64 had no significant stability issues.

Trying more updates.

Xen bumped to 4.15.5-9

Kernel 6.0.2 crashes and reboots after resuming from suspend (about 10 seconds after logging in, so it could be GUI-related?). Some new error messages:

dom0 kernel: nvme nvme0: Shutdown timeout set to 10 seconds
dom0 kernel: i915 0000:00:02.0: [drm] [ENCODER:118:DDI C/PHY C] is disabled/in DSI mode with an ungated DDI clock, gate it
dom0 kernel: i915 0000:00:02.0: [drm] [ENCODER:102:DDI B/PHY B] is disabled/in DSI mode with an ungated DDI clock, gate it
dom0 kernel: i915 0000:00:02.0: [drm] [ENCODER:94:DDI A/PHY A] is disabled/in DSI mode with an ungated DDI clock, gate it
dom0 kernel: ACPI: EC: event unblocked
dom0 kernel: ACPI: EC: interrupt unblocked
dom0 kernel: ACPI: PM: Waking up from system sleep state S3
dom0 kernel: CPU3 is up
dom0 kernel: ACPI: \_SB_.CP03: Found 3 idle states
dom0 kernel: cpu 3 spinlock event irq 143
dom0 kernel: installing Xen timer for CPU 3
dom0 kernel: CPU2 is up
dom0 kernel: ACPI: \_SB_.CP02: Found 3 idle states
dom0 kernel: cpu 2 spinlock event irq 137
dom0 kernel: installing Xen timer for CPU 2
dom0 kernel: CPU1 is up
dom0 kernel: ACPI: \_SB_.CP01: Found 3 idle states
dom0 kernel: cpu 1 spinlock event irq 131
dom0 kernel: installing Xen timer for CPU 1
dom0 kernel: Enabling non-boot CPUs ...
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU7
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU5
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU3
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU1
dom0 kernel: xen_acpi_processor: Uploading Xen processor PM info
dom0 kernel: ACPI: PM: Restoring platform NVS memory
dom0 kernel: ACPI: EC: EC started
dom0 kernel: ACPI: PM: Low-level resume complete

Kernel 5.15.74 survives suspend and comes back, but:

  • new error message after suspend resume: dom0 kernel: i915 0000:00:02.0: [drm] *ERROR* Atomic update failure on pipe A (start=367 end=368) time 590 us, min 1073, max 1079, scanline start 1043, end 1083
  • GUI qubes need restarting because X11 windows become unresponsive (and qvm-shutdown does not work, you need to kill…)

Though some of these are probably orthogonal.

Edit 1: That i915 atomic update failure might be this issue: "i915 [drm] *ERROR* Atomic update failure on pipe A" (#2215) in the drm/intel issue tracker on GitLab.