Kernel-latest 6.13.6 boot loop

GM0x · March 19, 2025, 5:39am

Just updated kernel-latest to kernel-latest-1000:6.13.6-1.qubes.fc37, which unfortunately results in a reboot right after “Finished modprobe@configfs.service - Load Kernel Module configfs”. The systemd journal for that boot ends before it gets that far, so I guess it’s crashing before anything can be written to it. I’ve never tried to debug a kernel that crashes straight to a reboot during boot before. What other information should I provide?

AMD Ryzen 9 7950X3D, Qubes 4.2
Kernel command line: placeholder root=/dev/mapper/qubes_dom0-root ro rd.lvm.lv=qubes_dom0/root rd.lvm.lv=qubes_dom0/swap plymouth.ignore-serial-consoles 6.13.6-1.qubes.fc37.x86_64 x86_64 usbcore.authorized_default=0 rd.qubes.dom0_usb=56:00.0 rd.qubes.hide_pci=01:00.0,01:00.1 nouveau.modeset=0 rd.driver.blacklist=nouveau module_blacklist=nouveau mce=bootlog pcie_aspm=off

I have two GPUs: an AMD integrated GPU on my CPU that I use for dom0, and an NVIDIA 4090 that I pass to an HVM qube. The 4090 is at PCI address 01:00.0, hence the hide_pci entry.

fdhhjigf · March 19, 2025, 7:36am

I think this kernel was only supposed to be superseded, but it was pushed to stable by mistake? @marmarek

marmarek · March 19, 2025, 8:51pm

Do you see any crash message on the screen, even briefly? If so, you can append panic=0 to the vmlinuz options and/or noreboot to the Xen options to keep the message on the screen longer, or at least to get a better look at the last message before the crash.
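
A minimal sketch of making those two options persistent, assuming dom0 boots through GRUB and builds its command lines from /etc/default/grub. The variable names GRUB_CMDLINE_LINUX and GRUB_CMDLINE_XEN_DEFAULT are an assumption about that file, so check that yours already uses them before copying this:

```shell
# Fragment to append to /etc/default/grub (assumed location).
# Each line keeps the existing options and adds one debugging option.
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX panic=0"              # kernel: do not auto-reboot after a panic
GRUB_CMDLINE_XEN_DEFAULT="$GRUB_CMDLINE_XEN_DEFAULT noreboot" # Xen: do not reboot when dom0 crashes
```

The GRUB configuration then has to be regenerated (for example with grub2-mkconfig; the output path depends on the installation) before the change takes effect. For a one-off test, the same options can be typed into the boot entry from the GRUB menu instead.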

You said the last line is about modprobe configfs. Can you check in a log from an earlier boot which module was loaded next? And maybe try to exclude it?

There is also some hope that the crash message got saved to EFI pstore. Check /var/lib/systemd/pstore and /sys/fs/pstore after booting an older kernel.
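
A small sketch of that check. check_pstore is a hypothetical helper name; with no arguments it looks at the two directories named above and reports whether each one is missing, empty, or holds records:

```shell
# Report whether each given pstore directory holds any saved records.
# With no arguments, the two locations mentioned above are checked.
check_pstore() {
    if [ "$#" -eq 0 ]; then
        set -- /sys/fs/pstore /var/lib/systemd/pstore
    fi
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "$dir: missing"
        elif [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
            # An unreadable directory also ends up here, so run this as root.
            echo "$dir: empty"
        else
            echo "$dir: has records"
            ls -l "$dir"
        fi
    done
}
```

Both directories are normally readable only by root, so this is best run from a root shell; as an unprivileged user an unreadable directory would be reported as empty.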

Was there any issue reported earlier about this kernel version?

fdhhjigf · March 20, 2025, 4:36am

I updated my system to 6.13.6 and have had no issues whatsoever.

GM0x · March 20, 2025, 8:05pm

I’ve tried these suggestions but both pstore directories were empty and disabling reboot just left me stuck at the last message I saw previously. Obviously this is entirely unsatisfactory so I’ll see if I can tease out any further information now that I have some free time.

catacombs · March 20, 2025, 8:48pm

I had a boot loop when I mis-entered the disk password. Not sure which kernel.

If there is a mistake to be made, I will make it.

Edit

I mistyped the last character of the disk password.

GM0x · March 20, 2025, 9:45pm

Even adding ignore_loglevel to my kernel command line doesn’t get me a single extra message before it freezes. I don’t know if it helps but the line just after

Mar 20 17:39:02 dom0 systemd[1]: Finished modprobe@configfs.service - Load Kernel Module configfs.

in a successful (older kernel) boot is

Mar 20 17:39:02 dom0 kernel:  nvme0n1: p1 p2 p3
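
The comparison above can be reproduced from the journal of a boot that succeeded. A sketch, where lines_after_configfs is a hypothetical helper that reads journal text on stdin and prints the entries that follow the configfs line:

```shell
# Print the N log lines that follow the "Finished modprobe@configfs.service" entry.
# Reads journal text on stdin; N defaults to 5.
lines_after_configfs() {
    count="${1:-5}"
    awk -v n="$count" '
        remaining > 0 { print; remaining-- }                      # inside the window: print
        /Finished modprobe@configfs\.service/ { remaining = n }   # match: open the window
    '
}
```

For example, `sudo journalctl -b -1 | lines_after_configfs 10` prints the ten entries after that line for the previous boot. Which boot offset holds a successful boot with the older kernel is an assumption here; `journalctl --list-boots` shows the available ones.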

GM0x · March 20, 2025, 10:03pm

It seems like I really should get pstore working so I can get a look at the kernel panic. Any ideas why it isn’t working in the context of Qubes? Is it disabled by default for security reasons or something?

marmarek · March 21, 2025, 1:48am

Any chance of a serial console on this system? Or maybe a USB3 debug cable? That issue also has other debugging ideas.

GM0x · March 21, 2025, 3:05am

Sadly, there is no serial port or header on this mobo (ASRock X670E Taichi) as far as I know. I’ll take a look at the USB3 debug cable option.

marmarek · March 21, 2025, 3:41am

On a desktop system it may be easier / more reliable to get a serial port on a PCIe card (how to use it is also described in comments on that issue). But at the same time, those come with different chips and Xen doesn’t support all of them (most are fine, and in case it isn’t, it’s usually quite easy to add support for more models).

But before borrowing/buying extra stuff, try my earlier advice of checking which module gets loaded after configfs (the next modprobe@… line) and try to blacklist that. If that’s really nvme, that’s unfortunate, as the system won’t be very usable without its main disk. But still, it may be worth blacklisting the nvme module to confirm or reject this hypothesis.
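
When testing a blacklist entry, it helps to confirm that it actually reached the kernel command line. A small sketch, assuming the module_blacklist= parameter is used as in the command line quoted in the first post; is_module_blacklisted is a hypothetical helper name:

```shell
# Succeed if MODULE appears in a module_blacklist= entry of the given command line.
# With no second argument, the running kernel's /proc/cmdline is used.
is_module_blacklisted() {
    module="$1"
    cmdline="${2:-$(cat /proc/cmdline)}"
    for word in $cmdline; do            # split the command line on whitespace
        case "$word" in
            module_blacklist=*)
                list="${word#module_blacklist=}"
                case ",$list," in       # the value is a comma-separated list
                    *",$module,"*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}
```

For example, `is_module_blacklisted nvme && echo "nvme is blacklisted for this boot"`. This only inspects module_blacklist=; other mechanisms such as rd.driver.blacklist= are not covered by the sketch.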

GM0x · March 21, 2025, 6:20am

Blacklisting the nvme module does prevent the kernel panic, leaving me at a blinking cursor, but I’m not sure if that is because nvme is causing the panic or just because this leaves the system without a root filesystem. The order of boot messages seems to differ a lot between 6.6.77-1 and 6.13.6 too. I realized I no longer own anything that I can use for the other end of a serial console, so I’m going to have to wait for the cheap SBC I just bought to show up.

GM0x · March 24, 2025, 5:59am

@marmarek I managed to capture the kernel panic using the USB3 debug port method:

Panic

[    9.120051] systemd[1]: Mounted sys-kernel-tracing.mount - Kernel Trace File System.
[    9.121801] systemd[1]: Started systemd-journald.service - Journal Service.
[    9.225421] EXT4-fs (dm-3): re-mounted d7b88cf4-0823-4022-8451-0974d0a35c8c r/w. Quota mode: none.
[    9.240847] systemd-journald[929]: Received client request to flush runtime journal.
[    9.244236] systemd-journald[929]: File /var/log/journal/59002fca6860478d829cd93fb2a2748f/system.journal corrupted or uncleanly shut down, renaming and replacing.
[    9.367048] BUG: kernel NULL pointer dereference, address: 0000000000000070
[    9.368251] #PF: supervisor read access in kernel mode
[    9.369273] #PF: error_code(0x0000) - not-present page
[    9.370346] PGD 0 P4D 0 
[    9.371222] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[    9.372114] CPU: 0 UID: 0 PID: 128 Comm: kworker/0:2 Not tainted 6.13.6-1.qubes.fc37.x86_64 #1
[    9.373184] Hardware name: ASRock X670E Taichi/X670E Taichi, BIOS 3.20 02/21/2025
[    9.374183] Workqueue: kacpi_notify acpi_os_execute_deferred
[    9.375124] RIP: e030:cpufreq_update_limits+0x10/0x30
[    9.375840] Code: 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 05 98 e4 21 02 <48> 8b 40 70 48 85 c0 74 06 e9 a2 36 38 00 cc e9 ec fe ff ff 66 66
[    9.377009] RSP: e02b:ffffc9004058be28 EFLAGS: 00010246
[    9.377667] RAX: 0000000000000000 RBX: ffff888005bf4800 RCX: ffff88805d635fa8
[    9.378415] RDX: ffff888005bf4800 RSI: 0000000000000085 RDI: 0000000000000000
[    9.379127] RBP: ffff888005cd7800 R08: 0000000000000000 R09: 8080808080808080
[    9.379887] R10: ffff88800391abc0 R11: fefefefefefefeff R12: ffff888004e8aa00
[    9.380669] R13: ffff88805d635f80 R14: ffff888004e8aa15 R15: ffff8880059baf00
[    9.381514] FS:  0000000000000000(0000) GS:ffff88805d600000(0000) knlGS:0000000000000000
[    9.382345] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.383045] CR2: 0000000000000070 CR3: 000000000202c000 CR4: 0000000000050660
[    9.383786] Call Trace:
[    9.384335]  <TASK>
[    9.384886]  ? __die+0x23/0x70
[    9.385456]  ? page_fault_oops+0x95/0x190
[    9.386036]  ? exc_page_fault+0x76/0x190
[    9.386636]  ? asm_exc_page_fault+0x26/0x30
[    9.387215]  ? cpufreq_update_limits+0x10/0x30
[    9.387805]  acpi_processor_notify.part.0+0x79/0x150
[    9.388402]  acpi_ev_notify_dispatch+0x4b/0x80
[    9.389013]  acpi_os_execute_deferred+0x1a/0x30
[    9.389610]  process_one_work+0x186/0x3b0
[    9.390205]  worker_thread+0x251/0x360
[    9.390765]  ? srso_alias_return_thunk+0x5/0xfbef5
[    9.391376]  ? __pfx_worker_thread+0x10/0x10
[    9.391957]  kthread+0xd2/0x100
[    9.392493]  ? __pfx_kthread+0x10/0x10
[    9.393043]  ret_from_fork+0x34/0x50
[    9.393575]  ? __pfx_kthread+0x10/0x10
[    9.394090]  ret_from_fork_asm+0x1a/0x30
[    9.394621]  </TASK>
[    9.395106] Modules linked in: gpio_generic amd_3d_vcache acpi_pad(-) loop fuse xenfs dm_thin_pool dm_persistent_data dm_bio_prison amdgpu amdxcp i2c_algo_bit drm_ttm_helper ttm crct10dif_pclmul drm_exec crc32_pclmul gpu_sched crc32c_intel drm_suballoc_helper polyval_clmulni drm_panel_backlight_quirks polyval_generic drm_buddy ghash_clmulni_intel sha512_ssse3 drm_display_helper sha256_ssse3 sha1_ssse3 xhci_pci cec nvme sp5100_tco xhci_hcd nvme_core nvme_auth video wmi xen_acpi_processor xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath
[    9.398698] CR2: 0000000000000070
[    9.399266] ---[ end trace 0000000000000000 ]---
[    9.399880] RIP: e030:cpufreq_update_limits+0x10/0x30
[    9.400528] Code: 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 05 98 e4 21 02 <48> 8b 40 70 48 85 c0 74 06 e9 a2 36 38 00 cc e9 ec fe ff ff 66 66
[    9.401673] RSP: e02b:ffffc9004058be28 EFLAGS: 00010246
[    9.402316] RAX: 0000000000000000 RBX: ffff888005bf4800 RCX: ffff88805d635fa8
[    9.403060] RDX: ffff888005bf4800 RSI: 0000000000000085 RDI: 0000000000000000
[    9.403819] RBP: ffff888005cd7800 R08: 0000000000000000 R09: 8080808080808080
[    9.404581] R10: ffff88800391abc0 R11: fefefefefefefeff R12: ffff888004e8aa00
[    9.405332] R13: ffff88805d635f80 R14: ffff888004e8aa15 R15: ffff8880059baf00
[    9.406063] FS:  0000000000000000(0000) GS:ffff88805d600000(0000) knlGS:0000000000000000
[    9.406830] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.407561] CR2: 0000000000000070 CR3: 000000000202c000 CR4: 0000000000050660
[    9.408318] Kernel panic - not syncing: Fatal exception
[    9.409022] Kernel Offset: disabled
(XEN) Hardware Dom0 crashed: 'noreboot' set - not rebooting.

Let me know if you need any other information.
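
In the capture above RAX is 0 and the faulting address (CR2) is 0x70, which is consistent with a read at offset 0x70 through a NULL pointer inside cpufreq_update_limits. For a long console capture, the headline fields can be pulled out with a short filter. This is only a sketch; summarize_oops is a hypothetical helper name and the set of fields is a choice made here:

```shell
# Print the headline fields of a kernel oops read from stdin:
# the BUG line, the work queue, the faulting instruction pointer and the panic line.
# Timestamps such as "[    9.367048] " are stripped; repeated lines are printed once.
summarize_oops() {
    sed -n -E 's/^\[ *[0-9]+\.[0-9]+\] //; /^(BUG:|Workqueue:|RIP:|Kernel panic)/p' |
        awk '!seen[$0]++'
}
```

With the capture saved to a file (say panic.log, a placeholder name), `summarize_oops < panic.log` gives a few lines that are easy to paste into a bug report next to the full log.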

marmarek · March 24, 2025, 11:54am

Two things:

What was the last working kernel version? 6.12.11? 6.13.4?
Try blacklisting the xen_acpi_processor module. Does it help?

GM0x · March 24, 2025, 8:03pm

6.12.11 works, no panic. I don’t see a version of kernel-latest with 6.13.4 in the Qubes repos. Do you want me to compile it? I haven’t compiled any kernels for dom0 yet, but I will if it would be useful.
Blacklisting xen_acpi_processor doesn’t help, kernel still panics. The log follows:

6.13.6 w/ xen_acpi_processor blacklisted

[    8.601993] systemd[1]: Starting systemd-modules-load.service - Load Kernel Modules...
[    8.603299] fuse: init (API version 7.41)
[    8.603315] loop: module loaded
[    8.603822] systemd[1]: Starting systemd-network-generator.service - Generate network units from Kernel command line...
[    8.606983] systemd[1]: Starting systemd-remount-fs.service - Remount Root and Kernel File Systems...
[    8.609029] systemd[1]: Starting systemd-udev-trigger.service - Coldplug All udev Devices...
[    8.610384] Module xen_acpi_processor is blacklisted
[    8.613341] systemd[1]: Activated swap dev-mapper-qubes_dom0\x2dswap.swap - /dev/mapper/qubes_dom0-swap.
[    8.614973] systemd[1]: Mounted dev-mqueue.mount - POSIX Message Queue File System.
[    8.615739] systemd[1]: Mounted proc-xen.mount - Mount /proc/xen files.
[    8.616856] systemd[1]: Started systemd-journald.service - Journal Service.
[    8.737501] EXT4-fs (dm-3): re-mounted d7b88cf4-0823-4022-8451-0974d0a35c8c r/w. Quota mode: none.
[    8.747078] systemd-journald[918]: Received client request to flush runtime journal.
[    8.749358] systemd-journald[918]: File /var/log/journal/59002fca6860478d829cd93fb2a2748f/system.journal corrupted or uncleanly shut down, renaming and replacing.
[    8.844869] BUG: kernel NULL pointer dereference, address: 0000000000000070
[    8.845716] #PF: supervisor read access in kernel mode
[    8.846271] #PF: error_code(0x0000) - not-present page
[    8.846807] PGD 0 P4D 0 
[    8.847280] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[    8.847796] CPU: 0 UID: 0 PID: 796 Comm: kworker/0:2 Not tainted 6.13.6-1.qubes.fc37.x86_64 #1
[    8.848407] Hardware name: ASRock X670E Taichi/X670E Taichi, BIOS 3.20 02/21/2025
[    8.848982] Workqueue: kacpi_notify acpi_os_execute_deferred
[    8.849537] RIP: e030:cpufreq_update_limits+0x10/0x30
[    8.850067] Code: 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 05 98 e4 21 02 <48> 8b 40 70 48 85 c0 74 06 e9 a2 36 38 00 cc e9 ec fe ff ff 66 66
[    8.850915] RSP: e02b:ffffc9004069fe28 EFLAGS: 00010246
[    8.851443] RAX: 0000000000000000 RBX: ffff888005b64800 RCX: ffff88805d635fa8
[    8.852037] RDX: ffff888005b64800 RSI: 0000000000000085 RDI: 0000000000000000
[    8.852608] RBP: ffff888005361000 R08: 0000000000000000 R09: 8080808080808080
[    8.853195] R10: ffff888005ab0680 R11: fefefefefefefeff R12: ffff8880053eea00
[    8.853837] R13: ffff88805d635f80 R14: ffff8880053eea15 R15: ffff88800bebaf00
[    8.854432] FS:  0000000000000000(0000) GS:ffff88805d600000(0000) knlGS:0000000000000000
[    8.855042] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.855595] CR2: 0000000000000070 CR3: 000000001c7f2000 CR4: 0000000000050660
[    8.856196] Call Trace:
[    8.856671]  <TASK>
[    8.857153]  ? __die+0x23/0x70
[    8.857632]  ? page_fault_oops+0x95/0x190
[    8.858132]  ? exc_page_fault+0x76/0x190
[    8.858638]  ? asm_exc_page_fault+0x26/0x30
[    8.859135]  ? cpufreq_update_limits+0x10/0x30
[    8.859640]  acpi_processor_notify.part.0+0x79/0x150
[    8.860149]  acpi_ev_notify_dispatch+0x4b/0x80
[    8.860662]  acpi_os_execute_deferred+0x1a/0x30
[    8.861215]  process_one_work+0x186/0x3b0
[    8.861759]  worker_thread+0x251/0x360
[    8.862255]  ? srso_alias_return_thunk+0x5/0xfbef5
[    8.862742]  ? __pfx_worker_thread+0x10/0x10
[    8.863219]  kthread+0xd2/0x100
[    8.863670]  ? __pfx_kthread+0x10/0x10
[    8.864140]  ret_from_fork+0x34/0x50
[    8.864596]  ? __pfx_kthread+0x10/0x10
[    8.865052]  ret_from_fork_asm+0x1a/0x30
[    8.865517]  </TASK>
[    8.865963] Modules linked in: fjes(+) gpio_amdpt(+) amd_3d_vcache gpio_generic acpi_pad(-) loop fuse xenfs dm_thin_pool dm_persistent_data dm_bio_prison amdgpu amdxcp i2c_algo_bit drm_ttm_helper ttm crct10dif_pclmul crc32_pclmul drm_exec crc32c_intel gpu_sched polyval_clmulni drm_suballoc_helper polyval_generic drm_panel_backlight_quirks ghash_clmulni_intel drm_buddy sha512_ssse3 nvme sha256_ssse3 drm_display_helper xhci_pci sha1_ssse3 nvme_core cec sp5100_tco nvme_auth xhci_hcd video wmi xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath
[    8.868413] CR2: 0000000000000070
[    8.868895] ---[ end trace 0000000000000000 ]---
[    8.869433] RIP: e030:cpufreq_update_limits+0x10/0x30
[    8.870018] Code: 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 05 98 e4 21 02 <48> 8b 40 70 48 85 c0 74 06 e9 a2 36 38 00 cc e9 ec fe ff ff 66 66
[    8.870863] RSP: e02b:ffffc9004069fe28 EFLAGS: 00010246
[    8.871440] RAX: 0000000000000000 RBX: ffff888005b64800 RCX: ffff88805d635fa8
[    8.872061] RDX: ffff888005b64800 RSI: 0000000000000085 RDI: 0000000000000000
[    8.872678] RBP: ffff888005361000 R08: 0000000000000000 R09: 8080808080808080
[    8.873276] R10: ffff888005ab0680 R11: fefefefefefefeff R12: ffff8880053eea00
[    8.873918] R13: ffff88805d635f80 R14: ffff8880053eea15 R15: ffff88800bebaf00
[    8.874540] FS:  0000000000000000(0000) GS:ffff88805d600000(0000) knlGS:0000000000000000
[    8.875154] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.875725] CR2: 0000000000000070 CR3: 000000001c7f2000 CR4: 0000000000050660
[    8.876349] Kernel panic - not syncing: Fatal exception
[    8.876901] Kernel Offset: disabled
(XEN) Hardware Dom0 crashed: 'noreboot' set - not rebooting.

marmarek · March 25, 2025, 1:55pm

No. It was in the testing repository for some time, so there was a chance you had it too. Knowing that 6.12.11 works should be good enough.

Anyway, I don’t have other ideas. I’ll post a bug report to the kernel developers. I’ll let you know if any more info is needed (likely the full kernel log, i.e. the output of sudo dmesg, and maybe some of the ACPI tables from /sys/firmware/acpi/tables).
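
If that information is requested, it can be gathered in one go. A sketch under the assumption that plain copies of the table files are acceptable; collect_report and the output layout are hypothetical:

```shell
# Collect a kernel log and the raw ACPI tables into OUTDIR (default: ./kernel-report).
# The second argument overrides the tables directory, mainly so the function
# can be tried without root.
collect_report() {
    outdir="${1:-./kernel-report}"
    tables_dir="${2:-/sys/firmware/acpi/tables}"
    mkdir -p "$outdir/acpi" || return 1
    dmesg > "$outdir/dmesg.txt" 2>/dev/null || echo "dmesg failed (root needed?)" >&2
    for table in "$tables_dir"/*; do
        [ -f "$table" ] || continue   # skip subdirectories such as dynamic/
        cat "$table" > "$outdir/acpi/$(basename "$table")" || echo "cannot read $table" >&2
    done
}
```

Run from a root shell, `collect_report ./kernel-report` followed by `tar czf kernel-report.tar.gz kernel-report` produces a single archive to attach.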

marmarek · March 27, 2025, 12:29pm

There is a tentative fix in pull request #1085 on QubesOS/qubes-linux-kernel (“Apply fix for cpufreq crash on AMD”).

I uploaded it to the unstable repository. You can install it with:

qubes-dom0-update --enablerepo=qubes-dom0-unstable --action=update kernel-latest

You should get kernel-latest-6.13.8-1.qubes.1.fc37.x86_64
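
After rebooting, it is worth confirming that the new kernel is the one actually running. A minimal sketch, assuming `uname -r` reports a release string that starts with the package version; kernel_is is a hypothetical helper name:

```shell
# Succeed if the kernel release starts with the expected version string.
# The release defaults to the running kernel (uname -r).
kernel_is() {
    expected="$1"
    release="${2:-$(uname -r)}"
    case "$release" in
        "$expected"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `kernel_is 6.13.8-1.qubes.1 && echo "running the patched kernel"`. `rpm -q kernel-latest` lists which versions of the package are installed.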

GM0x · March 27, 2025, 6:51pm

Success, boots and no panic so far. Attached dmesg output. Was this related to Xen lacking support for the frequency scaling features of Zen 4 CPUs or is that a coincidence?
6.13.8-dmesg.tar.gz (29.8 KB)

Edit: I saw your message on the xen-devel mailing list and subsequently noticed dom0 is lacking a /sys/devices/system/cpu/cpu0/cpufreq/ directory as mentioned here
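
The missing directory can be checked directly. A small sketch; has_cpufreq_dir is a hypothetical helper, and the second argument exists only so the sysfs root can be overridden when trying it out:

```shell
# Succeed if the given CPU exposes a cpufreq directory in sysfs.
# Arguments: CPU name (default cpu0) and sysfs CPU root (default /sys/devices/system/cpu).
has_cpufreq_dir() {
    cpu="${1:-cpu0}"
    root="${2:-/sys/devices/system/cpu}"
    [ -d "$root/$cpu/cpufreq" ]
}
```

For example, `has_cpufreq_dir cpu0 && echo present || echo absent`. An absent directory is what one would expect when no cpufreq driver is registered in dom0, which would fit the NULL pointer read seen in the panic, though that link is an inference here.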

marmarek · March 31, 2025, 1:32pm

Would you like your test report to be included in the fix? If so, respond to [PATCH v1 00/10] cpufreq: cpufreq_update_limits() fix and some cleanups (see the “reply” link with instructions) with the Tested-by tag, or tell me what you’d like to put there. The tag usually has the form Tested-by: Full name <email>. If you prefer not to share, that’s fine too.

GM0x · April 2, 2025, 7:46pm

I’d prefer not to, thanks though.