Just updated kernel-latest to kernel-latest-1000:6.13.6-1.qubes.fc37, which unfortunately results in a reboot right after Finished - modprobe@configfs.service - load kernel module configfs. The systemd journal for that boot ends before it gets there; I guess it’s crashing before it can write anything to it. I’ve never tried to debug a kernel that crashes straight to a reboot during boot before. What other information should I provide?
I have two GPUs: an AMD integrated GPU on my CPU that I use for dom0, and an NVIDIA 4090 that I pass to an HVM qube. The 4090 is at PCI address 01:00.0, hence the hide_pci entry.
Do you see any crash message on the screen, even briefly? If so, you can append panic=0 to the vmlinuz options and/or noreboot to the Xen options to keep the message on the screen longer, or at least to see more clearly what the last message before the crash is…
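For reference, a minimal sketch of making those two options persistent in dom0. It assumes a GRUB-based install where /etc/default/grub holds both command lines; the file location and the grub.cfg output path are assumptions and may differ on your release. For a one-off test, pressing e at the GRUB menu and editing the lines by hand avoids touching any files.

```shell
# /etc/default/grub in dom0 (assumed location; grub2-mkconfig sources it as shell)

# Linux (vmlinuz) options: panic=0 means "do not reboot automatically after a panic".
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX panic=0"

# Xen options: noreboot stops Xen from rebooting the machine after an error.
GRUB_CMDLINE_XEN_DEFAULT="$GRUB_CMDLINE_XEN_DEFAULT noreboot"

# Afterwards regenerate the GRUB config, for example (path is an assumption):
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```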
You said the last line is about modprobe configfs. Can you check in a log from an earlier boot which module was loaded next, and maybe try to exclude it?
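One way to look that up, as a rough sketch: the helper below (next_modprobe_after is a made-up name, not an existing tool) reads journal text on stdin and prints the first modprobe@ line that follows the named module. systemd starts several of these units in parallel, so the order is only a hint.

```shell
# next_modprobe_after MODULE: read journal text on stdin and print the first
# modprobe@ line (for a different module) that appears after MODULE's own lines.
next_modprobe_after() {
    awk -v mod="modprobe@$1.service" '
        index($0, mod)      { seen = 1; next }   # skip the module itself
        seen && /modprobe@/ { print; exit }      # first later modprobe@ line
    '
}

# Usage in dom0, against the previous boot (pick a boot that worked, see
# "journalctl --list-boots"):
#   journalctl -b -1 --no-pager | next_modprobe_after configfs
```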
There is also some hope that the crash message got saved to EFI pstore. Check /var/lib/systemd/pstore and /sys/fs/pstore after booting an older kernel.
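A small sketch for checking both locations in one go. list_pstore is a hypothetical helper name; the two paths are the ones mentioned above. Reading /var/lib/systemd/pstore may need root, and a directory you cannot read is reported the same way as an empty one.

```shell
# list_pstore DIR...: report for each directory whether it holds any entries.
list_pstore() {
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
            echo "$dir: has entries"
            ls -l "$dir"
        else
            echo "$dir: empty or missing"
        fi
    done
}

# Usage in dom0, from a root shell:
#   list_pstore /var/lib/systemd/pstore /sys/fs/pstore
```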
Was there any issue reported earlier about this kernel version?
I’ve tried these suggestions, but both pstore directories were empty, and disabling reboot just left me stuck at the last message I saw previously. Obviously this is entirely unsatisfactory, so I’ll see if I can tease out any further information now that I have some free time.
Even adding ignore_loglevel to my kernel command line doesn’t get me a single extra message before it freezes. I don’t know if it helps but the line just after
It seems like I really should get pstore working so I can get a look at the kernel panic. Any ideas why it isn’t working in the context of Qubes? Is it disabled by default for security reasons or something?
On a desktop system it may be easier / more reliable to get a serial port on a PCIe card (how to use it is also described in comments on that issue). At the same time, those come with different chips and Xen doesn’t support all of them (most are fine, and when one isn’t, it’s usually quite easy to add support for more models).
But before borrowing/buying extra stuff, try my earlier advice of checking which module gets loaded after configfs (the next modprobe@… line) and try to blacklist that. If that’s really nvme, that’s unfortunate, as the system won’t be very usable without its main disk. Still, it may be worth blacklisting the nvme module to confirm or reject this hypothesis.
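As a hedged sketch of one way to do that: the kernel's module_blacklist= parameter takes a comma-separated list of modules that it refuses to load at all. For a one-off test it can be appended to the vmlinuz line from the GRUB menu (press e); the persistent variant below assumes /etc/default/grub is the right file on your install.

```shell
# /etc/default/grub in dom0 (assumed location).
# module_blacklist= is a kernel parameter; nvme is the module under suspicion.
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX module_blacklist=nvme"

# Regenerate grub.cfg afterwards, and remove the entry again once the test is
# done, since the system cannot reach an NVMe root disk without this module.
```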
Blacklisting the nvme module does prevent the kernel panic, leaving me at a blinking cursor, but I’m not sure if that is because nvme is causing the panic or just because this leaves the system rootless. The order of boot messages seems to differ a lot between 6.6.77-1 and 6.13.6 too. I realized I no longer own anything that I can use for the other end of a serial console, so I’m going to have to wait for the cheap SBC I just bought to show up.
6.12.11 works, no panic. I don’t see a version of kernel-latest with 6.13.4 in the Qubes repos; do you want me to compile it? I haven’t compiled any kernels for dom0 yet, but I will if it would be useful.
Blacklisting xen_acpi_processor doesn’t help, kernel still panics. The log follows:
No. It was in the testing repository for some time, so there was a chance you had it too. Knowing that 6.12.11 works should be good enough.
Anyway, I don’t have other ideas. I’ll post a bug report to the kernel developers and let you know if any more info is needed (likely the full kernel log, i.e. the output of sudo dmesg, and maybe some of the ACPI tables from /sys/firmware/acpi/tables).
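In case it saves a round trip, a rough sketch of gathering both pieces ahead of time. collect_acpi_tables is a hypothetical helper and the output file names are arbitrary; it copies the tables out of sysfs first and packs the copy.

```shell
# collect_acpi_tables SRC OUT: copy the ACPI tables directory SRC to a scratch
# location and pack it into the gzip-compressed tarball OUT.
collect_acpi_tables() {
    src="$1"
    out="$2"
    work=$(mktemp -d) || return 1
    cp -r "$src" "$work/acpi-tables" && tar -czf "$out" -C "$work" acpi-tables
    status=$?
    rm -rf "$work"
    return "$status"
}

# Usage in dom0, from a root shell:
#   dmesg > dmesg.txt
#   collect_acpi_tables /sys/firmware/acpi/tables acpi-tables.tar.gz
```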
Success, boots and no panic so far. Attached dmesg output. Was this related to Xen lacking support for the frequency scaling features of Zen 4 CPUs or is that a coincidence? 6.13.8-dmesg.tar.gz (29.8 KB)
Edit: I saw your message on the xen-devel mailing list and subsequently noticed dom0 is lacking a /sys/devices/system/cpu/cpu0/cpufreq/ directory as mentioned here
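For anyone wanting to check the same thing on their own machine, a minimal sketch. has_cpufreq is a made-up helper; the optional argument only exists so the check can be pointed at a different sysfs root.

```shell
# has_cpufreq [CPU_ROOT]: succeed if cpu0 exposes a cpufreq directory.
has_cpufreq() {
    root="${1:-/sys/devices/system/cpu}"
    [ -d "$root/cpu0/cpufreq" ]
}

if has_cpufreq; then
    echo "cpu0/cpufreq is present"
else
    echo "cpu0/cpufreq is missing"
fi
```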
Would you like your test report to be included in the fix? If so, respond to [PATCH v1 00/10] cpufreq: cpufreq_update_limits() fix and some cleanups (see the “reply” link with instructions) with the Tested-by tag, or tell me what you’d like to put there. The tag is usually of the form Tested-by: Full name <email>. If you prefer not to share, that’s fine too.