Random crashes, need help troubleshooting

Hello,

I am experiencing many random crashes on my ThinkPad L14 G2 (Ryzen 7 PRO 5850U) with a no-name USB-C dock. There are 4 monitors: the integrated display, one on the laptop's own HDMI port, and 2 on the dock's HDMI ports. A mouse and keyboard are attached to the dock behind a KVM switch.

System: Qubes 4.1, Kernel: 5.10.104-3

There is no indication of errors in any logs (at least I have not found any; more on this at the end).

When it happens:
Completely unpredictable, no pattern observable. I have not found a reliable way to trigger a crash, but I think it is slightly more likely to happen when starting many qubes as fast as possible right after booting. Or maybe not…

How often?
Varies quite a bit. Sometimes it runs stable for many hours, sometimes it crashes within minutes. Crashed 13 times yesterday and 4 times today (so far).

How is it crashing?
Everything freezes for about 2 seconds, then unfreezes for 2 seconds, then the screen goes black and the machine reboots.
One time all external monitors went black for 1 second, then everything froze.

Some notes:

  • This did not happen on 4.0 as far as I can tell. I got the dock two weeks before upgrading to 4.1, so there is a very tiny chance of this happening on 4.0 too, but I honestly do not think so.

  • When booting with the dock attached, I cannot enter my FDE password in the GUI. When switching to a TTY I see 4 characters already entered there before I type anything. Deleting them and entering my password there works fine, but only with the integrated keyboard. This does not occur without the dock.

  • Sometimes when booting with the dock attached, the ‘E’ key is virtually stuck in a pressed position. Disconnecting the keyboard or switching the KVM switch interrupts the ‘E-spree’, but it (sometimes) continues on reconnecting/switching back. This happens on roughly 10% of boots.

  • On 4.0 it was pretty stable, however I observed two weird behaviors:

  1. System freezing. Very infrequently, usually after around half a month of consecutive uptime.
  2. Virtually stuck keys: Sometimes keys on the keyboard seem to get stuck and are pressed indefinitely until I reboot. This happened about as often, so twice a month.
    (As those problems were that infrequent, I have not investigated the cause.)
  • Some naturally broken stuff like network manager sometimes not connecting to the wifi, monitor enumeration changing on reconnect, hibernation not working, crap battery runtime, broken CPU frequency scaling, and so on.

What I have checked
I have a dmesg -Tw window open at all times but was unable to see anything happening before it freezes.
I took a look at journalctl after the crashes, but nothing is logged at the time of the crashes.
Here is the journal around a crash that happened while writing this, and the reboot:

Apr 07 17:05:58 dom0 qrexec-policy-daemon[5064]: qrexec: whonix.SdwdateStatus+: sys-whonix -> disp9438: allowed to disp9438
Apr 07 17:05:58 dom0 qrexec-policy-daemon[5064]: qrexec: whonix.SdwdateStatus+: sys-whonix -> disp3981: allowed to disp3981
Apr 07 17:05:58 dom0 qrexec-policy-daemon[5064]: qrexec: whonix.SdwdateStatus+: sys-whonix -> disp5343: allowed to disp5343
-- Reboot --
Apr 07 17:07:00 dom0 kernel: Linux version 5.10.104-3.fc32.qubes.x86_64 (mockbuild@build-fedora4) (gcc (GCC) 10.3.1 20210422 (Red Hat 10.3.1-1), GNU ld version 2.34-6.fc32) #1 SMP Fri Mar 1>
Apr 07 17:07:00 dom0 kernel: Command line: placeholder root=/dev/mapper/qubes_dom0-root ro rd.luks.uid=luks-<scrubbed> rd.lvm.lv=qubes_dom0/root rd.lvm.lv=qubes_d>
Apr 07 17:07:00 dom0 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
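
To check quickly what was logged right before each crash, the journal output can be split at the "-- Reboot --" marker. Below is a minimal Python sketch, assuming the journalctl output was saved to a text file first. The function name, the `context` parameter and the file name mentioned in the comment are made up for illustration.

```python
from typing import List

# Marker line exactly as it appears in the journalctl excerpt above.
REBOOT_MARKER = "-- Reboot --"


def lines_before_reboots(journal_lines: List[str], context: int = 3) -> List[List[str]]:
    """For each reboot marker, return the last `context` lines logged before it."""
    tails = []
    for idx, line in enumerate(journal_lines):
        if line.strip() == REBOOT_MARKER:
            start = max(0, idx - context)
            tails.append([entry.rstrip("\n") for entry in journal_lines[start:idx]])
    return tails


if __name__ == "__main__":
    # Sample taken from the excerpt above. Real input could come from a file
    # written with `journalctl > journal.txt` (hypothetical file name).
    sample = [
        "Apr 07 17:05:58 dom0 qrexec-policy-daemon[5064]: qrexec: whonix.SdwdateStatus+: sys-whonix -> disp3981: allowed to disp3981",
        "Apr 07 17:05:58 dom0 qrexec-policy-daemon[5064]: qrexec: whonix.SdwdateStatus+: sys-whonix -> disp5343: allowed to disp5343",
        "-- Reboot --",
        "Apr 07 17:07:00 dom0 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'",
    ]
    for number, tail in enumerate(lines_before_reboots(sample, context=2), start=1):
        print(f"Last lines before reboot #{number}:")
        for entry in tail:
            print("  " + entry)
```

If the last lines before every marker are ordinary messages like here, that supports the impression that the system dies without dom0 getting a chance to log anything.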

Tried finding the Xen logs (tl dmesg-command) but got “command not found”. Here is where I got the idea from, maybe I am doing it wrong though?

Tried to read the /var/log/xen/xend.log as mentioned here but it is not there.

In /var/log/xen/console/hypervisor.log there is nothing special, but there are some IO page faults (those are normal, right?). I have a lot of those, and all of them look like the ones in this example. This is the crash and the reboot:

[2022-04-07 17:04:23] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:03:00.0 d1 addr fffffffdf8000000 flags 0x8 I
[2022-04-07 17:04:59] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:03:00.0 d1 addr fffffffdf8000000 flags 0x8 I
[2022-04-07 17:05:38] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:03:00.0 d1 addr fffffffdf8000000 flags 0x8 I
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^>
[2022-04-07 17:09:15] (XEN) Built-in command line: ept=exec-sp
[2022-04-07 17:09:15] (XEN) parameter "no-real-mode" unknown!
[2022-04-07 17:09:15]  Xen 4.14.4
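
Since there are a lot of these entries, it may help to count them per PCI device and domain ID instead of reading them one by one. The following is a minimal Python sketch, assuming the line format shown in the excerpt above (other Xen versions may format the message differently). The function name is made up for illustration.

```python
import re
from collections import Counter

# Matches lines like:
# (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:03:00.0 d1 addr fffffffdf8000000 flags 0x8 I
FAULT_RE = re.compile(
    r"IO_PAGE_FAULT: (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"d(?P<domid>\d+) addr (?P<addr>[0-9a-f]+)"
)


def count_io_page_faults(log_lines):
    """Count IO_PAGE_FAULT entries per (PCI address, domain ID) pair."""
    counts = Counter()
    for line in log_lines:
        match = FAULT_RE.search(line)
        if match:
            counts[(match.group("bdf"), int(match.group("domid")))] += 1
    return counts


if __name__ == "__main__":
    # Lines copied from the excerpt above. To read the real log, open
    # /var/log/xen/console/hypervisor.log with errors="replace", because the
    # file can contain runs of NUL bytes after a crash.
    sample = [
        "[2022-04-07 17:04:23] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:03:00.0 d1 addr fffffffdf8000000 flags 0x8 I",
        "[2022-04-07 17:04:59] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:03:00.0 d1 addr fffffffdf8000000 flags 0x8 I",
        "[2022-04-07 17:05:38] (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:03:00.0 d1 addr fffffffdf8000000 flags 0x8 I",
        "[2022-04-07 17:09:15] (XEN) Built-in command line: ept=exec-sp",
    ]
    for (bdf, domid), count in count_io_page_faults(sample).items():
        print(f"{bdf} (d{domid}): {count} faults")
```

The PCI address that comes out on top is the one worth looking up with lspci.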

So what is next?

Maybe it has something to do with the dock, but in this case i would assume that this behavior should have happened on 4.0 as well.

Maybe my RAM is bad? But again: I would have noticed much more crashes and freezes on 4.0 then.

I will try to run it without the dock and see how this goes. If it still fails, i will run a memtest.

In the meantime, is there anything more i can do/troubleshoot?

Thanks for your help in advance :slight_smile:

First hint : why do you have ept=exec-sp in your Xen command line ?
From the Xen manual, it’s only for Intel CPUs (see Xen’s man page: xen-command-line).
You should remove it.
Why do you have parameter "no-real-mode" unknown! ?
What’s your exact Xen command line ?

What’s the 0000:03:00.0 device (lspci -v -s 0:03:00.0) ?
Are IO page faults normal? Not really ^^ I never had these on my AMD Ryzen or Athlon dom0s.
Also, the “d1” -may- be the first VM started, but not sure.

Dumb question, have you tried to “dry-air” the keyboard ?
I recently had a computer which wouldn’t boot because of dust in the power button …
That’s certainly not the cause of the crashes, but it cannot be bad to do it ^^

Thank you so much for your help!

I have no idea… I have not tinkered with anything xen command related, but some frequency-scaling stuff.

My command line is:

Command line: placeholder console=none dom0_mem=min:1024M dom0_mem=max:4096M ucode=scan smt=off gnttab_max_frames=2048 gnttab_max_maptrack_frames=4096 no-real-mode edd=off

I can remove the no-real-mode in the /boot/efi/EFI/qubes/grub.cfg and see how it goes. But as it is reported as unknown, I don’t think it has any effect.

Ahh yeah, I completely missed the “IO” part. I thought “Probably the page is just swapped out” or something.

[Edit: This is wrong. It is not the dock, see next post.]

Indeed! My machine has 2 USB-C ports. I just assume this is the one I use to plug in my dock.

07:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir USB 3.1
07:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir USB 3.1

I will try to connect my dock to the other port and see if the page faults move to it. I don’t think this solves the problem, as it also occurs there iirc, but at least that would strengthen the suspicion that the dock is just faulty.

I don’t know what the d1 part is, but I don’t have a qube with that name.

Nope, I do not have compressed air handy but will try to get some.
But I do not have much hope, because there are two “stuck key” scenarios here:

  1. The internal keyboard: The problem has occurred since I bought the laptop new, it happens to different keys, and I have since swapped the keyboard assembly without any improvement.
  2. The external keyboard: I use this keyboard for another machine via a KVM switch, and on that system I have never seen this behaviour. Also it is always the “E” key that is “stuck down”, without me pressing it beforehand.

But yeah, I think blowing out the dust might be a good idea anyway :slight_smile:

So again: Thank you so much, armed with this information I can investigate further.

Nope it is not the dock, it is my wifi.

lspci -v -s 0:03:00.0

	Subsystem: Intel Corporation Dual Band Wireless-AC 8265
	Flags: bus master, fast devsel, latency 0, IRQ 40
	Memory at fd600000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: [c8] Power Management version 3
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [40] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number <scrubbed>
	Capabilities: [14c] Latency Tolerance Reporting
	Capabilities: [154] L1 PM Substates
	Kernel driver in use: pciback
	Kernel modules: iwlwifi

Switching the ports did not help either. Just had another crash :frowning:.

I will try to not use the dock now to see if this really is dock related or not.

Really, you should investigate those two

[2022-04-07 17:09:15] (XEN) Built-in command line: ept=exec-sp
[2022-04-07 17:09:15] (XEN) parameter "no-real-mode" unknown!

Carefully read both explanations on the xen-manpages :

Simply put : don’t use them.
You tried freq-scaling stuff, what exactly ? Remove your tests.
Try using/re-installing the Xen version provided by Qubes, and the stock command line.

Xen assigns numbers to domains internally. d1 is the first domU/VM started (use xl list to get the ID).
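
To turn a “dN” from the hypervisor log into a qube name, the ID column of xl list can be mapped to the Name column. Here is a minimal Python sketch, assuming the usual column layout of xl list (Name, ID, Mem, VCPUs, State, Time). The sample output is illustrative and not taken from the machine in this thread, and the function name is made up.

```python
def domid_to_name(xl_list_output: str) -> dict:
    """Map each domain ID to its name, based on the text printed by `xl list`."""
    mapping = {}
    for line in xl_list_output.splitlines():
        parts = line.split()
        # The header row has "ID" in the second column and is skipped here.
        if len(parts) >= 2 and parts[1].isdigit():
            mapping[int(parts[1])] = parts[0]
    return mapping


if __name__ == "__main__":
    # Illustrative output only. In dom0 the real text could be captured with
    # subprocess.run(["xl", "list"], capture_output=True, text=True).stdout
    # (this needs root).
    sample = (
        "Name                                        ID   Mem VCPUs\tState\tTime(s)\n"
        "Domain-0                                     0  4096     8     r-----     120.5\n"
        "sys-net                                      1   400     2     -b----      15.2\n"
        "sys-firewall                                 2   500     2     -b----       9.8\n"
    )
    names = domid_to_name(sample)
    print("d1 is", names.get(1, "unknown"))
```

Keep in mind that domain IDs are handed out in start order, so the mapping only holds for the boot session the log lines come from.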

Try not using the wifi card.
You could try other distros, like live ones, to see if those bugs happen too.
You can even try a live distro that can be persistent, and on which you install Xen.

When debugging a problem you should remove everything first, then add things.

BTW, just read that post : QSB-079: Two IOMMU-related Xen issues (XSA-399, XSA-400)

Have you updated already ? It may be as simple as that ^^

Thank you, will do this.

I tried other governors with xenpm set-scaling-governor.

Additionally, I may have tried the following (I cannot find evidence in the .bash_history files, so it was probably on an earlier installation):

  • (unsuccessfully) modprobing some scaling modules,
  • temporarily changing boot options to let dom0 handle the scaling

I have removed all occurrences of no-real-mode in my /boot/efi/EFI/qubes/grub.cfg, however there was no occurrence of “ept”, “exec” or “sp” in it, so I added ept=no-exec-sp to all xen_opts_rm (at the place where no-real-mode was).

After a reboot no-real-mode is gone, but ept=exec-sp is not. Maybe. I don’t know.
Here is the /var/log/xen/console/hypervisor.log of the reboot:

[2022-04-07 20:03:50] Logfile Opened
[2022-04-07 20:03:50] (XEN) Built-in command line: ept=exec-sp
[2022-04-07 20:03:50]  Xen 4.14.4
[2022-04-07 20:03:50] (XEN) Xen version 4.14.4 (mockbuild@[unknown]) (gcc (GCC) 10.3.1 20210422 (Red Hat 10.3.1-1)) debug=n  Wed Mar  9 00:00:00 UTC 2022
[2022-04-07 20:03:50] (XEN) Latest ChangeSet:
[2022-04-07 20:03:50] (XEN) Bootloader: GRUB 2.04
[2022-04-07 20:03:50] (XEN) Command line: placeholder console=none dom0_mem=min:1024M dom0_mem=max:4096M ucode=scan smt=off gnttab_max_frames=2048 gnttab_max_maptrack_frames=4096 ept=no-exec-sp edd=off

It does not complain about not knowing how to interpret “ept=no-exec-sp” like it did when I “turned it off the wrong way”, but ept=exec-sp is still there in the first line.

I really wonder where it comes from, as I definitely did not change anything related to that. Especially as I am on AMD, which does not have this stuff to begin with.

On a side note:
I had the “Stuck E” problem without the dock, but with my external keyboard, the USB hub (that I forgot about) and the KVM switch. When this fault occurs, the problem that I cannot enter the FDE password in the GUI occurs too, so the two seem related. Will try to pinpoint further :stuck_out_tongue:

As this would make things really uncomfortable for me, I opt to keep using it and see if taking out the dock did the trick with the crashes. Should it crash again I will physically detach it.
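
To double-check which options actually changed between two boots, the two “Command line:” entries from the hypervisor log can be compared token by token. This is a minimal Python sketch using the two command lines quoted in this thread; the function name is made up for illustration.

```python
def diff_cmdlines(before: str, after: str):
    """Return (removed, added) option tokens between two Xen command lines."""
    before_tokens = before.split()
    after_tokens = after.split()
    removed = [token for token in before_tokens if token not in after_tokens]
    added = [token for token in after_tokens if token not in before_tokens]
    return removed, added


if __name__ == "__main__":
    # Both strings are the "Command line:" values quoted in this thread.
    old = (
        "placeholder console=none dom0_mem=min:1024M dom0_mem=max:4096M ucode=scan "
        "smt=off gnttab_max_frames=2048 gnttab_max_maptrack_frames=4096 no-real-mode edd=off"
    )
    new = (
        "placeholder console=none dom0_mem=min:1024M dom0_mem=max:4096M ucode=scan "
        "smt=off gnttab_max_frames=2048 gnttab_max_maptrack_frames=4096 ept=no-exec-sp edd=off"
    )
    removed, added = diff_cmdlines(old, new)
    print("removed:", removed)
    print("added:", added)
```

Note that this only covers the “Command line:” entry. The “Built-in command line:” is logged separately and does not come from grub.cfg, so editing grub.cfg would not be expected to change it.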

Again, thank you very much for your help!

First, I dunno if you have read my answer about the XSA, try updating first ?

Never tried this ! Can you elaborate and/or give reference docs ? I always like to learn new things !

For future reference there’s a typo, it’s xen_rm_opts. But why are you modifying this there ?
I couldn’t find a ref to this in Xen manual, so I guess it’s a GRUB thing ?
AFAIK in Qubes, Xen command line should be modified in “/etc/default/grub”, the part “GRUB_CMDLINE_XEN_DEFAULT” ?

Also, “ept=no-exec-sp” shouldn’t work in the cmdline, it’s only an xl parameter that you should set after boot, so “at runtime”. As per the man page :

1.2.78 ept

    = List of [ ad=<bool>, pml=<bool>, exec-sp=<bool> ]
[...]
[The exec-sp boolean] may be modified at runtime using xl set-parameters ept=[no-]exec-sp to switch between fast and secure

Another strange thing is “(XEN) Built-in command line: ept=exec-sp”. I guess Qubes OS makes that option by default, but I don’t know where. I also have this in my logs and I have a Ryzen.
Maybe I was wrong after all and it’s not so important : if not on Intel CPU it just ignores it ?
At least on my virtualized Qubes I have no problem, but I can’t use PCI-PT so who knows.

I said to not use the Wifi card because the errors seemed to come from it (AMD-Vi: IO_PAGE_FAULT: 0000:03:00.0) !
I know it’s not practical but maybe remove it before not using the computer and leave it running ?
Just to be sure, is the Wifi card assigned to the 1st domU you start (maybe netVM or sys-usb) ?

Good luck with the keyboard thing, that looks strange … ^^

So, update: My laptop crashes without the dock attached, so the dock is out of the picture. Now the wifi is the prime suspect.

I had not done that at that point, but updated after you mentioned it. After that I waited a few hours to see whether it would crash again. Unfortunately it did.

I had problems with battery life, loud fans and high temperatures, so I decided to try to save some power. This issue is where my journey started. On my CPU I suspect that scaling is broken, P-states are missing, and reported frequencies are bugged out, like others report in this issue and the linked one. Here is the corresponding documentation about Xen power management. You can change the governor Xen uses with xenpm set-scaling-governor <governor>, but as far as I can tell it does not do much, if it does anything… I am not finished investigating this yet, as the crashing is much more important right now.

Yeah, might be a typo… So I have to admit: I am not a GRUB person… I always avoided dealing with the bootloader, so I don’t really have a clue what I am doing there lol.

On my old 4.0 the /boot/efi/EFI/qubes/grub.cfg was pretty straightforward, like here. Some menus, and one can modify the command and kernel options easily. Now with 4.1 my grub.cfg is a very long shell script with 5 or so places where those options are set. As I am hesitant to really dig into how that works, I just changed all places that seemed fitting.

Yes, I tried ept=exec-sp=off but this resulted in “unknown command”. So I am not sure why this is not working or what the correct syntax to disable it is…

Yup… I should have listened… I was running 4.0 for a while without those problems, so I figured it might not be the issue. It is connected to my sys-net on startup.

Now I have disabled it in my UEFI and attached a USB wifi dongle temporarily. The IO page faults are gone, but I don’t have enough uptime in this configuration yet to say that it is stable now. Time will tell.

Again, thank you very much for your help :slight_smile:

Update:
I have 24h of crash-free uptime now, so I think the wifi card is the cause.

I will create a new thread, as I think that is more fitting now.

Thanks for this ! The github thread is about Intel CPUs, dunno how that applies to AMD CPUs, but it gives nice hints.
On my Debian dom0, the ondemand scaling works OOTB on a Ryzen 1700x Pro, but I can’t enable Turbo mode. Need to test it on my Qubes.

I’m not using EFI boot so can’t help you, but on my Qubes the way I showed you works fine, only one line to change.

It should be xl set-parameters ept=exec-sp OR xl set-parameters ept=no-exec-sp

And I think sys-net is started first on startup, that’s why you had “d1” in the error log.

So it is indeed a problem with your Wifi card. At least you know where it comes from !

I’ve read it, you may also test “iommu=verbose”. Even if I don’t think I can help with the output, it may give enough info to others.
I’ve also read a few threads where Intel wifi cards don’t play nicely, but it was about AX200, dunno which one is yours.
You could also try a live distro with Xen, to see if it’s a problem with Qubes or Xen/kernel.
You may also ask on the Xen mailing list. It seems more a general problem than Qubes-related, but who knows …
Qubes makes it hard (but possible) to test network adapters in dom0, so you could also try this. Not good advice security-wise though ^^

No prob, thanks for the new info !
And good luck.

@baflya Would you be so kind as to indicate the make and model of your dock? Also, did you have to install any drivers?

I’m having a hard time getting any kind of definitive answer as to which 2+ monitor docks work well with Qubes OS.

I bought the cheapest dock with 2 HDMI ports that I could find. It is a Selore brand dock, SEUC3310.

Absolutely no driver or firmware installation necessary.