How to debug system freezing? (Please help)

Dear experts,

I’m rather desperate… I’m experiencing random freezes with QubesOS several times a day and, as you can imagine, this is rather frustrating when you’re relying on a system as a daily driver.

Over the past few weeks I have searched online for ideas on what to do and how to debug this, and I have looked for patterns, but I couldn’t find anything useful.
(The latest kernel didn’t help; reinstalling with BTRFS didn’t help; the freeze still occurs even if only dom0 is running; no errors show up in journalctl around the time of the freeze; …)

I really hope someone can help me!

P.S.: I already asked for help here and filed a bug report here but unfortunately didn’t get useful replies yet.

Have you tried Qubes OS 4.1?

This is so strange, I can’t understand it.

Do you happen to run a Zen APU? I had a fair amount of freezes before I applied this:

Did you read the post? The issue is quite different from yours (only the mouse and keyboard freeze, on both USB and PS/2; the Xfce clock still works and Xorg doesn’t crash).

I was indeed considering this but didn’t go through with it because I assumed 4.0 to be more stable than 4.1.

That’s why I did the Dom0 kernel update to 5.13.6-1 instead in the hope that this might show some improvement. But unfortunately I didn’t see any change there.

Thanks for the suggestion. I have an AMD Ryzen 7 5800X, so not an APU with integrated graphics.

Most of the freezes that I observed were indeed like that. But in the last few days I also observed freezes where the clock was frozen as well (I’ll keep monitoring this). I already updated the issue on GitHub accordingly.

In all freezing cases, the PS/2 keyboard LED no longer worked.

@wind.gmbh: Which log files did you monitor for observing your issue? So far I mainly looked at journalctl -p 0..4 -x, assuming that such issues would be shown there.
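Since a freeze usually ends in a hard reboot, the messages of interest would be in the journal of the previous boot rather than the current one. Below is a minimal Python sketch that assembles such a query, assuming the journal in dom0 is stored persistently (not confirmed in this thread). The function name `previous_boot_journal_cmd` and its defaults are hypothetical; `-b -1`, `-p`, `-n`, `-x` and `--no-pager` are standard journalctl options.

```python
from typing import List


def previous_boot_journal_cmd(max_priority: int = 4, lines: int = 200) -> List[str]:
    """Build a journalctl call for the tail of the previous boot's journal.

    max_priority follows the syslog levels (0 = emerg ... 7 = debug), so the
    default 0..4 matches the filter used in the post above.
    """
    if not 0 <= max_priority <= 7:
        raise ValueError("max_priority must be between 0 and 7")
    if lines < 1:
        raise ValueError("lines must be at least 1")
    return [
        "journalctl",
        "-b", "-1",                           # previous boot, i.e. the one that froze
        "-p", "0..{}".format(max_priority),   # priority range, most severe first
        "-n", str(lines),                     # only the last entries before the freeze
        "-x",                                 # add explanatory catalog texts
        "--no-pager",
    ]
```

The returned list could be passed to `subprocess.run()` in a dom0 terminal after rebooting. If the last lines before the freeze show nothing, that would be consistent with the log no longer reaching the disk.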

If you’re not that serious about security, it’s fine to use; it doesn’t have too many bugs (I’m very serious about security, so I don’t use it, but I can install it on my computer without workarounds :grinning:).


Edit: after an in-depth check of the logs, the only somewhat related thing I found is:

Sep 23 16:09:25 dom0 kernel: i8042: PNP: PS/2 appears to have AUX port disabled, if this is incorrect please boot with i8042.nopnp

When I search for this message, it often comes up together with problems where the PS/2 keyboard and/or mouse stop responding.

Update: I have now noticed that my Qubes 4.0 installation on another SSD doesn’t seem to be affected by those freezes (at least I didn’t get one all day). Both setups are basically identical (BTRFS file system; running on exactly the same PC; …). The only difference seems to be the type of SSD. I’m not aware of anything else being different.

With this SSD I’m experiencing the freezes: Samsung 870 EVO 4TB
With this SSD I didn’t experience the freezes yet: Samsung 860 EVO 2TB

I’ll continue testing this assumption.

Thanks for that @qpost135!

With this SSD I’m experiencing the freezes: Samsung 870 EVO 4TB

I am using a Samsung 870 QVO 2TB SSD. Turns out I should have done some reading instead of just taking one off the Best Buy shelf. Looks like I bought cheap crap and have no reason to complain.

I will get a better SSD, not only because of this issue but also because the 1,000 write cycles scare the crap out of me.


As for the debugging question:
Debugging hard drive issues is usually difficult because, by the time the error message is generated, the log can no longer be written to the disk.

The solution I normally use is to forward a copy of the logs over the network to an external system, so that after the crash I can see the error message (and hopefully get useful debugging information).

While incomplete, the beginning of talking about logging is here:

In your case you would need to log dom0, which has not been addressed yet in the document, but it appears rsyslog is installed in dom0, so presumably you could just add the file to /etc/rsyslog.d/ on dom0 and the logs should reach the logging qube.

Unfortunately, the part on forwarding the logs to an external system is not complete yet.

TL;DR: Ignore me

Sorry @qpost135, my answer had nothing to do with your thread. I just saw BTRFS and thought this was a reply to another thread I am involved with. That thread is about Qubes OS being unresponsive for a few seconds when shutting down a large HVM.

@Sven, were any of the freezes on the T430? And did they happen for you on 4.xx? Just a random thought from left field.

@Plexus: I mixed up several threads and topics and have in the process probably caused some confusion. Let me try to clean up after myself:

  1. Actual “freezes” as in: the entire system froze (incl. mouse cursor / dom0) and never became responsive again. That I have seen on the T430 with i7-3740QM and i7-3840QM but only when using kernel version 5.4.x … never with v4.19.x – there was nothing in the logs of any value. It appears the computer simply stopped and did nothing until it was hard rebooted. I can no longer reproduce this with the latest 5.4 kernel on the T430.

  2. A dramatic slow-down when shutting down my 100GB Windows HVM (corporate install). Here the mouse and switching desktops still work, but not much else. Even keyboard input slows down and eventually comes to a halt … for a few seconds. Then I get the notification that the qube shut down and everything goes back to normal. In that context someone recommended trying BTRFS. I saw this behavior on both the P51 and the T430 and thought it was related to ‘trim’ on SSDs.

… I saw @qpost135’s post and, because it mentioned BTRFS, I made the mental jump to the second topic, even though this thread is related to the first. So @qpost135: you might want to give kernel 4.19 a try and see if that changes anything for you.


So far, my theory seems to be valid. I only experienced one freeze after several VM upgrades and several changes. Apart from that, my system seems to be stable now. What a relief… :sweat_smile:

I currently assume that the issue is indeed caused by that SSD. At some point I might try to run a diagnosis on it with a Samsung tool or see if the controller on the SSD can be updated.

Could be this issue: Samsung 860/870 SSDs Continue Causing Problems For Linux Users - Phoronix


QVO = So-so, EVO = better, PRO = Best :slight_smile:


Sorry about the freezing. I’m not sure if my experience can help you, but here is something you may want to consider. I too had freezing on AMD. Then I went the route of getting an x230 i7 following the certification specs. That system just works™, except for the large data problem I described in the other thread, which hopefully BTRFS will fix. Also, that data problem seems to be a Qubes problem, not a hardware problem. I also bought a second x230 i7 for emergency migrations should my main system die. They’re so cheap to buy used if you hunt around.

I have 16GB RAM. If that’s low for you, you’ve got to make use of minimal templates and lower the default assigned RAM on many of your VMs. For example, if you use many disp-whonix VMs simultaneously, there’s likely no reason to have that set to 4GB per disp. Lower it to 1GB or even less if that’s all you need, depending on your browsing situation per disp.

Best of luck.


Thanks a lot for that hint. For some strange reason I didn’t come across that information yet.

In the meantime, I’m indeed also observing freezes with my Samsung 860 EVO.

I did a bit of digging and it seems that those TRIM flags are being introduced with kernel 5.15:

{ "Samsung SSD 860*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
					ATA_HORKAGE_ZERO_AFTER_TRIM |
					ATA_HORKAGE_NO_NCQ_ON_ATI, },
{ "Samsung SSD 870*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
					ATA_HORKAGE_ZERO_AFTER_TRIM |
					ATA_HORKAGE_NO_NCQ_ON_ATI, },

I just didn’t find out yet how I could manually deactivate this “queued TRIM” feature before the new kernel comes out.

Any ideas, anyone?

Another useful link in this context: Solid state drive - ArchWiki
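As a possible stopgap until a kernel with those table entries is available, the kernel documentation lists a `libata.force=noncqtrim` boot parameter that turns off queued TRIM. Whether it resolves the freezes described in this thread is untested here, so treat the following Python sketch as an illustration under that assumption. It checks a drive model against the two patterns quoted above and appends the parameter to a kernel command line. The function names are hypothetical, and `fnmatchcase` only approximates the kernel's own glob matching.

```python
from fnmatch import fnmatchcase

# Model patterns copied from the kernel table quoted above.
QUEUED_TRIM_PATTERNS = ("Samsung SSD 860*", "Samsung SSD 870*")

# libata.force value documented as disabling queued (NCQ) TRIM.
# Assumption: the running dom0 kernel supports this value.
WORKAROUND_ARG = "libata.force=noncqtrim"


def is_listed_model(model):
    """Return True if the drive model matches one of the quoted patterns."""
    name = model.strip()
    return any(fnmatchcase(name, pattern) for pattern in QUEUED_TRIM_PATTERNS)


def add_workaround(cmdline):
    """Append the workaround to a kernel command line.

    If a libata.force= argument is already present, the line is returned
    unchanged (apart from whitespace) so it can be merged by hand.
    """
    args = cmdline.split()
    if any(arg.startswith("libata.force=") for arg in args):
        return " ".join(args)
    return " ".join(args + [WORKAROUND_ARG])
```

The model string could be read from a file such as `/sys/block/sda/device/model` (the device name `sda` is an assumption). On a GRUB-based installation the resulting command line would typically go into `GRUB_CMDLINE_LINUX` in `/etc/default/grub`, followed by regenerating the GRUB configuration.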

Thanks for that suggestion. As far as I currently understand, the root cause of my problem is the controller on the SSD, which somehow messes up this TRIM feature. So, I would assume that another processor would not help with that. But thanks anyway for that suggestion!


I asked around and know someone with an x230 and an 870 EVO who reports no freezing.

Tell me if you want me to ask them to check anything in their terminal.


Thanks a lot for your effort. I just read here that it also seems to depend on the SATA controller.

I have a “1022” controller, which should supposedly not be affected (as stated in the post linked above).

Maybe it would indeed be interesting if you could send me the SATA controller line of the output of this command: lspci -v -nn

Mine looks like this (with the vendor ID “1022”):

06:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51) (prog-if 01 [AHCI 1.0])

Thanks!
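To compare controllers across machines, the bracketed `vendor:device` pair can be pulled out of the `lspci -v -nn` output automatically. This is a minimal Python sketch, assuming the SATA controller lines have the same layout as the sample line above. The function name `sata_controller_ids` is hypothetical.

```python
import re

# Matches a SATA controller line (PCI class 0106) and captures the
# [vendor:device] pair, e.g. [1022:7901].
_SATA_LINE = re.compile(
    r"SATA controller \[0106\]:.*\[([0-9a-f]{4}):([0-9a-f]{4})\]"
)


def sata_controller_ids(lspci_output):
    """Return a list of (vendor_id, device_id) tuples, one per SATA controller."""
    ids = []
    for line in lspci_output.splitlines():
        match = _SATA_LINE.search(line)
        if match:
            ids.append((match.group(1), match.group(2)))
    return ids


sample = (
    "06:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] "
    "FCH SATA Controller [AHCI mode] [1022:7901] (rev 51) (prog-if 01 [AHCI 1.0])"
)
print(sata_controller_ids(sample))
```

For the sample line this yields the vendor ID `1022` and the device ID `7901`.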

The PRO is already in, with BTRFS and my qubes restored.
