QubesOS freeze, crash and reboots

forum logistics

Sorry, this is super annoying. @fsflover split off those posts and moved them into a thread in “All Around Qubes”. However, this was neither “off-topic” nor a discussion by itself, and definitely nothing for “All Around Qubes”.

In the process of cleaning up, duplicates were created. Give me a few minutes to clean up here.

I have spent the last 2 days trying to gain as much insight as possible into what is causing the crashes on my system. Prior to applying the latest dom0 updates I could somewhat reliably recreate the crashes using the method mentioned above, where I create a 50 GB+ file and copy it between VMs. About 1 in 3 times this would work without crashing, but the other 2 times the system would get very hot and eventually crash. What is a bit strange is that I don’t get a log entry stating that the system is overheating (which I was definitely getting a few months ago).
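
For anyone who wants to try the same reproduction, here is a minimal sketch of creating the test file. The function name `make_test_file` and the file path are illustrative, not from this thread, and the `dd` flags assume GNU coreutils. It only defines a helper; the actual copy is shown as a comment because it must be run inside a qube.

```shell
#!/bin/sh
# Sketch only: create a large file to copy between qubes.
# make_test_file PATH SIZE_MB   (hypothetical helper name)
make_test_file() {
    path="$1"
    size_mb="$2"
    # /dev/urandom gives data that is neither sparse nor trivially compressible.
    # iflag=fullblock makes dd keep reading until each 1 MiB block is full.
    dd if=/dev/urandom of="$path" bs=1M count="$size_mb" iflag=fullblock status=none
}

# Example, inside the source qube (51200 MiB is 50 GiB):
#   make_test_file "$HOME/bigfile.bin" 51200
#   qvm-copy "$HOME/bigfile.bin"
```

Keep an eye on free disk space in both qubes before running this with a large size.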

Fortunately the latest dom0 updates seem to have worked wonders on my system and I am no longer able to reproduce this crashing. The system is now able to regulate its temperature very well even when doing 300 GB transfers. I will continue testing this over the next few days, but this is a very positive sign: I haven’t been able to do a 300 GB transfer in about 3 months.

Stable or testing repositories?

I haven’t seen a freeze in days either (usually I see one right after writing this ;-).

Out of curiosity, which versions of the kernel and Xen did you use? I wonder if the recent addition (+ revert) in “VMs don't boot on 4Kn drive” (Issue #7828, QubesOS/qubes-issues on GitHub) had anything to do with improved performance for larger file transfers.

Since the last updates, my x230 no longer freezes :joy: on 4.1.1 stable. In the past, freezes happened most frequently during updates. Others seem to report the same. No problems here today. Single x230 data point, user report only.

Stable. That is good news, please keep me updated on whether you encounter one (I will do the same from my end). Based on what I’m hearing from others things are definitely looking up.

After the updates I ended up on kernel version 5.15.64-1 according to uname -r in dom0.

According to dnf list, my xen packages are 4.14.5-7. Is it ok to run dnf commands in dom0? I thought I read that everything had to go through qubes-dom0-update?

This is very good. Please continue to update this thread with whether or not you encounter crashes going forward. I’m still not ready to update my main system quite yet.

Crashed again.

Same as usual, overheated during a large file copy.

Do you have an ‘overheated’ message in the log?

You mentioned sensors I think… did you see the temp before the crash?

I think there is an important distinction to be made here. My understanding of “crash” is that the machine turns off, while “freeze” means it becomes totally unresponsive but you can still see what was on screen and have to restart the machine.

I have not seen any “crash” at all but “freezes”.

Could it be that your particular machine has a cooling issue?

To everyone who has reported, or experienced, system stability problems,
it would be incredibly helpful if you could:

  1. Send details of your system, including a rough note of when you
    noticed instability. (e.g “2 months ago”, “since installing 4.1.1”,
    “since updating dom0 last week”, “all the time I have been using Qubes”)
    1a. A baseline of your experience would be helpful - e.g “crashes every
    day”, “at least twice a day”, “occasionally”.
    1b. Some comment on what “triggers” you think you have seen- updates,
    starting qubes, transferring large files, etc.
    1c. If you have been able to find a kernel/xen combination that works for
    you, please send details.

You can check the kernel version using uname -ar, and the Xen version
with xl info
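
If it helps when putting a report together, here is a rough sketch that prints just the two version strings. The helper names `xen_version_from_info` and `print_versions` are made up for illustration. It assumes `xl info` prints a `xen_version : ...` line, as in the output posted elsewhere in this thread.

```shell
#!/bin/sh
# Sketch only: pull the Xen version out of `xl info` text read from stdin.
xen_version_from_info() {
    awk -F': *' '$1 ~ /^xen_version[[:space:]]*$/ { print $2; exit }'
}

# Print kernel and Xen versions in a form that is easy to paste into a report.
print_versions() {
    printf 'kernel: %s\n' "$(uname -r)"
    # xl only exists in dom0, so skip the Xen line elsewhere.
    if command -v xl >/dev/null 2>&1; then
        printf 'xen: %s\n' "$(xl info | xen_version_from_info)"
    fi
}

# Example, in a dom0 terminal:
#   print_versions
```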

To keep this out of the public eye, you can PM me or email me: my details are here
My PGP key is linked there too.

  2. There’s a suggestion that the latest stable kernel may fix some of
    these problems.
    If possible, please try this and report back.
    It’s really important that you don’t break your system - you can
    avoid removing a kernel that works for you by updating while running
    that kernel.
    You can also set the number of kernels to be retained by editing
    /etc/dnf/dnf.conf, changing installonly_limit to some higher number than 3
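
A possible way to script the `installonly_limit` change is sketched below. The helper name `set_installonly_limit` is hypothetical. It assumes the file has only a `[main]` section, so that appending a missing key at the end is safe, and it uses GNU `sed -i`. Back up the file first and run it as root.

```shell
#!/bin/sh
# Sketch only: set installonly_limit in a dnf.conf-style file.
# set_installonly_limit FILE LIMIT   (hypothetical helper name)
set_installonly_limit() {
    conf="$1"
    limit="$2"
    if grep -q '^installonly_limit=' "$conf"; then
        # Key already present: rewrite its value in place.
        sed -i "s/^installonly_limit=.*/installonly_limit=$limit/" "$conf"
    else
        # Key missing: append it (assumes [main] is the only section).
        printf 'installonly_limit=%s\n' "$limit" >> "$conf"
    fi
}

# Example, in dom0 as root:
#   cp /etc/dnf/dnf.conf /etc/dnf/dnf.conf.bak
#   set_installonly_limit /etc/dnf/dnf.conf 6
```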

If you have made any changes to the default settings - e.g disabling
swap, limiting dom0 memory to less than the default 4096M, please
revert those changes when testing.
Also, please don’t use the testing repositories for this. That would be a
separate test against a moving target.
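
To check whether dom0 memory has been changed from the default, one option is to read the `dom0_mem=max:` value from the Xen command line that `xl info` reports. This is only a sketch: the helper name `dom0_mem_max_from_info` is made up, and it assumes the limit is written in the `dom0_mem=max:<size>` form.

```shell
#!/bin/sh
# Sketch only: read `xl info` text on stdin and print the dom0_mem=max value
# from the xen_commandline line, or nothing if it is not set that way.
dom0_mem_max_from_info() {
    grep '^xen_commandline' | tr ' ' '\n' | sed -n 's/^dom0_mem=max:\(.*\)$/\1/p'
}

# Example, in a dom0 terminal:
#   xl info | dom0_mem_max_from_info
#   swapon --show        # prints nothing when no swap is active
```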

One other comment - in my case I’ve tried to reduce the risk of a crash
by drastically changing my work patterns - e.g, I use far fewer qubes at any
time. If you have done the same and think you see an improvement with
some kernel/xen combo try reverting to your old patterns of behaviour
to see what effect that may have.

cheers

Well, since I’m using testing repositories, I’m of no help, but

…nor freezes, after 23 days

I was away for a while, so I only noticed it in mid-September, mostly while updating dom0 with the Qubes Update tool. It happened no more than 5 times in, let’s say, 15 days, but I intentionally stopped updating dom0 (I wrote about this), and that’s maybe why I didn’t face it more often. At the same time I was trying suspend for the first time. It was partially successful: sys-whonix and the VMs related to it, or to Whonix in any way, would freeze, while other VMs would work, even sys-net. So I stopped testing it, since it was faster for me to restart the system than to kill and restart the Whonix-related VMs, because the process would also involve other VMs (their shutdown-idle, for example, would be triggered while waiting for the aforementioned VMs to restart, etc.). Only later did I realize that both my noticing the freezes and crashes and their vanishing coincided with the period in which I was testing suspend, besides the already mentioned dom0 updates…

Throughout, though, I have used the same number of VMs (around 12-18), with 16 GB RAM, zram tools everywhere, no swap, and 1536 MB for dom0.

As I already posted above, my opinion is that updates to dom0 and Xen resolved it, and that it’s not related to kernel because I tried those that worked for others, with no luck.

I am posting my report publicly and will try to improve it. Thanks to everyone working on these issues!

user report

System Details:

[user@dom0 ~]$ qubes-hcl-report 
Qubes release 4.1.1 (R4.1)
Brand:		LENOVO
Model:		2320A9U
BIOS:		CBET4000 Heads-v0.2.0-1154-ga3b058d
Xen:		4.14.5
Kernel:		5.15.64-1
RAM:		16340 Mb
CPU:
  Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
Chipset:
  Intel Corporation 3rd Gen Core processor DRAM Controller [8086:0154] (rev 09)
VGA:
  Intel Corporation 3rd Gen Core processor Graphics Controller [8086:0166] (rev 09) (prog-if 00 [VGA controller])
Net:
  Intel Corporation 82579LM Gigabit Network Connection (Lewisville) (rev 04)
  Qualcomm Atheros AR93xx Wireless Network Adapter (rev 01)
SCSI:
  Samsung SSD 850  Rev: 1B6Q

Noticed instability: not sure when :confused: possibly as far back as the 4.1 install on 20220509

[user@dom0 ~]$ cat Qubes-HCL-LENOVO-2320A9U-20220509-031547.yml
  xen: |
    4.14.3
  kernel: |
    5.10.90-1

1a) best: occasional freeze requiring a hard reset under light test use
worst: freeze requiring a hard reset within an hour or two of startup and network connection

1b) possible triggers:

  • use of the graphical Qubes Update tool seems to coincide with freezes frequently
  • freezes occur most often while updating with a disposable Whonix Tor Browser in use (only 1 or 2 tabs)

1c) kernel/xen combination 5.15.64-1/4.14.5: no freeze during testing for this long:

[user@dom0 ~]$ uptime -p
up 6 days, 8 hours, 47 minutes

Other comments: This test machine sees very light use. The Qubes Update tool and a disposable Whonix Tor Browser are its main uses, so it is no shock that freezes were seen most when both were running. Immediately before a freeze (the display stays visible while frozen), xfce4-sensors-plugin often showed a CPU temperature of 70C. The temperature and fan noise scared me into a hard reset. No freeze yet with 5.15.64-1/4.14.5. Thank you @unman! Thank you Qubes team and community!

I am not getting the overheat message in the log. I have gotten it once in the past, but other than that the logs just “cut out”.

Mine are all crashes per your definition.

Used this machine on Qubes 4.0 for years and all of Qubes 4.1 without any heating issues. Solid hardware, not on the cheap end.

This temperature + fan noise is what I usually encounter.

Previously I have experienced the crashing when I am quite certain I wasn’t overheating but that was some time ago and might not be an issue anymore.

  1. About 3 months ago I started noticing the issues. I think it was with the release of 4.1.1. I can recreate the crash by performing a qvm-copy of a large file (100 GB+) between VMs. With the latest dom0 updates (4.14.5, 5.15.64-1) I crash about 1 in every 10 times I perform qvm-copy; previously it was about 1 in 3. I can tell by monitoring the temperature sensors and listening to the fan whether the system is likely to crash. It either works its way up to around 70-72C and sits there quietly, or it just starts climbing with the fan spinning out of control from the start. As mentioned above, the system has been solid for years.
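
Since the logs tend to cut out before the crash, logging the temperature to disk every few seconds might capture the last reading. This is only a sketch under assumptions: the names `read_temps` and `log_temps` are made up, and it assumes dom0 exposes readings under `/sys/class/thermal/thermal_zone*/temp` in millidegrees Celsius, which may not match what xfce4-sensors-plugin shows on every machine.

```shell
#!/bin/sh
# Sketch only: print one "zone temperature_in_C" line per thermal zone.
# read_temps [DIR]   DIR defaults to /sys/class/thermal (assumed to exist).
read_temps() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        milli=$(cat "$zone/temp")
        # The kernel reports millidegrees; integer division gives whole degrees.
        printf '%s %s\n' "$(basename "$zone")" "$((milli / 1000))"
    done
}

# log_temps FILE [INTERVAL_SECONDS]: append timestamped readings forever and
# sync after each round so the last line has a chance of surviving a crash.
log_temps() {
    while :; do
        read_temps | while read -r zone temp; do
            printf '%s %s %sC\n' "$(date '+%F %T')" "$zone" "$temp"
        done >> "$1"
        sync
        sleep "${2:-5}"
    done
}

# Example, in a dom0 terminal before starting the copy:
#   log_temps "$HOME/temps.log" 5 &
```

After a crash, the tail of the log file shows how the temperature developed up to the last successful write.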

Obviously, if I don’t qvm-copy I can probably avoid crashing for a very long time, but I’m intentionally pushing the system to try and get this issue resolved.

Just a thought: I recently experienced a similar issue on one of my T430 machines. It turned out that the fan had just “worn out”. I replaced it and there was no more overheating.

It’s one of the few mechanical parts remaining, and depending on how long you have used it, it might just not be moving the same amount of air it once did.

I wish you were right, but I’ve got two of them and one has been hardly used.

… and it happened again during an update. Freeze: meaning the machine is entirely unresponsive to any input, but is running until actively turned off.

last lines in log before reboot:
Oct 24 23:19:37 dom0 kernel: Linux version 5.15.64-1.fc32.qubes.x86_64 (mockbuild@4b435ee24c154289bb215bd37e0e5c>
-- Reboot --
Oct 24 14:14:31 dom0 kernel: xen-blkback: backend/vbd/40/51760: using 2 queues, protocol 1 (x86_64-abi)
Oct 24 14:14:31 dom0 kernel: xen-blkback: backend/vbd/40/51744: using 2 queues, protocol 1 (x86_64-abi)
Oct 24 14:14:31 dom0 kernel: xen-blkback: backend/vbd/40/51728: using 2 queues, protocol 1 (x86_64-abi)
Oct 24 14:14:31 dom0 kernel: xen-blkback: backend/vbd/40/51712: using 2 queues, protocol 1 (x86_64-abi)
[user@dom0 ~]$ xl info
host                   : dom0
release                : 5.15.64-1.fc32.qubes.x86_64
version                : #1 SMP Mon Sep 5 04:26:01 CEST 2022
machine                : x86_64
nr_cpus                : 4
max_cpu_id             : 7
nr_nodes               : 1
cores_per_socket       : 4
threads_per_core       : 1
cpu_mhz                : 2793.656
hw_caps                : bfebfbff:77bae3ff:28100800:00000001:00000001:00000281:00000000:00000100
virt_caps              : pv hvm hvm_directio pv_directio hap
total_memory           : 16340
free_memory            : 10487
sharing_freed_memory   : 0
sharing_used_memory    : 0
outstanding_claims     : 0
free_cpus              : 0
xen_major              : 4
xen_minor              : 14
xen_extra              : .5
xen_version            : 4.14.5
xen_caps               : xen-3.0-x86_64 hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler          : credit2
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          :
xen_commandline        : placeholder console=none dom0_mem=min:1024M dom0_mem=max:2048M ucode=scan smt=off gnttab_max_frames=2048 gnttab_max_maptrack_frames=4096 no-real-mode reboot=no vga=current
cc_compiler            : gcc (GCC) 10.3.1 20210422 (Red Hat 10.3.1-1)
cc_compile_by          : mockbuild
cc_compile_domain      : [unknown]
cc_compile_date        : Wed Aug 24 00:00:00 UTC 2022
build_id               : 78083462a31dbc218d043c59c52f9fa65f71bb04
xend_config_format     : 4

It just occurred to me that the above reflects the state after the dom0 update, as I had to re-sign the respective files on boot.

The freeze always happens during update of a template.

Reminds me of “Power Consumption 2-3x after first suspend/resume” (Issue #5210, QubesOS/qubes-issues on GitHub).

People experiencing the cited bug should wait for kernel versions 5.19.17, 6.0.3, 6.1 or above, which contain patches to fix the problem. It is tracked in “Qubes *without* a GUI qube also has issues with granted pages and periodic crashes” (Issue #7664, QubesOS/qubes-issues on GitHub).

Is there an ETA to the wait period?

5.19.17 and 6.0.3 have already been released, but are not packaged for Qubes yet, AFAICS. Last time I checked, the testing repositories contained 6.0.2 as kernel-latest. 6.1 has not been released yet, but the patches are part of 6.1-rc1.

I have no ETA on the packaging, but I guess it won’t be long before 6.0.3 will appear in testing.
