Whonix qubes randomly shut down, especially the gateway

I’ve had qubes randomly shutting down on me lately. So far it’s only been qubes based on Whonix templates, and usually it’s the Whonix gateway. Normally the Whonix workstation won’t run without the gateway booting first, but what’s been happening is that a service goes offline: the workstation is still running, but the gateway has crashed.

I had to make a change to my Xen configuration to get my machine to work with recent versions: No Qubes/VMs boot after latest updates - #32 by scallyob

It’s possible this problem started after making that change.

Here’s the log from the last crash from /var/log/xen/console/guest-sys-whonix.log:

[2024-05-27 13:41:13] [20937.502042] #PF: supervisor read access in kernel mode
[2024-05-27 13:41:13] [20937.503416] #PF: error_code(0x0000) - not-present page
[2024-05-27 13:41:13] [20937.504752] PGD 0 P4D 0 
[2024-05-27 13:41:13] [20937.505434] Oops: 0000 [#1] PREEMPT SMP NOPTI
[2024-05-27 13:41:13] [20937.506609] CPU: 0 PID: 56 Comm: xenbus Not tainted 6.6.29-1.qubes.fc37.x86_64 #1
[2024-05-27 13:41:13] [20937.508455] RIP: 0010:__wake_up_common+0x4c/0x180
[2024-05-27 13:41:13] [20937.509960] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
[2024-05-27 13:41:13] [20937.514811] RSP: 0018:ffffc90000dabdf0 EFLAGS: 00010082
[2024-05-27 13:41:13] [20937.515678] RAX: 0000000000000000 RBX: ffff88802e9f9b98 RCX: 0000000000000000
[2024-05-27 13:41:13] [20937.517510] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88802e9f9b90
[2024-05-27 13:41:13] [20937.519184] RBP: 0000000000000246 R08: ffffffffffffffe8 R09: ffffc90000dabe48
[2024-05-27 13:41:13] [20937.520675] R10: ffff88800d3d6ea8 R11: ffffc9000002d000 R12: ffffc90000dabe48
[2024-05-27 13:41:13] [20937.521637] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[2024-05-27 13:41:13] [20937.522753] FS:  0000000000000000(0000) GS:ffff888018400000(0000) knlGS:0000000000000000
[2024-05-27 13:41:13] [20937.523543] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2024-05-27 13:41:13] [20937.524155] CR2: 0000000000000000 CR3: 0000000006b7a000 CR4: 00000000000406f0
[2024-05-27 13:41:13] [20937.524767] Call Trace:
[2024-05-27 13:41:13] [20937.524946]  <TASK>
[2024-05-27 13:41:13] [20937.525127]  ? __die+0x23/0x70
[2024-05-27 13:41:13] [20937.525397]  ? page_fault_oops+0x98/0x190
[2024-05-27 13:41:13] [20937.525669]  ? exc_page_fault+0x77/0x170
[2024-05-27 13:41:13] [20937.525940]  ? asm_exc_page_fault+0x26/0x30
[2024-05-27 13:41:13] [20937.526215]  ? __wake_up_common+0x4c/0x180
[2024-05-27 13:41:13] [20937.526484]  __wake_up_common_lock+0x82/0xd0
[2024-05-27 13:41:13] [20937.526839]  ? __pfx_xenbus_thread+0x10/0x10
[2024-05-27 13:41:13] [20937.527196]  process_msg+0x18e/0x2f0
[2024-05-27 13:41:13] [20937.527464]  xenbus_thread+0x4a/0x1e0
[2024-05-27 13:41:13] [20937.527732]  ? __pfx_autoremove_wake_function+0x10/0x10
[2024-05-27 13:41:13] [20937.528090]  kthread+0xe8/0x120
[2024-05-27 13:41:13] [20937.528360]  ? __pfx_kthread+0x10/0x10
[2024-05-27 13:41:13] [20937.528629]  ret_from_fork+0x34/0x50
[2024-05-27 13:41:13] [20937.528904]  ? __pfx_kthread+0x10/0x10
[2024-05-27 13:41:13] [20937.529173]  ret_from_fork_asm+0x1b/0x30
[2024-05-27 13:41:13] [20937.529443]  </TASK>
[2024-05-27 13:41:13] [20937.529622] Modules linked in: nf_conntrack_netlink nft_flow_offload nf_flow_table_inet nf_flow_table xen_netback dummy ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xenfs xt_multiport xt_nat xt_owner xt_REDIRECT nft_chain_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic binfmt_misc ghash_clmulni_intel nf_tables sha512_ssse3 sha256_ssse3 nfnetlink xen_netfront sha1_ssse3 xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn fuse loop ip_tables overlay xen_blkfront
[2024-05-27 13:41:13] [20937.533019] CR2: 0000000000000000
[2024-05-27 13:41:13] [20937.533290] ---[ end trace 0000000000000000 ]---
[2024-05-27 13:41:13] [20937.533642] RIP: 0010:__wake_up_common+0x4c/0x180
[2024-05-27 13:41:13] [20937.533998] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
[2024-05-27 13:41:13] [20937.704565] RSP: 0018:ffffc90000dabdf0 EFLAGS: 00010082
[2024-05-27 13:41:13] [20937.705264] RAX: 0000000000000000 RBX: ffff88802e9f9b98 RCX: 0000000000000000
[2024-05-27 13:41:13] [20937.706294] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88802e9f9b90
[2024-05-27 13:41:13] [20937.707311] RBP: 0000000000000246 R08: ffffffffffffffe8 R09: ffffc90000dabe48
[2024-05-27 13:41:13] [20937.708988] R10: ffff88800d3d6ea8 R11: ffffc9000002d000 R12: ffffc90000dabe48
[2024-05-27 13:41:13] [20937.709805] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[2024-05-27 13:41:13] [20937.710685] FS:  0000000000000000(0000) GS:ffff888018400000(0000) knlGS:0000000000000000
[2024-05-27 13:41:13] [20937.712434] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2024-05-27 13:41:13] [20937.713674] CR2: 0000000000000000 CR3: 0000000006b7a000 CR4: 00000000000406f0
[2024-05-27 13:41:13] [20937.714301] Kernel panic - not syncing: Fatal exception
[2024-05-27 13:41:13] [20937.718046] Kernel Offset: disabled

Seems related:

I posted a bit on the GitHub issue.

Since turning off memory balancing and increasing RAM and CPU, I’m still getting crashes every two weeks. It’s still always Whonix-based VMs.
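For anyone wanting to try the same mitigation: in Qubes, memory balancing can be disabled per-qube from dom0 by setting maxmem to 0 and giving the qube a fixed allocation. The values below are just examples, not necessarily what I used:

```shell
# In dom0: disable dynamic memory balancing for sys-whonix
# (maxmem 0 turns off the balancer for this qube),
# then pin a fixed RAM amount and vCPU count.
qvm-prefs sys-whonix maxmem 0
qvm-prefs sys-whonix memory 4000
qvm-prefs sys-whonix vcpus 2
```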

This is a pretty significant problem for me. Not sure if I should try new hardware or what to do.

As this is an ongoing problem with an unknown cause, I created the following Bash script and set it to run as a cron job every hour. Running it hourly keeps my services up regularly enough for my needs.

#!/bin/bash
# Each name gets a leading space so the substring match below
# can't accidentally hit a longer name ending in the same characters
# (e.g. " vm1" won't match "my-vm1").
servers=(" vm1" " vm2" " vm3" " vm4")

statuslist=$(qvm-ls --fields=status,name,state | grep Running)
echo "$statuslist"
for i in "${servers[@]}"; do
	echo "$i"
	if [[ ! $statuslist =~ $i ]]; then
		d=$(date)
		echo "$d: $i not running" >> /home/user/servercheck.log
		qvm-start "${i# }"	# strip the leading space before starting
	fi
done
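For reference, this is the kind of dom0 crontab entry I mean; the script path here is an example, so adjust it to wherever you saved the script:

```shell
# In dom0, run `crontab -e` and add a line like this
# to run the watchdog script at the top of every hour:
0 * * * * /home/user/servercheck.sh
```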