I feel like Qubes has had a hard time shutting down, or restarting, multiple AppVMs at the same time. I think this started around a year ago, but I don’t know exactly when. If I try to shut down two AppVMs at the same time and open the Qubes applet in the system tray (near the clock applet, Wi-Fi applet, etc.), it shows the two AppVMs in red. After about 60 seconds, a warning pops up in dom0 saying one of the AppVMs isn’t responsive, and I can tell it to wait another 60 seconds, kill the AppVM, or ignore further warnings. One of these warnings will generally pop up for each AppVM I was shutting down. Sometimes after I click “wait another 60 seconds”, the AppVM suddenly finishes shutting down. Also, audio for all qubes sometimes stops while dom0 is trying to stop multiple AppVMs at once. I’m not sure why, but perhaps it’s of diagnostic value.
This isn’t a world-ending problem, since I can just start and stop AppVMs separately, but when I first started using Qubes this wasn’t an issue. I’m curious whether other people have noticed this behavior or if it’s just me. If it’s just me, it might be related to disabling swap in the AppVMs (though I gave them a lot of virtual RAM, so I wouldn’t expect that to be the issue), or it could be something else entirely. I’m open to suggestions.
You can check the log in `/var/log/xen/console/guest-VMNAME.log`, or connect to the qube’s console before shutting it down using the `qvm-console-dispvm VMNAME` command in a dom0 terminal, and see where it gets stuck during shutdown.
You can also check the CPU usage (using `top` and `xentop`) and disk usage (using `iotop`/`htop`/etc.) in dom0 when shutting down the qubes.
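A minimal sketch of those checks in dom0 terminals (`VMNAME` is a placeholder for the actual qube name; these commands only make sense inside dom0):

```shell
# Follow the guest console log while the qube shuts down.
# VMNAME is a placeholder for your qube's name.
tail -f /var/log/xen/console/guest-VMNAME.log

# In separate dom0 terminals, watch resource usage during the shutdown:
sudo xentop    # per-domain CPU/memory usage as seen by Xen
sudo iotop -o  # only processes currently doing disk I/O
```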
I took your suggestion to look through the log files. For convenience I generated a new incident by starting two AppVMs, and then asked both to reboot as quickly as I could via the Qubes Manager gui. As usual, the AppVMs took a long time to reboot, and the warning in dom0 about waiting 60 seconds popped up.
I don’t have a lot of experience parsing `/var/log/xen/console/guest-VMNAME.log` files, but the timestamps are interesting. For AppVM 1, there is a 54-second gap between the “System halted” timestamp and the “Logfile Opened” timestamp. For AppVM 2, the gap is 70 seconds. I assume “System halted” is the last recorded event when shutting down for a reboot, and “Logfile Opened” is the first recorded event as the system comes back up. Before “System halted” there are lots of events with timestamps within a few seconds of each other, and the same is true after “Logfile Opened”.
This implies to me that maybe the problem is in dom0 as opposed to the AppVMs?
AppVM 1 log file:

```
[2024-10-16 21:22:29] [ 470.295190] reboot: System halted
[2024-10-16 21:23:23] Logfile Opened
[2024-10-16 21:23:24] [ 0.000000] Linux version 6.6.48-1.qubes.fc37.x86_64
```

AppVM 2 log file:

```
[2024-10-16 21:22:25] [ 475.776852] reboot: System halted
[2024-10-16 21:23:35] Logfile Opened
[2024-10-16 21:23:36] [ 0.000000] Linux version 6.6.48-1.qubes.fc37.x86_64
```
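For reference, the gaps can be double-checked with GNU `date` using the timestamps quoted above (a quick sketch, nothing Qubes-specific):

```shell
# Seconds between "System halted" and the next "Logfile Opened",
# using the timestamps quoted from the AppVM 1 log above.
halted="2024-10-16 21:22:29"
opened="2024-10-16 21:23:23"
echo $(( $(date -d "$opened" +%s) - $(date -d "$halted" +%s) ))  # prints 54
```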
The `qvm-console-dispvm VMNAME` command popped up a disposable VM terminal with events logged in it. When I restarted the qube to watch the terminal output, it scrolled by too fast for me to follow what was going on.
In the race condition issue ticket, it says the problem is sporadic, happening maybe 1 in 20 times. Whatever is happening with my Qubes install is much more easily reproducible. So on the surface, either my problem is different, or the issue has worsened enough since Qubes 3.2 that it’s very reproducible now, or my problem has a similar root cause to the previous issues but is much more exaggerated in my case. I’m not a programmer, so my insight is limited. If solving my issue can be generalized to help others, that would certainly be ideal!
Yes, it’s somewhere in dom0.
Check the dom0 log using `journalctl` to see what happens during the qubes’ reboot.
Also check the CPU and disk usage in dom0 during qubes reboot.
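One way to slice the dom0 journal around a reboot attempt (a sketch; the unit names are the usual Qubes dom0 services, and the time window is an example to be replaced with your own incident's timestamps):

```shell
# dom0: show qubesd and libvirt activity around the slow shutdown.
# The --since/--until window is an example; use your own timestamps.
sudo journalctl -u qubesd -u libvirtd \
    --since "2024-10-16 21:22:00" --until "2024-10-16 21:24:00"

# Or follow everything live while reproducing the problem:
sudo journalctl -f
```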
I looked through `journalctl` on dom0, but I didn’t see anything that looked like a smoking gun. That said, I also don’t know Linux that well, nor what to look for in `journalctl`… I’ve decided for now it’s not a deal-breaker and will live with it.
Increase the `shutdown_timeout` property for the mentioned AppVMs (or any VM) and let them shut down gracefully. The command is `qvm-prefs VMNAME shutdown_timeout SECONDS`. I believe the default is 60 seconds. I have many VMs with a 480-second shutdown timeout, on a (relatively) slow computer under heavy load.
Also increase `qrexec_timeout` if you have trouble starting multiple AppVMs together.
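A sketch of both changes in a dom0 terminal (the qube name `work` and the 480/120 values are just examples, not recommendations):

```shell
# dom0: give the qube more time to shut down gracefully (default is 60 s).
qvm-prefs work shutdown_timeout 480

# dom0: give qrexec more time to come up when many qubes start at once.
qvm-prefs work qrexec_timeout 120

# Read back the current values to verify:
qvm-prefs work shutdown_timeout
qvm-prefs work qrexec_timeout
```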
Storage is an NVMe SSD. I’ve been running Qubes on this hardware for a couple of years, so it’s unlikely to be a hardware problem. It feels more like a software or config issue to me.
It turns out I’m totally fine emotionally with waiting for AppVMs to restart, as long as the “it’s been 60 seconds, what should I do?” pop-up doesn’t appear. Thank you!
I’m a little hesitant to try disk trimming without backing up my whole system first. I’m interested in trying this, but it will have to wait for a less busy weekend for me.