I feel like Qubes has had a hard time shutting down, or restarting, multiple AppVMs at the same time. I think this started around a year ago, but I don’t know exactly when. If I try to shut down two AppVMs at the same time and open the Qubes applet in the system tray (near the clock applet, Wi-Fi applet, etc.), it shows the two AppVMs in red. After about 60 seconds, a warning pops up in dom0 saying one of the AppVMs isn’t responsive, and I can tell it to wait another 60 seconds, kill the AppVM, or ignore further warnings. One of these warnings will generally pop up for each AppVM I was shutting down. Sometimes after I click “wait another 60 seconds”, the AppVM suddenly finishes shutting down. Also, audio for all qubes sometimes stops while dom0 is trying to stop multiple AppVMs at once. I’m not sure why, but perhaps it’s of diagnostic value.
This isn’t a world-ending problem, since I can just start and stop AppVMs separately, but when I first started using Qubes this wasn’t an issue. I’m curious whether other people have noticed this behavior or if it’s just me. If it’s just me, it might be related to disabling swap in the AppVMs (though I gave them a lot of virtual RAM, so I wouldn’t expect that to be the issue), or it could be something else entirely. I’m open to suggestions.
You can check the log in `/var/log/xen/console/guest-VMNAME.log`, or connect to the qube’s console before shutting it down using the `qvm-console-dispvm VMNAME` command in a dom0 terminal, and see where it gets stuck during shutdown.
You can also check the CPU usage (using `top` and `xentop`) and disk usage (using `iotop`/`htop`/etc.) in dom0 when shutting down the qubes.
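A minimal sketch of those checks in dom0 terminals (`VMNAME` is a placeholder for the actual qube name; these commands only make sense inside dom0):

```shell
# Follow the guest console log while the qube shuts down.
# VMNAME is a placeholder for your qube's name.
tail -f /var/log/xen/console/guest-VMNAME.log

# In separate dom0 terminals, watch resource usage during the shutdown:
sudo xentop    # per-domain CPU/memory usage as seen by Xen
sudo iotop -o  # only processes currently doing disk I/O
```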
I took your suggestion to look through the log files. For convenience I generated a new incident by starting two AppVMs, and then asked both to reboot as quickly as I could via the Qubes Manager gui. As usual, the AppVMs took a long time to reboot, and the warning in dom0 about waiting 60 seconds popped up.
I don’t have a lot of experience parsing `/var/log/xen/console/guest-VMNAME.log` files, but the timestamps are interesting. For AppVM 1, there is a 54-second gap between the “System halted” timestamp and the “Logfile Opened” timestamp. For AppVM 2, the gap is 70 seconds. I assume “System halted” is the last recorded event when shutting down for a reboot, and “Logfile Opened” is the first recorded event as the system comes back up. Before “System halted” there are lots of events with timestamps within a few seconds of each other, and the same is true after “Logfile Opened”.
This implies to me that maybe the problem is in dom0 as opposed to the AppVMs?
AppVM 1 log file:

```
[2024-10-16 21:22:29] [ 470.295190] reboot: System halted
[2024-10-16 21:23:23] Logfile Opened
[2024-10-16 21:23:24] [ 0.000000] Linux version 6.6.48-1.qubes.fc37.x86_64
```

AppVM 2 log file:

```
[2024-10-16 21:22:25] [ 475.776852] reboot: System halted
[2024-10-16 21:23:35] Logfile Opened
[2024-10-16 21:23:36] [ 0.000000] Linux version 6.6.48-1.qubes.fc37.x86_64
```
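For reference, the gaps can be double-checked with GNU `date` using the timestamps quoted above (a quick sketch, nothing Qubes-specific):

```shell
# Seconds between "System halted" and the next "Logfile Opened",
# using the timestamps quoted from the AppVM 1 log above.
halted="2024-10-16 21:22:29"
opened="2024-10-16 21:23:23"
echo $(( $(date -d "$opened" +%s) - $(date -d "$halted" +%s) ))  # prints 54
```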
The `qvm-console-dispvm VMNAME` command popped up a disposable VM terminal with events logged in it. When I restarted the qube to watch the terminal output, it scrolled by too fast for me to follow what was going on.
In the race condition issue ticket, it says the problem is sporadic, happening maybe 1 in 20 times. Whatever is happening with my Qubes install is much more easily reproducible. So on the surface, either my problem is different, or the issue has worsened enough since Qubes 3.2 that it’s very reproducible now, or my problem has a similar root cause to the previous issues but is much more exaggerated in my case. I’m not a programmer, so my insight is limited. If solving my issue can be generalized to help others, that would certainly be ideal!
Yes, it’s somewhere in dom0.
Check the dom0 log using `journalctl` to see what happens during the qubes’ reboot.
Also check the CPU and disk usage in dom0 during qubes reboot.
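One way to slice the dom0 journal around a reboot attempt (a sketch; the unit names are the usual Qubes dom0 services, and the time window is an example to be replaced with your own incident's timestamps):

```shell
# dom0: show qubesd and libvirt activity around the slow shutdown.
# The --since/--until window is an example; use your own timestamps.
sudo journalctl -u qubesd -u libvirtd \
    --since "2024-10-16 21:22:00" --until "2024-10-16 21:24:00"

# Or follow everything live while reproducing the problem:
sudo journalctl -f
```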
I looked through `journalctl` on dom0, but I didn’t see anything that looked like a smoking gun. That said, I also don’t know Linux that well, nor what to look for in `journalctl`… I’ve decided for now it’s not a deal-breaker and will live with it.
Increase the `shutdown_timeout` property for the mentioned AppVMs (or any VM) and let them shut down gracefully. The command is `qvm-prefs VMNAME shutdown_timeout SECONDS`. I believe the default is 60 seconds. I have many VMs with a 480-second shutdown timeout, on a (relatively) slow computer under heavy load.
Also increase `qrexec_timeout` if you have trouble starting multiple AppVMs together.
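A sketch of both changes in a dom0 terminal (the qube name `work` and the 480/120 values are just examples, not recommendations):

```shell
# dom0: give the qube more time to shut down gracefully (default is 60 s).
qvm-prefs work shutdown_timeout 480

# dom0: give qrexec more time to come up when many qubes start at once.
qvm-prefs work qrexec_timeout 120

# Read back the current values to verify:
qvm-prefs work shutdown_timeout
qvm-prefs work qrexec_timeout
```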
Storage is an NVMe SSD. I’ve been running Qubes on this hardware for a couple of years, so it’s unlikely to be a hardware problem. It feels more like a software or config issue to me.
It turns out I’m totally fine emotionally with waiting for AppVMs to restart, as long as the “it’s been 60 seconds, what should I do?” pop-up doesn’t appear. Thank you!
I’m a little hesitant to try disk trimming without backing up my whole system first. I’m interested in trying this, but it will have to wait for a less busy weekend for me.