Very slow shutdown of large VMs in 4.2 (even with BTRFS)

This is a fresh install of QubesOS 4.2. It works great so far, but shutting down large VMs (e.g. 700 GB or more) takes ~10 minutes. With 4.1, the same VMs shut down in well under 1 minute on exactly the same hardware.

I’m already using BTRFS (see also this comment) and I’ve already tried to set “revisions_to_keep” to “0” for those VMs but it didn’t make a difference.

I didn’t have any issues with this with 4.1 and BTRFS.

Any suggestions would be highly appreciated, thanks!

FYI: Did a fresh install with default options. No BTRFS this time. 800GB vault. Didn’t change revisions to 0. Shutdowns are slower than with BTRFS at ~10-15 seconds.

You can always check the logs with journalctl.
Is it a networked vm?
If it’s not too late you could try a fresh installation with default options.

Thanks, @Johnboy! It would be strange, though, if the default options now worked better than the previously recommended BTRFS. I would prefer not to re-install again.

It’s a completely offline VM and also not a disposable VM.

I did some more testing and this is what I’ve seen:

  • The current duration is actually 2.5 minutes
  • By observing the “Total_LBAs_Written” on the SSD (using the command “smartctl” and an online calculator), I can see that about 12 GB are written to the SSD during that time (at least it’s not writing those ~700 GB to the disk - but still strange that it even writes that much to the disk)
  • Looking at the “journalctl” output, it seems that the processing of several versions of “private.img” takes some time. But I don’t fully understand those entries yet.
Dez 26 00:34:02 dom0 qubesd[4492]: Removed file: '/var/lib/qubes/appvms/data/volatile-dirty.img'
Dez 26 00:34:02 dom0 qubesd[4492]: Removed file: '/var/lib/qubes/appvms/data/root-dirty.img'
Dez 26 00:34:02 dom0 qubesd[4492]: Removed file: '/var/lib/qubes/appvms/data/root.img'
Dez 26 00:34:32 dom0 qubesd[4492]: Reflinked file: '/var/lib/qubes/appvms/data/private.img' -> '/var/lib/qubes/appvms/data/private.img.7@2023-12-25T22:43:32Z~31r9tbao'
Dez 26 00:34:44 dom0 qubesd[4492]: Renamed file: '/var/lib/qubes/appvms/data/private.img.7@2023-12-25T22:43:32Z~31r9tbao' -> '/var/lib/qubes/appvms/data/private.img.7@2023-12-25T22:43:32Z'
Dez 26 00:35:17 dom0 qubesd[4492]: Removed file: '/var/lib/qubes/appvms/data/private.img.6@2023-12-25T16:36:08Z'
Dez 26 00:35:51 dom0 qubesd[4492]: Renamed file: '/var/lib/qubes/appvms/data/private-dirty.img' -> '/var/lib/qubes/appvms/data/private.img'
Dez 26 00:36:22 dom0 qubesd[4492]: Reflinked file: '/var/lib/qubes/appvms/data/private.img' -> '/var/lib/qubes/appvms/data/private-precache.img~alqwige5'
Dez 26 00:36:27 dom0 qubesd[4492]: Renamed file: '/var/lib/qubes/appvms/data/private-precache.img~alqwige5' -> '/var/lib/qubes/appvms/data/private-precache.img'
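To see which of these steps takes the time, the gaps between consecutive journal entries can be computed. The following is only a minimal sketch: the function name `step_gaps` is made up for illustration, it parses just the clock time of each line (so it assumes the excerpt does not cross midnight), and it assumes that the gap before an entry roughly corresponds to the duration of that step.

```python
import re
from datetime import datetime

# Captures the clock time and the qubesd message from one journal line.
LINE_RE = re.compile(r"(\d{2}:\d{2}:\d{2}) dom0 qubesd\[\d+\]: (.*)")


def step_gaps(lines):
    """Return (seconds since previous entry, message) for each qubesd line."""
    gaps = []
    previous = None
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a qubesd line
        # Only HH:MM:SS is parsed, so the excerpt must not cross midnight.
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        seconds = 0 if previous is None else int((stamp - previous).total_seconds())
        gaps.append((seconds, match.group(2)))
        previous = stamp
    return gaps


# Three consecutive lines copied from the excerpt above.
SAMPLE = [
    "Dez 26 00:34:02 dom0 qubesd[4492]: Removed file: '/var/lib/qubes/appvms/data/root.img'",
    "Dez 26 00:34:32 dom0 qubesd[4492]: Reflinked file: '/var/lib/qubes/appvms/data/private.img' -> '/var/lib/qubes/appvms/data/private.img.7@2023-12-25T22:43:32Z~31r9tbao'",
    "Dez 26 00:34:44 dom0 qubesd[4492]: Renamed file: '/var/lib/qubes/appvms/data/private.img.7@2023-12-25T22:43:32Z~31r9tbao' -> '/var/lib/qubes/appvms/data/private.img.7@2023-12-25T22:43:32Z'",
]

for seconds, message in step_gaps(SAMPLE):
    print(f"{seconds:>4}s  {message[:60]}")
```

Applied to the whole excerpt, the gaps before the six entries that touch private.img are 30, 12, 33, 34, 31 and 5 seconds, 145 seconds in total, which matches the observed ~2.5 minutes. The time is spread fairly evenly over the reflink, rename and remove steps rather than sitting in a single operation.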

Any suggestions on what to further check/monitor here?
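For the SSD write measurement mentioned above, the online calculator can be replaced by a few lines of code. This is a rough sketch under a loud assumption: it treats one LBA as 512 bytes, which is common but not universal (some drives report “Total_LBAs_Written” in other units, so check the vendor documentation). The two counter values are hypothetical, chosen only to reproduce the ~12 GB figure.

```python
def written_gb(lbas_before, lbas_after, bytes_per_lba=512):
    """Convert a difference of two Total_LBAs_Written readings to gigabytes.

    bytes_per_lba=512 is an assumption; some drives use a different unit.
    """
    return (lbas_after - lbas_before) * bytes_per_lba / 1e9


# Hypothetical readings taken before and after shutting down the VM.
before = 1_000_000_000
after = 1_023_437_500

print(f"{written_gb(before, after):.1f} GB written during shutdown")
```

Taking one reading right before the shutdown and one right after it completes keeps unrelated writes from other qubes out of the result as far as possible.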

Fragmentation? It can be checked with

sudo filefrag /var/lib/qubes/appvms/data/private.img
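To compare fragmentation before and after a few shutdown cycles, the extent count can be pulled out of the filefrag summary line. This is a sketch only: it assumes the usual summary format (“<file>: N extents found”), the helper names are made up, and the sample number is invented for illustration, not a measurement from this thread.

```python
import re
import subprocess

# Matches the summary line filefrag prints, e.g. "<file>: 123 extents found".
SUMMARY_RE = re.compile(r":\s*(\d+) extents? found")


def parse_extent_count(filefrag_output):
    """Return the extent count from filefrag output, or None if not found."""
    match = SUMMARY_RE.search(filefrag_output)
    return int(match.group(1)) if match else None


def extent_count(path):
    """Run filefrag on path (needs read access, e.g. as root in dom0)."""
    result = subprocess.run(
        ["filefrag", path], capture_output=True, text=True, check=True
    )
    return parse_extent_count(result.stdout)


# Invented sample output, for illustration only.
sample = "/var/lib/qubes/appvms/data/private.img: 48213 extents found"
print(parse_extent_count(sample))
```

A count that keeps growing from one shutdown to the next would point towards fragmentation, while a roughly stable count would suggest looking elsewhere.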

To defragment, see Boot time of 25 minutes due to fragmentation of monero blockchain [solved] - #17 by rustybird (but adjusted with the Btrfs specific step at the end of that thread)

I did a fresh reinstall with btrfs now.
I experience the same extremely long shutdown times, and even long bootup times, for my vault.
The overall system experience feels more snappy and interactive with btrfs (strange) than with default partitions/formatting.
With the default installation I ended up with shutdown times of 5-7 seconds.
Now with btrfs it feels as if all 800GB are being written for the revision and it takes 10-20 minutes and the system is unresponsive/not usable at that time.
I will keep testing if it swings to 5-7s as with default installation, and will keep testing with revisions=0.

Extremely wild guess, but would you try

I don’t have a swap partition at all.
Setting revisions_to_keep=0 helps a lot, but shutdown still takes minutes, compared with 5-7 seconds with default partitioning and revisions enabled.
Personally, I will probably switch back to default partitioning.

Thanks, @Johnboy for testing and confirming the issue on your side.

Did you experience the reduced snappiness again when you moved back to the default partitions/formatting? Can you describe that part a bit more? (I’m also considering moving to the defaults.)

How do you re-install your system so quickly with so much data? Don’t you have to manually copy everything back after the installation? (I’m a bit concerned about putting too much strain on my SSD with such re-installs.)

On my side: I’ve also set revisions=0 and, after a reboot, both the startup and the shutdown of my larger qubes take about 2 minutes.

@rustybird Thanks, I’ll look into the defragmentation topic. It looks promising, but it would be strange if that were indeed the cause, given that @Johnboy was able to replicate the behavior on his side.

The snappiness was probably just due to the new appmenu, which seems a bit slow and laggy. The overall performance is as good as on btrfs.
Editing a single file in my vault and shutting it down afterwards takes ~15 seconds. It’s not as good as with btrfs on R4.1 but way better than minutes (with revisions_to_keep=0) or even 20 minutes.

I did a backup before upgrading/reinstalling qubes. Copying 1 TB takes a bit of time but my external backup HDD can read up to 130 MB/s. And the performance with btrfs was just unbearable. I’m more concerned with swap settings regarding my nvme lifetime than installing R4.2 three times.
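As a rough sanity check on “a bit of time”, the copy duration can be estimated from the numbers above. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a sustained rate equal to the 130 MB/s peak, so the result is a lower bound rather than a prediction.

```python
def transfer_hours(size_bytes, rate_bytes_per_s):
    """Best-case transfer time in hours at a constant rate."""
    return size_bytes / rate_bytes_per_s / 3600


# 1 TB of data at the drive's peak read rate of 130 MB/s (decimal units assumed).
hours = transfer_hours(1e12, 130e6)
print(f"about {hours:.1f} hours per copy")
```

So each backup or restore pass takes a bit over two hours at best, and longer whenever the HDD falls below its peak rate.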

You probably need to restart your larger qubes 1-2 more times. The more data changes for the revision, the longer it takes. And if you’re concerned about your SSD lifespan, you definitely want to reinstall with default partitioning (unless a patch arrives in the near future).

Thanks a lot to @Johnboy, @tempmail and @rustybird for your suggestions!

I finally did a reinstall of QubesOS 4.2 with the default options from the installer (no BTRFS) and this makes a huge difference! Before the re-install I had shutdown times of over 1 minute for a ~700 GB VM. Now with the default options, I have shutdown times of about 6 seconds.
It seems to me that fragmentation was not the issue in my case.

Conclusion: The default options seem to be the better choice these days instead of BTRFS.