Speeding up VM startups

So I’ve been observing really slow VM startup on my supposedly pretty fast setup (12th gen i9 CPU with DDR5 RAM). To put some numbers on it:
I measured 25 seconds from the moment I press “Start Firefox in dispvm” to actually seeing a Firefox window appear on screen. Very annoying, and somewhat embarrassing considering various cloud providers have this down to a science and apparently manage sub-second VM startup times.

So I have decided to try and track at least some of it down and I guess the findings might be of interest to others here.

First of all I decided to see why the kernel takes so long to start up. Checking the messages (sudo dmesg | less in a VM) I discovered that there’s wasteful (CPU-time wise) clearing of already-clear VM pages. I filed a ticket: Wasteful memory clearing at start of every (standard) qube kernel · Issue #8228 · QubesOS/qubes-issues · GitHub
Add the kernel option init_on_free=off to your VM kernel options with something like this (this overwrites the existing options, but I imagine nobody has any set nowadays anyway):

qvm-prefs -s YOUR-VM-NAME kernelopts "init_on_free=off"

If you do it to a disposable template, then disposables will inherit this option too. It shaves precious seconds off the VM boot time: I measured 3+ seconds of savings on an 8th gen i7 CPU with the default 4G maxmem qube. Slower CPUs will benefit more than faster ones.
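
If you already have custom kernelopts, check them first and append rather than overwrite; you can also confirm inside the VM that the option took effect. A rough sketch (the VM name is a placeholder):

# dom0: check existing options before overwriting them
qvm-prefs YOUR-VM-NAME kernelopts
qvm-prefs -s YOUR-VM-NAME kernelopts "your-existing-options init_on_free=off"

# inside the VM, after a restart: the option should show up on the kernel command line
cat /proc/cmdline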

Next is this:

[    2.327810] xen:balloon: Waiting for initial ballooning down having finished.
[    3.058150] xen:balloon: Initial ballooning down finished.

So it looks like if you uncheck the “include in memory balancing” option, this initial ballooning isn’t done and you get the corresponding savings, at the expense of the qube using a fixed amount of RAM.
You basically skip the balloon driver’s work and save the time its init takes. Somebody could look into what that init actually does and optimize it, I guess. Again, slower CPUs benefit more here, but other factors seem to have a big impact too. Free Xen RAM, maybe? Compare this run:

[    1.309210] xen:balloon: Waiting for initial ballooning down having finished.
[    7.153167] xen:balloon: Initial ballooning down finished.
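
For those who prefer the command line to the settings GUI: as far as I know, setting maxmem to 0 disables memory balancing for a qube and pins it to its fixed “memory” value. A sketch (pick your own size):

# dom0: disable memory balancing; the qube then always gets exactly its 'memory' amount
qvm-prefs -s YOUR-VM-NAME maxmem 0
qvm-prefs -s YOUR-VM-NAME memory 4000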

Now after all that is done, you can observe that it takes another 1-2 seconds (again, slower CPUs = longer; the 8th gen i7 takes about two seconds) to run the initrd, from the time we see the init message in the kernel log to the time we see the systemd message:

[    7.160447] Run /init as init process
[    8.707752] systemd[1]: systemd 253.4-1.fc38 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)

Part of this is detection of block devices and other init, but another easy way to shave a bit of time (especially on slower CPUs) is to uncompress the initrd.
That shaves another half a second or so on my 8th gen i7:

[    5.461441] Run /init as init process
[    6.611702] systemd[1]: systemd 253.4-1.fc38 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
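
For reference, this is roughly how I’d do the decompression. Treat it as a sketch: it assumes the VM kernel’s initramfs lives under /var/lib/qubes/vm-kernels/ in dom0 and is a single gzip- or zstd-compressed cpio; the kernel version in the path is made up, so check yours and keep a backup.

# dom0: decompress the VM initramfs in place (kernel version is hypothetical)
cd /var/lib/qubes/vm-kernels/6.1.35-1.fc37
cp initramfs initramfs.orig
file initramfs.orig                    # see which compressor was used
zcat initramfs.orig > initramfs        # gzip case; use zstd -d if it's zstd
# a plain, uncompressed cpio archive boots just fine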

And after that you can run systemd-analyze to make sure the rest of the init takes less than a second. In the case of a Fedora dispvm it most likely does not: it probably spends more than 18 seconds in userspace init, and if you run systemd-analyze blame you’ll discover that the unbound-anchor service takes a whopping 17 seconds. This service does some DNSSEC-related init that in the majority of cases isn’t really important, so disable it in most of your templates (obviously perform your own risk assessment, don’t trust me) and you’ve just shaved another 17 seconds off. Alternatively you can run your dvm template with network access every few days so its cache stays current.
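
Something like the following in the template should do it; the unit names are from memory and may differ between Fedora releases, so check the blame output for the exact name first:

# inside the template (shut it down afterwards so disposables pick up the change)
systemd-analyze blame | head           # confirm the offending unit's name
sudo systemctl disable --now unbound-anchor.timer
sudo systemctl mask unbound-anchor.service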

This brings the startup time inside the VM pretty low for me, but the externally observable time is still slow (~25 → ~15 seconds from button press to the Firefox window opening), so I guess I’ll now have to look at the Xen/dom0 side to see what’s going on there. Ideally I really want to reach sub-5 seconds to a Firefox window as a first step.


I also have a 12th gen i9, but with DDR4 memory.

Without memory balancing, it takes 9 sec to open a disposable Firefox with the window showing. With memory balancing, it’s 12 sec.

What type of drives do you use?

Gen4 NVMe (WD Black 850 or some such; unscientific read speed tests with dd show 2-4 GB/sec)

I just did a trial run (don’t have the kernel options and stuff, just the anchor service disabled) and it’s still kinda bad:

[root@dom0 ~]# time qvm-run --dispvm fedora-37-dvm-no-dnssec date
Running 'date' on $dispvm:fedora-37-dvm-no-dnssec

real	0m34.429s
user	0m0.047s
sys	0m0.043s

I realize this time also includes shutdown. Meanwhile, inside the disposable:

[root@disp3083 ~]# systemd-analyze 
Startup finished in 4.778s (kernel) + 1.910s (userspace) = 6.688s 
multi-user.target reached after 1.901s in userspace.

so the time is clearly spent elsewhere.
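
To separate the startup from the shutdown that qvm-run waits for, you can timestamp the first output instead of timing the whole command. A rough sketch (assumes GNU date and awk in dom0):

# dom0: measure time to first output from the disposable, ignoring its shutdown
start=$(date +%s.%N)
qvm-run -p --dispvm fedora-37-dvm-no-dnssec date | {
    read -r line
    end=$(date +%s.%N)
    echo "first output after $(awk "BEGIN {print $end - $start}")s: $line"
}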

Yeah, just thought that some part of the DVM initialization could be IO intensive.

I tried running the same command on my system, but I only have Debian templates:

$ time qvm-run --dispvm debian-11-dvm date
Running 'date' on $dispvm:debian-11-dvm

real	0m12.473s
user	0m0.037s
sys	0m0.021s
$ sudo systemd-analyze
Startup finished in 2.272s (kernel) + 3.110s (userspace) = 5.382s 
multi-user.target reached after 2.111s in userspace

Even with a Debian template I cannot reproduce these numbers, even if I reduce the dvm memory to 400M and disable memory balancing/the balloon:

[root@dom0 ~]# time qvm-run --dispvm debian-11-dvm date
Running 'date' on $dispvm:debian-11-dvm

real	0m23.038s
user	0m0.076s
sys	0m0.060s
[root@dom0 ~]# time qvm-run --dispvm debian-11-dvm date
Running 'date' on $dispvm:debian-11-dvm

real	0m22.366s
user	0m0.073s
sys	0m0.057s

even though

Startup finished in 1.312s (kernel) + 1.974s (userspace) = 3.286s 
multi-user.target reached after 1.859s in userspace

and I have 12G of free Xen RAM, so it did not have to ask any existing VMs to free anything before starting.
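
(In case anyone wants to check the same thing on their box, free Xen memory is visible from dom0:)

# dom0: how much memory Xen still has unassigned
xl info | grep free_memory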

You might be interested in [1].

I just added a sample of one of my machines as comparison [2].

IIRC Xen ballooning cannot be disabled as it’s an essential part of the Xen memory management.

[1] GitHub - 3hhh/qubes-performance: Analyze Qubes OS VM startup performance.
[2] qubes-performance/t530_debian-11_01.txt at master · 3hhh/qubes-performance · GitHub


When you uncheck the “include in memory balancing” box in the settings you essentially disable the driver. It’s still loaded, but it no longer takes any memory away from the VM (which apparently also happens at boot and can take substantial time, as you can see in the kernel messages), so the VM’s memory stays fixed.
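
One way to sanity-check this, treating the paths and commands as a sketch from memory: compare the qube’s current and configured memory from dom0, and look at the balloon target the guest kernel exposes, if that sysfs node is present.

# dom0: current vs. configured memory for the qube
xl list YOUR-VM-NAME
qvm-prefs YOUR-VM-NAME memory
qvm-prefs YOUR-VM-NAME maxmem

# inside the VM: the balloon driver's current target (path may vary)
cat /sys/devices/system/xen_memory/xen_memory0/target_kb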

I did see the qubes-performance analyzer, but the necessity to install random stuff in dom0 gives me a bit of pause. I am looking through the code to see what’s being analyzed and how; not all of it is obvious (systemd-analyze is obvious and I can just run it by hand, which I did). I only have one fast machine of this kind with Qubes, so unfortunately I cannot just set up a throwaway instance, install random tools and then wipe it all clean. I do have two ThinkPad T480s where I could do that, but they actually spin up VMs faster as is, which is at least somewhat perplexing.
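
For the dom0 side you can pull a startup timeline out of the journal by hand without installing anything; something along these lines (the filter pattern is just my guess at what’s relevant):

# dom0: high-resolution view of recent qube-start activity
journalctl -b -o short-precise --since "10 minutes ago" | grep -E 'qubesd|xen/scripts/block|lvm'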

Or we could totally change the way DVMs work: we could always keep a “hot standby” dvm that is available immediately, and launch a new standby as soon as it gets used. This would be a HUGE speed-up!
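
You can approximate this today with a named disposable that you start ahead of time and then just attach applications to; a sketch (the qube name, label and dvm template are made up):

# dom0: create and pre-start a named disposable, then use it on demand
qvm-create --class DispVM --template fedora-37-dvm --label red disp-standby
qvm-start disp-standby
# later, launching Firefox in it skips the whole VM boot
qvm-run disp-standby firefox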


I’ve been interested in this idea for a while but never explored it. Looking into it, I see that Sven has already made a guide for it :star_struck:


It’s a tradeoff though: you’ll have to reserve RAM that could otherwise have been used by other VMs. Also, unless you completely pause the “hot standby” one, it’ll waste some amount of CPU on whatever periodic tasks run in there (case in point: I have a dispvm started 2.5 days ago, with a file manager and xterm windows open, and using xentop I can see it consistently uses 1.5% of a CPU while idle; it has accumulated about an hour of runtime since then). The latter is bad for laptops on battery power in particular.

In contrast, having lightning-fast startup has basically no downsides, just benefits.
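
(The idle-CPU numbers above come from xentop; batch mode gives a quick non-interactive snapshot if you want to check your own standby qube:)

# dom0: one snapshot of per-domain CPU and memory usage
xentop -b -i 1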

So after checking those scripts some more, it looks like they would not pinpoint the slowness in my case beyond saying it’s there (something I can observe myself). But looking at the log, there’s a very strange ~10-second gap in the startup:

May 29 15:36:46.977992 dom0 qubesd[3758]: vm.debian-11-minimal: Starting debian-11-minimal
May 29 15:36:47.349070 dom0 lvm[2018]: No longer monitoring thin pool qubes_dom0-vm--pool-tpool.
May 29 15:36:47.386985 dom0 lvm[2018]: Monitoring thin pool qubes_dom0-vm--pool-tpool.
May 29 15:36:47.626892 dom0 lvm[2018]: No longer monitoring thin pool qubes_dom0-vm--pool-tpool.
May 29 15:36:47.644128 dom0 lvm[2018]: Monitoring thin pool qubes_dom0-vm--pool-tpool.
May 29 15:36:50.715335 dom0 root[559016]: /etc/xen/scripts/block: add XENBUS_PATH=backend/vbd/154/51728
May 29 15:36:50.720942 dom0 root[559014]: /etc/xen/scripts/block: add XENBUS_PATH=backend/vbd/154/51744
May 29 15:36:50.723868 dom0 root[559015]: /etc/xen/scripts/block: add XENBUS_PATH=backend/vbd/154/51712
May 29 15:36:50.754018 dom0 root[559037]: /etc/xen/scripts/block: add XENBUS_PATH=backend/vbd/154/51760
May 29 15:36:52.908808 dom0 root[559359]: /etc/xen/scripts/block: Writing backend/vbd/154/51728/physical-device fd:af to xenstore.
May 29 15:36:52.962172 dom0 root[559361]: /etc/xen/scripts/block: Writing backend/vbd/154/51728/physical-device-path /dev/dm-175 to xenstore.
May 29 15:36:53.014788 dom0 root[559363]: /etc/xen/scripts/block: Writing backend/vbd/154/51728/hotplug-status connected to xenstore.
May 29 15:36:54.696658 dom0 root[559632]: /etc/xen/scripts/block: Writing backend/vbd/154/51712/physical-device fd:b5 to xenstore.
May 29 15:36:54.725539 dom0 root[559634]: /etc/xen/scripts/block: Writing backend/vbd/154/51712/physical-device-path /dev/dm-181 to xenstore.
May 29 15:36:54.752961 dom0 root[559636]: /etc/xen/scripts/block: Writing backend/vbd/154/51712/hotplug-status connected to xenstore.
May 29 15:36:56.268136 dom0 root[559902]: /etc/xen/scripts/block: Writing backend/vbd/154/51744/physical-device fd:97 to xenstore.
May 29 15:36:56.275945 dom0 root[559904]: /etc/xen/scripts/block: Writing backend/vbd/154/51744/physical-device-path /dev/dm-151 to xenstore.
May 29 15:36:56.312579 dom0 root[559906]: /etc/xen/scripts/block: Writing backend/vbd/154/51744/hotplug-status connected to xenstore.
May 29 15:36:56.573040 dom0 kernel: loop20: detected capacity change from 0 to 952304
May 29 15:36:56.592760 dom0 root[559931]: /etc/xen/scripts/block: Writing backend/vbd/154/51760/node /dev/loop20 to xenstore.
May 29 15:36:56.650861 dom0 root[559935]: /etc/xen/scripts/block: Writing backend/vbd/154/51760/physical-device 7:14 to xenstore.
May 29 15:36:56.705671 dom0 root[559937]: /etc/xen/scripts/block: Writing backend/vbd/154/51760/physical-device-path /dev/loop20 to xenstore.
May 29 15:36:56.743555 dom0 root[559939]: /etc/xen/scripts/block: Writing backend/vbd/154/51760/hotplug-status connected to xenstore.
May 29 15:36:56.986521 dom0 qubesd[3758]: vm.debian-11-minimal: Setting Qubes DB info for the VM
May 29 15:36:56.986957 dom0 qubesd[3758]: vm.debian-11-minimal: Starting Qubes DB

Compared to my much slower laptop, the gap is not there. Now to find what causes it, I guess.
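
To spot such gaps without eyeballing timestamps, the journal can be piped through a small awk filter that flags lines following a long pause; a sketch (the 1-second threshold is arbitrary):

# dom0: print journal lines that follow a pause of more than one second
journalctl -b -o short-unix | awk '$1 > prev + 1 && prev { printf "%.1fs gap before: %s\n", $1 - prev, $0 } { prev = $1 }'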

But we don’t have lightning-fast startup anyway :slight_smile:

Just an idea: have you checked your CPU frequencies? On my system I found that the default driver wasn’t able to use Turbo Boost at all, hobbling the system to less than 50% of its theoretical performance (also considering that SMT is off). I had to switch to hwp as the driver; see this thread for details on hwp and Alder Lake CPU performance as well as core pinning.
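
(On Qubes it’s Xen rather than the dom0 kernel that drives the P-states, so the quickest check I know of is xenpm in dom0; whether hwp is available depends on your Xen version and command line.)

# dom0: show the cpufreq scaling driver, governor and current frequencies
xenpm get-cpufreq-para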