So I’ve been observing really slow VM startup on my supposedly pretty fast setup (12th gen i9 CPU with DDR5 RAM). To put some numbers on it:
I measured 25 seconds from the moment I press “Start firefox in dispvm” to actually seeing a Firefox window appear on the screen. That's very annoying, and frankly embarrassing considering that various cloud providers appear to have this down to a science, with sub-second VM startup times.
So I have decided to try and track at least some of it down and I guess the findings might be of interest to others here.
First of all I decided to see why the kernel takes so long to start up. Checking the messages (sudo dmesg | less in a VM), I discovered that there’s wasteful (CPU-time wise) clearing of already-clear VM pages. I filed a ticket: Wasteful memory clearing at start of every (standard) qube kernel · Issue #8228 · QubesOS/qubes-issues · GitHub
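If you'd rather not eyeball timestamps in less, here is a rough sketch of a helper that prints the biggest gaps between consecutive kernel messages. The name dmesg_gaps is made up for this post (it is not a standard tool), and it assumes the default dmesg format where every line starts with a [seconds.micros] timestamp.

```shell
# dmesg_gaps [N]: read dmesg output on stdin and print the N largest
# gaps (default 10) between consecutive timestamped lines.
# Each output row is: gap in seconds, a tab, then the message that
# was logged at the END of that gap.
dmesg_gaps() {
    awk '
        # Extract the numeric timestamp from a "[   1.234567] ..." line.
        function ts(line,   s) {
            s = line
            sub(/^\[ */, "", s)
            sub(/\].*/, "", s)
            return s + 0
        }
        /^\[ *[0-9]+\.[0-9]+\]/ {
            now = ts($0)
            if (seen) printf "%.3f\t%s\n", now - prev, $0
            prev = now
            seen = 1
        }
    ' | sort -rn | head -n "${1:-10}"
}
```

Inside the VM, something like sudo dmesg | dmesg_gaps 5 would list the five largest gaps. A big gap only tells you where the time went, not why, so you still need to read the surrounding messages.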
Add the kernel option init_on_free=off to your VM's kernel options with something like this (note that this overwrites any existing options, but I imagine nobody has them set nowadays anyway):
qvm-prefs -s YOUR-VM-NAME kernelopts "init_on_free=off"
If you do it to a dispvm template, then disposables will inherit this option too. It shaves precious seconds off the VM boot time: I measured 3+ seconds of savings on a gen8 i7 CPU using the default 4G maxmem qube. Slower CPUs will benefit more than faster ones.
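If you do have kernel options set and don't want them overwritten, a minimal sketch along these lines appends the option instead. The function names add_kernelopt and set_init_on_free_off are hypothetical helpers, and the sketch assumes that qvm-prefs VMNAME kernelopts prints the current value in dom0. Note that writing the value back makes it an explicit setting rather than the default.

```shell
# add_kernelopt CURRENT NEW: print CURRENT with NEW appended,
# unless NEW is already one of the space-separated options.
add_kernelopt() {
    ko_current=$1
    ko_new=$2
    case " $ko_current " in
        *" $ko_new "*) printf '%s\n' "$ko_current" ;;
        *) printf '%s\n' "${ko_current:+$ko_current }$ko_new" ;;
    esac
}

# set_init_on_free_off VMNAME: run in dom0. Reads the qube's current
# kernelopts and writes them back with init_on_free=off added.
set_init_on_free_off() {
    ko_vm=$1
    ko_opts=$(qvm-prefs "$ko_vm" kernelopts) || return 1
    qvm-prefs -s "$ko_vm" kernelopts "$(add_kernelopt "$ko_opts" "init_on_free=off")"
}
```

Usage would be set_init_on_free_off YOUR-VM-NAME in a dom0 terminal. Running it twice is harmless, since the option is only added when missing.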
Next is this:
[ 2.327810] xen:balloon: Waiting for initial ballooning down having finished.
[ 3.058150] xen:balloon: Initial ballooning down finished.
So it looks like if you exclude the qube from memory balancing (i.e. untick the “include in memory balancing” option), the balloon isn't used and you get the corresponding savings, at the expense of the qube using a fixed amount of RAM.
So you basically don’t load this balloon driver and get the savings from its init. Somebody could look into what the init does and actually optimize it, I guess. Again, slower CPUs benefit more here, but other factors seem to have a big impact too (the amount of free Xen RAM, maybe?):
[ 1.309210] xen:balloon: Waiting for initial ballooning down having finished.
[ 7.153167] xen:balloon: Initial ballooning down finished.
Once all of that is done, it takes another 1-2 seconds to run the initrd (again, slower CPUs take longer; the gen8 i7 CPU takes two seconds), measured from the init message in the kernel log to the first systemd message:
[ 7.160447] Run /init as init process
[ 8.707752] systemd: systemd 253.4-1.fc38 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Part of this is detection of block devices and other init, but another easy way to shave off a bit of time (esp. on slower CPUs) is to use an uncompressed initrd, so the kernel doesn't have to decompress it at boot.
That shaves another half a second or so on my 8th gen i7:
[ 5.461441] Run /init as init process
[ 6.611702] systemd: systemd 253.4-1.fc38 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
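To compare before/after numbers like these without doing the subtraction by hand, here is a small sketch that measures the time between two kernel log messages. dmesg_interval is a made-up helper name, both arguments are matched as plain substrings (not regexes), and it assumes the usual [seconds.micros] timestamp prefix.

```shell
# dmesg_interval START END: read dmesg output on stdin and print the
# seconds elapsed between the first line containing START and the
# next line containing END. Prints nothing if either is missing.
dmesg_interval() {
    awk -v start_pat="$1" -v end_pat="$2" '
        # Extract the numeric timestamp from a "[   1.234567] ..." line.
        function ts(line,   s) {
            s = line
            sub(/^\[ */, "", s)
            sub(/\].*/, "", s)
            return s + 0
        }
        !started && index($0, start_pat) { t0 = ts($0); started = 1; next }
        started && index($0, end_pat)    { printf "%.3f\n", ts($0) - t0; exit }
    '
}
```

For the initrd stage that would be sudo dmesg | dmesg_interval "Run /init as init process" "running in system mode", and the same helper works for the ballooning wait with "Waiting for initial ballooning" and "Initial ballooning down finished".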
After that you can run systemd-analyze to make sure the rest of the init takes less than a second. In the case of a Fedora dispvm it most likely does not: it probably takes more than 18 seconds in userspace init, and if you run systemd-analyze blame you’ll discover that the unbound-anchor service is taking a whopping 17 seconds. This service does some DNSSEC-related init, and in the majority of cases it’s not really important, so disable it in most of your templates (obviously perform your own risk assessment, don’t trust me) and you’ve just shaved another 17 seconds off the time. Alternatively, you can run your dvm template with network access every few days so the cache stays current.
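The blame list can get long, so here is a rough filter that keeps only the units above a given number of seconds. slow_units is a hypothetical helper name, and the parsing assumes the time format I've seen in systemd-analyze blame output (for example 17.012s, 250ms or 1min 2.345s before the unit name); durations of an hour or more are not handled.

```shell
# slow_units SECONDS: read `systemd-analyze blame` output on stdin and
# print "total_seconds unit" for every unit that took at least SECONDS.
slow_units() {
    awk -v limit="$1" '
        {
            total = 0
            # Every field except the last is a time token; awk turns
            # "250ms" into 250 when the token is used as a number.
            for (i = 1; i < NF; i++) {
                if ($i ~ /min$/)     total += $i * 60
                else if ($i ~ /ms$/) total += $i / 1000
                else if ($i ~ /us$/) total += $i / 1000000
                else if ($i ~ /s$/)  total += $i
            }
            if (total >= limit) printf "%.1f %s\n", total, $NF
        }
    '
}
```

In the VM that would be systemd-analyze blame | slow_units 1. If you do decide to get rid of the offender, one option is sudo systemctl mask followed by the unit name exactly as blame reports it, run inside the template; masking keeps the unit from starting no matter what triggers it, so the risk assessment caveat applies.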
This brings startup inside the VM pretty low for me, but the externally observable time is still slow (roughly 25 → 15 seconds from button press to the Firefox window opening), so I guess I’ll now have to look at the Xen/dom0 side to see what’s up there. Ideally I really want to reach sub-5 seconds for the Firefox window as a first step.