Freshly created VMs fail with "qrexec-daemon startup failed"

Recently, my fedora-32 (still have to upgrade that one) template developed a strange issue.
The TemplateVM itself starts just fine, and VMs based on it that were created a long time ago also run perfectly fine.

However, if I create a new VM based on the fedora-32 template, it won’t start and instead shows the well-known error message:

Domain test-f32-issue has failed to start: qrexec-daemon startup failed: Connection to the VM failed

/var/log/xen/console/guest-test-f32-issue.log does not show anything obvious. The startup process runs until about 4 seconds in, and the last line reads:

input: PC Speaker as /devices/platform/pcspkr/input/input0

I compared this with the log of a VM that starts up fine. There, the following lines come next (roughly summarized):

- Initialising Xen virtual ethernet driver
- Found device /dev/hvc0
- Found device /dev/xvdc1
- Activating swap
- Mounting Temporary Directory
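
The comparison above can also be done mechanically. This is a minimal sketch, assuming both console logs are readable in dom0 and that kernel lines carry the usual "[ seconds ]" timestamp prefix. The function names strip_ts and missing_in_bad are made up for illustration.

```shell
# Strip a leading kernel timestamp such as "[    3.912345] " from each line.
strip_ts() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] ?//' "$1"
}

# Print lines of the working VM's log that never appear in the failing
# VM's log, keeping the order in which they occur in the working log.
missing_in_bad() {
    good_log=$1
    bad_log=$2
    tmp=$(mktemp) || return 1
    strip_ts "$bad_log" | sort -u > "$tmp"
    # -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file
    strip_ts "$good_log" | grep -Fxv -f "$tmp"
    rm -f "$tmp"
}
```

For example: missing_in_bad /var/log/xen/console/guest-work.log /var/log/xen/console/guest-test-f32-issue.log | head, where guest-work.log stands for the log of whichever VM starts fine. The first few lines printed are a rough hint at where the failing boot stops, although lines containing VM-specific values will show up as well.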

Since the problem happens both with and without networking, I would guess that the ethernet driver is not the issue. Could it be that something with setting up the file system is wrong? Why would that happen only in new VMs? Could it have something to do with the private disk space unique to each VM?

It would be great if anyone had some insights into how I could debug this. I would prefer not to throw away the template because I have customized it over time.

Edit: In the meantime, I upgraded the template to fedora-33 following the usual update instructions. This did not change anything (behavior as previously described).

However, I noticed the following: If I create a new AppVM, it cannot start with my custom fedora template. As soon as I switch it over to any other template, it starts up without any issues. If I then switch back to fedora, this one suddenly works as well. This works even if I use another distro for the initial startup, such as Debian 10.
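
For anyone who wants to reproduce the workaround, here is a rough sketch of the steps as a dom0 shell function. The function names and the DRY_RUN switch are made up for illustration, and the template names in the example call are placeholders. It only uses qvm-prefs, qvm-start and qvm-shutdown, and it assumes the new VM is halted, since the template can only be changed while a VM is not running.

```shell
# With DRY_RUN set, print the command instead of executing it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Boot a new qube once on a known-good template, then switch it back.
bootstrap_qube() {
    vm=$1        # the new AppVM that fails to start
    custom=$2    # the customized template it should end up on
    other=$3     # any template that is known to work
    run qvm-prefs "$vm" template "$other" &&
    run qvm-start "$vm" &&
    run qvm-shutdown --wait "$vm" &&
    run qvm-prefs "$vm" template "$custom"
}
```

Running DRY_RUN=1 bootstrap_qube test-f32-issue fedora-33 debian-10 only prints the four commands, which is a safe way to check them before running them for real. This works around the symptom and does not explain the cause.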

It seems that there is something that needs to be set up once in order for the VM to get into a permanently working state.

I’ve seen this before when my template ran out of disk space or was very low on it, and possibly when the default 2GB private volume had gotten into a bad state.

Also try increasing the qrexec_timeout for the AppVM that is timing out, e.g. qvm-prefs badappvm qrexec_timeout 300, in case there’s a long-running resize-rootfs job.

I already had disk space in mind but that shouldn’t be an issue.
The disk usage widget in dom0 shows that there is more than 200GB of space left in the LVM, so it’s definitely not about absolute disk space.

The template in question is defined with

  • 30720MB of System storage, of which 17GB are used according to df -h inside the template.
  • 10240MB of Private storage, of which 541MB are used.

The only thing I can see is that for some reason, I increased the default private storage size of the template itself to 10GB instead of the default 2GB. However, as the capacity is not used I don’t see how this could be relevant.
Nevertheless, I did a quick test with a clean (downloaded from the repository) fedora template, increased its private storage and created an AppVM from it. This worked without any issues.

The qrexec timeout is also not the culprit. The error message appears much sooner than the 60-second timeout (roughly 5 seconds after starting the VM). To be sure, I tried creating an AppVM with increased timeout values, but that didn’t change anything either.
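
To put a number on "much sooner", the start attempt can be timed. This is a small sketch with a made-up helper name (elapsed_seconds) that runs any command and reports its exit status and wall-clock duration in whole seconds.

```shell
# Run a command and report its exit status and how long it took.
elapsed_seconds() {
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    echo "exit=$status elapsed=$((end - start))s"
}
```

In dom0 this could be used as qvm-prefs test-f32-issue qrexec_timeout 300 followed by elapsed_seconds qvm-start test-f32-issue. If the failure still shows up after a few seconds, the timeout value is not the limiting factor.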

Thanks for your ideas @icequbes1, but this did not help :confused: .

Still living with my not-so-nice workaround.
Anyone else having any idea how I could fix/debug the underlying issue?

The first thing I do when I have trouble with something in Qubes is search qubes-issues. In your case I’d search for “qrexec-daemon startup failed”. Don’t forget to look at the closed issues as well.

If you don’t find a similar issue (and a possible solution), I’d try a fresh template via sudo qubes-dom0-update qubes-template-fedora-33

Also, I’d think about keeping a virgin template and cloning it before installing any software in it. This way you can always go back to the fresh template and clone it again when something breaks or doesn’t work anymore or just for testing purposes.

Like @icequbes1 said before, I have also had trouble with different qubes when dom0 was running out of space. In the Disk Space Monitor you should find varlibqubes and dom0 (I don’t know the exact terms). If you haven’t changed the default setting, it should be 20GB and not 200GB. Are you sure you have 200GB left in dom0?
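
A quick way to check this from a dom0 terminal, as a sketch: the helper below (nearly_full is a made-up name) reads df -P output on stdin and prints every mount point at or above a given usage percentage. It assumes mount points and device names contain no spaces.

```shell
# Read `df -P` output on stdin and print each mount point whose usage
# is at or above the given percentage, together with that percentage.
nearly_full() {
    threshold=$1
    awk -v t="$threshold" 'NR > 1 {
        use = $5                 # Capacity column, e.g. "95%"
        sub(/%/, "", use)
        if (use + 0 >= t + 0) print $6 " " $5
    }'
}
```

For example: df -P / /var/lib/qubes | nearly_full 90. On an LVM thin pool install, sudo lvs should also list Data% and Meta% for the pool, and a full metadata volume can cause trouble even when plenty of data space is free.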

I had a similar issue on Qubes 4.0.4…
My disk ran out of space (filled up to 99.9%)…and wouldn’t start any VMs that were not already running (error msg = qrexec-daemon startup failed)

I freed up space (down to 85%)…BUT still none of my VMs would boot
The issue persisted after reboot…and now not even my usb-net HVM (using a USB wifi adaptor) or even the template VMs would boot (same error as above)…

Next I tried setting my usb-net HVM to “run in debug mode” (via qube settings)
The terminal window showed no errors and it booted fine…
closed it, unchecked debug mode…all VMs now working as before without issue

not sure how to explain this,
thanks to everyone! esp developers, u guys make my hero list :smiley: