Cannot Connect to Qrexec Agent

Trying to boot a StandaloneVM but getting an error message:

qrexec-daemon.c:1214.main: qrexec-agent has disconnected domain dead
qrexec-daemon.c:1106:handle_agent_restart: cannot connect to qrexec agent: No such process

I also get this error when trying to start any domain based on fedora-37 or archlinux. dom0 works, as do any domains based on debian-11. This is after the latest updates to dom0, and while trying to start the updates to the fedora-37-based templates. Interestingly, sys-usb starts, but sys-net and any other domain based on fedora-37 or its clones fails with the error messages above. I have tried rebooting several times without success. qvm-console-dispvm fails with the same error. Maybe I can try using a debian-11-based dispvm? (edit: it doesn’t get that far before failing).

The actual message from /var/log/qubes/qrexec.sys-net.log:
2023-05-24 13:46:32.818 qrexec-daemon[11145]: qrexec-daemon.c:340:init: cannot connect to qrexec agent: No such file or directory
(hopefully no typos when copying it).

Now how do I get out of this state without reinstalling?

That’s interesting, in my case I had no issue with sys-net or sys-firewall. I think I will submit a GitHub issue for the developers.

Do you see anything in /var/log/xen/console/guest-<qubename>.log ?

That is an insanely long log file (/var/log/xen/console/guest-fedora-37.log)… Looks like a boot log at max verbosity (which is probably good for this issue…). It ends with ‘Reached target Network’ and ‘Reached target nss-lookup.service’.
In the failed-to-start case (the Fedora-37 template specifically), I get many logged lines of “Dependency failed for qubes-qrexec-agent.service” and “Job qubes-qrexec-agent.service/start failed with result ‘dependency’”. Anything specific I should look for?
I really hate that systemd never seems to log what caused the failure, when it mostly has that information…

Oh, this is Qubes-4.1 current updates.
Hmmm, I made a clone of fedora-37 when I installed it, so that I could install some more applications. That one is now working…

Are you saying it’s a fedora-37 issue which started after the dom0 4.1 updates?

yes. (archlinux problems started earlier, though)

What is the number of vcpu and ram you allocated to the qube that doesn’t work?

What is the first thing (or first few) that failed there?

The laptop has 4 physical CPU cores, with disabled hyperthreading. Like a lot of the qubes, it is left at the defaults: 400/4000MB and 2 vCPUs. I left the templates at defaults. Just a reminder, this is the Fedora-37 template, and any qubes based on it also show the same error.

BTW, using kernel default 5.15.89-1.fc32.

Getting lots of fails that I wouldn’t expect (haven’t compared to a working qube yet). The first fail: “Fast TSC calibration failed” (and why the heck is Modem Manager even installed on these templates? of course it will fail…), then “systemd-random-seed.service: control process exited”, “Dependency failed for sysinit.target”, “Dependency failed for dbus.socket”, “Dependency failed for systemd-logind.service”, “Dependency failed for multi-user.target”. grep finds 356 lines with “fail” in one boot attempt, compared to 29 on a successful boot.

The first logged line of qrexec failure is: “Dependency failed for qubes-qrexec-agent.service”

Kinda looks like something failed early that everything else depends on. I wonder if something got corrupted during an earlier update?
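
For anyone doing the same triage, here is a rough Python sketch of the grep step above. It counts the lines mentioning "fail" and then hides the "Dependency failed" fallout so the lines that might be the actual cause stand out. The function name and the sample list are made up for illustration; the sample lines are abbreviated from messages quoted in this thread, not a real log excerpt. Unlike a plain grep for "fail", the match here is case-insensitive.

```python
import re

# Lines that only report a knock-on failure, not the original cause.
# Both straight and curly quotes are accepted around 'dependency'.
DEPENDENCY_NOISE = re.compile(
    r"Dependency failed for|failed with result ['‘]dependency['’]"
)


def summarize_failures(lines, limit=5):
    """Return (total, causes).

    total  -- number of lines containing "fail" (case-insensitive)
    causes -- the first `limit` of those lines that are not pure
              dependency fallout
    """
    failing = [ln.rstrip("\n") for ln in lines if "fail" in ln.lower()]
    causes = [ln for ln in failing if not DEPENDENCY_NOISE.search(ln)]
    return len(failing), causes[:limit]


# Hypothetical sample, abbreviated from messages quoted in this thread.
sample = [
    "Fast TSC calibration failed",
    "systemd-random-seed.service: control process exited",
    "Dependency failed for sysinit.target",
    "Dependency failed for qubes-qrexec-agent.service",
    "Job qubes-qrexec-agent.service/start failed with result 'dependency'",
    "Failed to open /usr/lib/systemd/system/qubes-db.service: Bad message",
]

total, causes = summarize_failures(sample)
print(total)
for line in causes:
    print(line)
```

To run it on a real log in dom0, pass an open file instead of the sample list, for example open("/var/log/xen/console/guest-fedora-37.log", errors="replace"). Note that a line like the systemd-random-seed one does not contain "fail" at all, so it is worth also reading a few lines before the first hit.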

This looks to be related, and as you identified, that’s an early service that (indirectly) a lot of others depend on. Do you see any other message about random seed around this line? Or some failures about “qubes-db” or “Qubes DB”? Maybe you can upload the log of the whole failed boot (if you don’t have any sensitive info there)?

systemd-random-seed.service is a service that loads an on-disk random seed into the kernel entropy pool during boot and saves it at shutdown

When loading the random seed from disk, the file is immediately updated with a new seed retrieved from the kernel,
in order to ensure no two boots operate with the same random seed

This is extra weird: “Failed to open /usr/lib/systemd/system/qubes-db.service: Bad message”
guest-fedora-37.log (1.1 MB)

Yeah, it is probably faster for someone knowledgeable to look at the whole log…

thanks!!!

Could you kindly post a few more lines if you don’t mind?

[2023-05-25 13:41:37] [    4.221858] EXT4-fs error (device xvda3): ext4_lookup:1836: inode #138753: comm systemd: iget: checksum invalid
[2023-05-25 13:41:37] [    4.242921] EXT4-fs error (device xvda3): ext4_lookup:1836: inode #138753: comm systemd: iget: checksum invalid
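
Since "EXT4-fs error" is the string that mattered here, a small Python sketch for pulling those lines out of a long console log may help. The regular expression is written only against the two lines above, so treat the field layout as an assumption; other ext4 error formats may not match. The function name is made up for illustration.

```python
import re

# Pattern modelled on the two kernel lines quoted above (an assumption:
# other EXT4-fs error variants may be laid out differently).
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"(?P<function>\w+):(?P<line>\d+): "
    r"inode #(?P<inode>\d+): comm (?P<comm>.+?): (?P<message>.*)"
)


def ext4_errors(lines):
    """Return unique (device, inode, message) tuples in order of appearance."""
    seen = []
    for ln in lines:
        m = EXT4_ERROR.search(ln)
        if m:
            key = (m["device"], int(m["inode"]), m["message"].strip())
            if key not in seen:
                seen.append(key)
    return seen


# The two lines posted above, unchanged.
sample = [
    "[2023-05-25 13:41:37] [    4.221858] EXT4-fs error (device xvda3): "
    "ext4_lookup:1836: inode #138753: comm systemd: iget: checksum invalid",
    "[2023-05-25 13:41:37] [    4.242921] EXT4-fs error (device xvda3): "
    "ext4_lookup:1836: inode #138753: comm systemd: iget: checksum invalid",
]

for device, inode, message in ext4_errors(sample):
    print(device, inode, message)
```

Repeated reports for the same inode collapse into one entry, which makes it easier to see how many distinct inodes are affected.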

This looks like a corrupted filesystem. There could be an automatic fsck call, but it seems we remount the root fs read-write too early.

[    4.507349] systemd[1]: systemd-fsck-root.service - File System Check on Root Device was skipped because of a failed condition check (ConditionPathIsReadWrite=!/).

Maybe using the in-VM kernel and initramfs will make it call fsck? Try setting the VM kernel to “pvgrub2-pvh”, or if you don’t have this option, then to “provided by qube” and change the virtualization mode to HVM.
If that doesn’t make fsck work, then you’ll probably need to run fsck manually. See here how to attach a qube’s volume to a disposable qube (since you have debian working, you can probably use a disposable based on debian).
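
The kernel switch can also be done from a dom0 terminal with qvm-prefs instead of the settings dialog. The Python sketch below only builds and prints the commands, it does not run anything. The property names (kernel, virt_mode) and the empty kernel value meaning "provided by qube" are my assumption of how the GUI options map to qvm-prefs, and the helper name is made up, so double-check against your own system before using them.

```python
import shlex


def in_vm_kernel_commands(qube, have_pvgrub2_pvh=True):
    """Build (not run) dom0 commands that switch `qube` to its in-VM kernel.

    Assumption: an empty `kernel` property corresponds to the
    "provided by qube" option in the settings dialog.
    """
    if have_pvgrub2_pvh:
        return [["qvm-prefs", qube, "kernel", "pvgrub2-pvh"]]
    return [
        ["qvm-prefs", qube, "kernel", ""],        # "provided by qube"
        ["qvm-prefs", qube, "virt_mode", "hvm"],  # virtualization mode HVM
    ]


# Print the commands for review; nothing is executed here.
for argv in in_vm_kernel_commands("fedora-37", have_pvgrub2_pvh=False):
    print(shlex.join(argv))
```

Printing rather than executing is deliberate: these change how the template boots, so it seems safer to read them first and note the current values so they can be put back afterwards.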

I thought it might be corruption somewhere… I just didn’t search for the correct thing.

I had to manually FSCK from a disposable qube, and that worked!

thanks much marmarek!

Now to figure out why archlinux boots and all subsequent qrexec’s time out… (another day)