Cannot Connect to Qrexec Agent

huaopeng · May 24, 2023, 8:45pm

Trying to boot a StandaloneVM but getting an error message:

qrexec-daemon.c:1214.main: qrexec-agent has disconnected domain dead
qrexec-daemon.c:1106:handle_agent_restart: cannot connect to qrexec agent: No such process

tenchiki · May 24, 2023, 9:00pm

I also get this error when trying to start any domain based on fedora-37 or archlinux. dom0 works, as do any domains based on debian-11. This started after the latest updates to dom0, while trying to start the updates to the fedora-37-based templates. Interestingly, sys-usb starts, but sys-net and any other domains based on fedora-37 or its clones fail with the error messages above. I have tried rebooting several times without success. qvm-console-dispvm fails with the same error. Maybe I can try using a debian-11-based dispvm? (edit: it doesn’t get that far before failing).

The actual message from /var/log/qubes/qrexec.sys-net.log:
2023-05-24 13:46:32.818 qrexec-daemon[11145]: qrexec-daemon.c:340:init: cannot connect to qrexec agent: No such file or directory
(hopefully no typos when copying it).

Now how do I get out of this state without reinstalling?

huaopeng · May 24, 2023, 11:48pm

That’s interesting, in my case I had no issue with sys-net or sys-firewall. I think I will submit a GitHub issue for the developers.

DVM · May 25, 2023, 7:17pm

Do you see anything in /var/log/xen/console/guest-<qubename>.log ?

tenchiki · May 25, 2023, 9:12pm

That is an insanely long log file (/var/log/xen/console/guest-fedora-37.log)… Looks like boot log on max verbosity (which is probably good for this issue…). It ends with ‘Reached target Network’ and ‘Reached target nss-lookup.service’.
In the failed to start case (Fedora-37 template specifically), I get many lines logged of “Dependency failed for qubes-qrexec-agent.service” and “Job qubes-qrexec-agent.service/start failed with result ‘dependency’”. Anything specific I should look for?
I really hate that systemd never seems to log what caused the failure, when it mostly has that information…

Oh, this is Qubes-4.1 current updates.
Hmmm, I made a clone of fedora-37 when I installed it to install some more applications. that one is now working…

huaopeng · May 25, 2023, 9:29pm

Are you saying it’s a fedora-37 issue that started after the dom0 4.1 updates?

tenchiki · May 25, 2023, 11:23pm

yes. (archlinux problems started earlier, though)

DVM · May 25, 2023, 11:43pm

What is the number of vcpu and ram you allocated to the qube that doesn’t work?

marmarek · May 25, 2023, 11:51pm

What is the first thing (or first few) that failed there?

tenchiki · May 25, 2023, 11:54pm

The laptop has 4 physical CPU cores, with disabled hyperthreading. Like a lot of the qubes, it is left at the defaults: 400/4000MB and 2 vCPUs. I left the templates at defaults. Just a reminder, this is the Fedora-37 template, and any qubes based on it also show the same error.

tenchiki · May 26, 2023, 12:16am

BTW, using kernel default 5.15.89-1.fc32.

Getting lots of fails that I wouldn’t expect (haven’t compared to a working qube yet). The first fail: “Fast TSC calibration failed”, (why the heck is Modem Manager even installed on these templates? of course it will fail…), “systemd-random-seed.service: control process exited”, “Dependency failed for sysinit.target”, “Dependency failed for dbus.socket”, “Dependency failed for systemd-logind.service”, “Dependency failed for multiuser.target”. grep finds 356 lines with “fail” in one boot attempt, compared to 29 on a successful boot.

The first logged line of qrexec failure is: “Dependency failed for qubes-qrexec-agent.service”

Kinda looks like something failed early that everything else depends on. I wonder if something got corrupted during an earlier update?
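
For anyone wanting to repeat the comparison above (count of lines containing “fail”, plus the first few of them), here is a minimal Python sketch. The function name summarize_failures and the first_n parameter are made up for illustration, and it assumes you can read the console log as plain text (for example /var/log/xen/console/guest-fedora-37.log in dom0):

```python
import re
from typing import Iterable, List, Tuple

# Case-insensitive match, roughly what `grep -i fail` would find.
FAIL_RE = re.compile(r"fail", re.IGNORECASE)


def summarize_failures(lines: Iterable[str], first_n: int = 5) -> Tuple[int, List[str]]:
    """Return (number of lines mentioning 'fail', the first `first_n` such lines).

    Hypothetical helper, not part of any Qubes tool.
    """
    count = 0
    first: List[str] = []
    for line in lines:
        if FAIL_RE.search(line):
            count += 1
            if len(first) < first_n:
                first.append(line.rstrip("\n"))
    return count, first
```

To use it on a real log, open the file (with errors="replace", since console logs can contain odd bytes) and pass the file object in. The earliest matches are usually the interesting ones, since later “Dependency failed” lines tend to be knock-on effects.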

marmarek · May 26, 2023, 11:41am

This looks to be related, and as you identified, that’s an early service that (indirectly) a lot of others depend on. Do you see any other message about random seed around this line? Or some failures about “qubes-db” or “Qubes DB”? Maybe you can upload a log of the whole failed boot (if you don’t have any sensitive info there)?

huaopeng · May 26, 2023, 4:19pm

systemd-random-seed.service is a service that loads an on-disk random seed into the kernel entropy pool during boot and saves it at shutdown.

When loading the random seed from disk, the file is immediately updated with a new seed retrieved from the kernel, in order to ensure no two boots operate with the same random seed.

tenchiki · May 26, 2023, 6:10pm

This is extra weird: “Failed to open /usr/lib/systemd/system/qubes-db.service: Bad message”
guest-fedora-37.log (1.1 MB)

Yeah, it is probably faster for someone knowledgeable to look at the whole log…

thanks!!!

huaopeng · May 26, 2023, 6:15pm

Could you kindly post a few more lines if you don’t mind?

marmarek · May 26, 2023, 7:46pm

[2023-05-25 13:41:37] [    4.221858] EXT4-fs error (device xvda3): ext4_lookup:1836: inode #138753: comm systemd: iget: checksum invalid
[2023-05-25 13:41:37] [    4.242921] EXT4-fs error (device xvda3): ext4_lookup:1836: inode #138753: comm systemd: iget: checksum invalid

This looks like a corrupted filesystem. There could be an automatic fsck call, but it seems we remount the root fs read-write too early:

[    4.507349] systemd[1]: systemd-fsck-root.service - File System Check on Root Device was skipped because of a failed condition check (ConditionPathIsReadWrite=!/).

Maybe using the in-vm kernel and initramfs will make it call fsck? Try setting the VM kernel to “pvgrub2-pvh”, or if you don’t have this option, then to “provided by qube” and change the virtualization mode to HVM.
If that doesn’t make fsck work, then you’ll probably need to run fsck manually. See here how to attach a qube’s volume to a disposable qube (since you have debian working, you can probably use a disposable based on debian).
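
The kernel lines quoted above are easy to miss in a 1 MB console log, so here is a rough Python sketch for picking them out. It assumes the errors follow the same shape as the two lines quoted in this post; the function name find_ext4_errors and the regular expression are illustrative only and will not match every EXT4-fs message format:

```python
import re
from typing import Iterable, Set, Tuple

# Matches lines shaped like:
#   EXT4-fs error (device xvda3): ext4_lookup:1836: inode #138753: comm systemd: ...
EXT4_ERR_RE = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"(?P<func>\w+):(?P<line>\d+): inode #(?P<inode>\d+):"
)


def find_ext4_errors(lines: Iterable[str]) -> Set[Tuple[str, int]]:
    """Return the distinct (device, inode) pairs reported in EXT4-fs errors.

    Hypothetical helper, not part of any Qubes tool.
    """
    found: Set[Tuple[str, int]] = set()
    for line in lines:
        match = EXT4_ERR_RE.search(line)
        if match:
            found.add((match.group("device"), int(match.group("inode"))))
    return found
```

A non-empty result only tells you which device to look at; it does not repair anything. The actual fix is still fsck on that volume while it is not mounted.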

tenchiki · May 26, 2023, 9:38pm

I thought it might be corruption somewhere… I just didn’t search for the correct thing.

I had to manually FSCK from a disposable qube, and that worked!

thanks much marmarek!

Now to figure out why archlinux boots and all subsequent qrexec calls time out… (another day)