I didn’t get much of a response but that was my fault for burying it as part of a list of side-issues in a post about using zfs on qubes.
Well, now I know a LITTLE bit more about the zfs/qubes-networking incompatibility and it’s still puzzling.
If the qube that has both of these installed has no network vm, it starts up fine. You can then connect a network vm and things will work properly.
But if the network vm is set before qube startup, the qube freezes for about sixty seconds and then dies, with an error saying it could not connect to the qrexec agent within 60 seconds. The log file referenced in the error popup is effectively identical in both cases (only timestamps, data rates, and uuid differ), except that the successful startup ends with an announcement of the qube name and a login prompt. There’s no error message in that log file explaining that qrexec didn’t connect, or why.
So I have a very unsatisfactory workaround: don’t connect the sys-firewall qube as a network vm to any qube that has zfs capability until after the zfs-capable qube has started. At this point I’d rather just not have zfs capability on those qubes. (I can still loop zfs blocks on those qubes and mount them on a zfs-capable qube with no networking and read them – which is what I normally do anyway.)
Packages: for zfs: zfs-zed and zfsutils-linux. For qubes-networking: qubes-core-agent-networking.
You can increase the qube’s qrexec_timeout and connect to its console using qvm-console-dispvm, maybe you’ll be able to access the console to see what failed to load.
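For anyone following along, the dom0 side of that looks roughly like this (the qube name `my-new-qube` is a placeholder; check `qvm-prefs --help` on your release for the exact property handling):

```shell
# In dom0: give the qube more time before qrexec gives up
qvm-prefs my-new-qube qrexec_timeout 300

# Start it, then attach a disposable console viewer to watch the boot
qvm-start my-new-qube
qvm-console-dispvm my-new-qube
```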
I guess I don’t have any idea how to use qvm-console-dispvm. It opens a completely blank window I can type in, but nothing I type gets a reaction.
However, upping the qrexec and shutdown delays to 300 seconds revealed that the qube WILL start, after a bit over two minutes.
I have no idea what to look at to try to figure out what went on other than /var/log/xen/console.
/var/log/xen/console/guest-my-new-qube.log has a message that apparently (it’s hard to read because someone stuffed the file full of color escape sequences) systemd-udev-settle didn’t complete device initialization. The next lines show errors installing the zfs kernel module and importing zfs pools – these messages are output after the two-minute delay.
Those three messages do not show up in the log for a qube that has zfs but no qubes-core-agent-networking installed.
However, running systemd-analyze critical-chain on the qube shows that the delay seems to have been in local-fs-pre.target. It started at 321ms; the next unit, run-credentials-systemd\x2dtmpfiles\x2dsetup.service.mount, seems to have started at 2min 676ms.
This is ultimately based off of a minimal template, by the way.
Checking the log file (thanks for the command!), the following is reported one minute in (note: I’m typing this in while reading it in the terminal, so typos are more than likely):
localhost (udev-worker)[320]: eth0: Spawned process '/usr/bin/systemctl restart --job-mode=replace qubes-network-uplink@eth0.service' [629] is taking longer than 59s to complete
systemd-udevd[276]: eth0: Worker [320] processing SEQNUM=1958 is taking a long time
a minute later:
localhost udevadm[267]: Timed out for waiting the udev queue being empty.
localhost systemd[1]: systemd-udev-settle.service: Main process exited, code=exited, status=1/FAILURE
localhost systemd[1]: systemd-udev-settle.service: Failed with result 'exit-code'.
localhost systemd[1]: Failed to start systemd-udev-settle.service - Wait for udev To Complete Device Initialization.
localhost systemd[1]: Dependency failed for zfs-load-module.service - Install ZFS kernel module.
localhost systemd[1]: Dependency failed for zfs-import-cache.service - Import ZFS pools by cache file.
localhost systemd[1]: zfs-import-cache.service: Job zfs-import-cache.service/start failed with result 'dependency'.
localhost systemd[1]: zfs-load-module.service: Job zfs-load-module.service/start failed with result 'dependency'.
localhost systemd[1]: Reached target zfs-import.target - ZFS pool import target.
localhost systemd[1]: zfs-mount.service - Mount ZFS filesystems was skipped because of an unmet condition check (ConditionPathIsDirectory=/sys/module/zfs).
After this it looks routine, save for further complaints from ZFS.
It seems to be a problem with the zfs systemd services requiring systemd-udev-settle.service:
qubes-network-uplink.service requires network-pre.target to be reached, which requires local-fs.target to be reached.
The zfs service zfs-mount is set to run before local-fs.target.
But zfs-mount needs other zfs services (zfs-import-cache, zfs-import-scan, zfs-load-module) to load first, and those require systemd-udev-settle.service to start – which can’t start before qubes-network-uplink.service is finished loading.
So it’s a dependency mess.
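Inside the affected qube, that chain can be inspected with standard systemctl queries (no Qubes-specific tooling needed); something like:

```shell
# What pulls in systemd-udev-settle.service (reverse dependencies)
systemctl list-dependencies --reverse systemd-udev-settle.service

# What zfs-mount.service requires and how it is ordered
systemctl show zfs-mount.service -p Requires -p After -p Before
```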
No idea how to solve it properly, except for overriding the zfs systemd units and removing the Requires=systemd-udev-settle.service line from them:
find /lib/systemd/system/zfs* -type f -exec grep -q '^Requires=systemd-udev-settle.service' {} \; -exec cp -t /etc/systemd/system {} +
find /etc/systemd/system/zfs* -type f -exec sed -i "/Requires=systemd-udev-settle.service/s/^#*/#/" {} \;
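To sanity-check the sed edit without touching real unit files, here’s a minimal sketch against a throwaway directory (the fake unit file and temp path are illustrative, not from the actual system):

```shell
# Create a scratch directory with a fake zfs unit file
tmp=$(mktemp -d)
cat > "$tmp/zfs-load-module.service" <<'EOF'
[Unit]
Description=Install ZFS kernel module
Requires=systemd-udev-settle.service
After=systemd-udev-settle.service
EOF

# Same sed as above: prefix the Requires= line with '#'
find "$tmp" -name 'zfs-*.service' -type f \
  -exec sed -i '/Requires=systemd-udev-settle.service/s/^#*/#/' {} \;

grep '^#Requires=' "$tmp/zfs-load-module.service"
# → #Requires=systemd-udev-settle.service
```

Note that the `s/^#*/#/` substitution is idempotent: re-running it won’t stack extra `#` characters, because any existing run of leading `#`s is replaced by a single one.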
This appears to work… at least, when I ran those commands (with ‘sudo’) in the template, shut down the template, and restarted the AppVM, it came right up and the network was working (I was able to ping something).
Thanks! Systemd is an impenetrable black box to me.
EDIT: no, this is insufficient. Not sure why it “worked” the first time I tried it but I can’t get it to work now.
Do you have the same errors?
Check that the zfs service files exist in /etc/systemd/system/ and that they have Requires=systemd-udev-settle.service commented out.
Apparently, when I updated my salt file, both commands operated on /lib/systemd/system/zfs*.
The second one should have operated on /etc/systemd/system/zfs*. I’m thinking I must have entered the command properly by hand (which is why it worked “the first time”) and then botched the salt file.
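For reference, one way to express the fix in a Salt state – a hypothetical sketch (the state IDs are invented here), with the second command pointed at /etc/systemd/system as intended:

```yaml
# Hypothetical state IDs; the commands mirror the two shell one-liners above.
copy-zfs-units-needing-override:
  cmd.run:
    - name: >
        find /lib/systemd/system/zfs* -type f
        -exec grep -q '^Requires=systemd-udev-settle.service' {} \;
        -exec cp -t /etc/systemd/system {} +

comment-out-udev-settle-requires:
  cmd.run:
    - name: >
        find /etc/systemd/system/zfs* -type f
        -exec sed -i '/Requires=systemd-udev-settle.service/s/^#*/#/' {} \;
    - require:
      - cmd: copy-zfs-units-needing-override
```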
Salt is now running (it will take time); I expect it will work this time, and if so I will re-check your comment as the solution.