Debian appvm with zfs fails to start

Update: An appvm based on the debian-12-minimal template works fine: I’m able to start, restart, and access my zfs filesystem.

My goal is to have an appvm with zfs and an ssh server. I am using Qubes OS 4.2-rc3 with updates.

A template based on debian-12-xfce with zfs installed starts and restarts without complaint.

I created a new appvm based on this template using the ‘Create New Qube’ system tool. I verified in the settings that the kernel is set to pvgrub2-pvh. The appvm fails to start with the error message ‘Cannot connect to qrexec agent for 60 seconds: …’
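For anyone hitting the same error, a couple of dom0 checks may help narrow it down (a sketch using the standard Qubes tools; the VM name `myvm` is a placeholder):

```shell
# Show which kernel the qube boots with; 'pvgrub2-pvh' means the
# in-VM (distribution) kernel is used via grub.
qvm-prefs myvm kernel

# Optionally give the agent more time before the
# "Cannot connect to qrexec agent" timeout fires (default is 60s).
qvm-prefs myvm qrexec_timeout 120

# Watch the VM's console during boot to see where it hangs.
sudo xl console myvm
```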

The relevant part of the log appears to be an error involving libglib-2.0:
[2023-10-26 21:34:40] Starting qubes-gui-agent.service - Qubes GUI Agent…
[2023-10-26 21:34:40] Starting qubes-qrexec-agen…e - Qubes remote exec agent…
[2023-10-26 21:34:40] [  OK  ] Started serial-getty@hvc0.service - Serial Getty on hvc0.
[2023-10-26 21:34:40] [  OK  ] Reached target getty.target - Login Prompts.
[2023-10-26 21:34:40] [  OK  ] Started qubes-gui-agent.service - Qubes GUI Agent.
[2023-10-26 21:34:40] [  OK  ] Started qubes-qrexec-agent…ice - Qubes remote exec agent.
[2023-10-26 21:34:40] [  OK  ] Finished qubes-misc-post.s… - Qubes misc post-boot actions.
[2023-10-26 21:34:40] [  OK  ] Started xendomains.service… Start/stop secondary xen domains.
[2023-10-26 21:34:40] [  OK  ] Reached target multi-user.target - Multi-User System.
[2023-10-26 21:34:40] Starting systemd-update-ut… Record Runlevel Change in UTMP…
[2023-10-26 21:34:40] [  OK  ] Finished systemd-update-ut… - Record Runlevel Change in UTMP.
[2023-10-26 21:34:40] [    5.654395] fbcon: Taking over console
[2023-10-26 21:34:41]
[2023-10-26 21:34:41] Debian GNU/Linux 12 zfs hvc0
[2023-10-26 21:34:41]
[2023-10-26 21:34:41] zfs login: [    7.997875] show_signal: 13 callbacks suppressed
[2023-10-26 21:34:42] [    7.997879] traps: light-locker[903] trap int3 ip:7115b93037d7 sp:7ffc0b9d3b70 error:0 in libglib-2.0.so.0.7400.6[7115b92c5000+8d000]
[2023-10-26 21:34:51] [   16.603372] systemd-journald[254]: Time jumped backwards, rotating.
[2023-10-26 21:44:48] Stopping session-c3.scope - Session c3 of User user…
[2023-10-26 21:44:48] [  OK  ] Removed slice system-modpr…lice - Slice /system/modprobe.
[2023-10-26 21:44:48] [  OK  ] Stopped target multi-user.target - Multi-User System.

Any suggestions would be appreciated.

Try to increase the initial memory for this appvm.

I increased it from 400 MB to 2000 MB and get the same message.

Thanks for the suggestion.

Additional troubleshooting background:

From a fresh debian-12-xfce template:

An appvm created from the template after installing the distribution kernel in the template starts and restarts without complaint.

An appvm created after also installing dkms also starts and restarts.

An appvm created after also installing zfs doesn't start.

Update:

Starting from the debian-12-minimal template, I am able to install the distribution kernel, dkms, and zfs.

An appvm created from the resulting minimal template starts and zfs works.
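For reference, the install sequence in the minimal template looks roughly like this (a sketch; the package names are the usual Debian bookworm ones, and installing ZFS assumes the `contrib` component is enabled in the template's APT sources):

```shell
# Inside the debian-12-minimal template:
sudo apt update

# Distribution kernel plus headers (dkms needs the headers to build the module)
sudo apt install linux-image-amd64 linux-headers-amd64

# dkms first, then ZFS: zfs-dkms builds the kernel module,
# zfsutils-linux provides the zpool/zfs command-line tools
sudo apt install dkms
sudo apt install zfs-dkms zfsutils-linux
```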


I have run into the same issue. Installing dkms and zfs into a debian-12-xfce template results in a working ZFS system when booting the template itself, but not for an AppVM based on the template.

I get the following lines repeated over and over until the VM is killed because the qrexec agent could not be contacted within 60 seconds.

[2024-02-17 11:49:22] [*     ] Job systemd-udev-settle.service/start running (11s / 3min)
[2024-02-17 11:49:22] [**    ] Job systemd-udev-settle.service/start running (11s / 3min)
[2024-02-17 11:49:23] [***   ] Job systemd-udev-settle.service/start running (12s / 3min)
[2024-02-17 11:49:23] [ ***  ] Job systemd-udev-settle.service/start running (13s / 3min)
[2024-02-17 11:49:24] [  *** ] Job systemd-udev-settle.service/start running (13s / 3min)
[2024-02-17 11:49:25] [   ***] Job systemd-udev-settle.service/start running (14s / 3min)
[2024-02-17 11:49:25] [    **] Job systemd-udev-settle.service/start running (14s / 3min)
[2024-02-17 11:49:25] [     *] Job systemd-udev-settle.service/start running (15s / 3min)
I narrowed it down to the systemd-udev-settle service starting and dying over and over. Looking into this service, I found that ZFS is pulling it in, even though its use has been discouraged since 2011 and it has been deprecated since 2018. There is an open issue filed against ZFS to remove this dependency.
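If you want to confirm on your own system what is pulling the settle service in, systemd can show the reverse dependencies (a generic diagnostic, not Qubes-specific):

```shell
# Show which units depend on systemd-udev-settle.service
systemctl list-dependencies --reverse systemd-udev-settle.service

# Inspect the ZFS unit that declares the dependency
systemctl cat zfs-import-cache.service | grep -i settle
```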

To allow the AppVM to boot, I simply masked the service:

sudo systemctl mask systemd-udev-settle

That seems to have resolved the issue and allows the AppVM to start. As far as I can see, there is no effect on ZFS, but I have only just made the change, so I will update this post if any unexpected side effects become apparent.

I have noticed something on VM startup, but I am not sure whether it is related to masking the settle service or is just how ZFS behaves when used in an AppVM.

When I start the AppVM, zpool reports no pools present. I have to run zpool import -a for the pool to be recognized and appear. I am wondering whether this is caused by the need to manually attach the pool's devices after the AppVM has started (so ZFS has already scanned the devices on AppVM startup, found none, and therefore no pools), or whether the masked service plays a part in detecting pools on newly added devices.

ZFS newbie here, so if any experienced users can shed light on this, I would appreciate it.

After doing a bit more debugging, it looks like the workaround of masking the settle service is the cause of the pools not being imported.

Here is the relevant portion of the logs.

Feb 18 18:18:46 localhost kernel: ZFS: Loaded module v2.2.2-4~bpo12+1, ZFS pool version 5000, ZFS filesystem version 5
Feb 18 18:18:46 localhost systemd[1]: Finished zfs-load-module.service - Install ZFS kernel module.
Feb 18 18:18:46 localhost systemd[1]: zfs-import-cache.service - Import ZFS pools by cache file was skipped because of an unmet condition check (ConditionFileNotEmpty=/etc/zfs/zpool.cache).
Feb 18 18:18:46 localhost systemd[1]: Reached target zfs-import.target - ZFS pool import target.
Feb 18 18:18:46 localhost systemd[1]: Starting zfs-mount.service - Mount ZFS filesystems…
Feb 18 18:18:46 localhost systemd[1]: Starting zfs-volume-wait.service - Wait for ZFS Volume (zvol) links in /dev…
Feb 18 18:18:46 localhost systemd[1]: Finished zfs-mount.service - Mount ZFS filesystems.
Feb 18 18:18:46 localhost zvol_wait[484]: No zvols found, nothing to do.
Feb 18 18:18:46 localhost systemd[1]: Reached target local-fs.target - Local File Systems.
Feb 18 18:18:46 localhost systemd[1]: Starting apparmor.service - Load AppArmor profiles…
Feb 18 18:18:46 localhost systemd[1]: Starting qubes-early-vm-config.service - Early Qubes VM settings…
Feb 18 18:18:46 localhost systemd[1]: Starting systemd-binfmt.service - Set Up Additional Binary Formats…
Feb 18 18:18:46 localhost systemd[1]: systemd-machine-id-commit.service - Commit a transient machine-id on disk was skipped because of an unmet condition check (ConditionPathIsMountPoint=/e>
Feb 18 18:18:46 localhost systemd[1]: Starting systemd-tmpfiles-setup.service - Create Volatile Files and Directories…
Feb 18 18:18:46 localhost systemd[1]: Finished zfs-volume-wait.service - Wait for ZFS Volume (zvol) links in /dev.
Feb 18 18:18:46 localhost systemd[1]: Reached target zfs-volumes.target - ZFS volumes are ready.

This is caused by the settle service not running, and by zfs-import-cache.service not importing the pool as a result. The PR filed against ZFS mentions this service specifically as the only service that relies on the settle service.

The only downside I can see to the manual import is that zpool history shows every import, so a new entry is added with each reboot of the AppVM.
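A possible way to avoid the manual import entirely: the boot log above shows zfs-import-cache.service being skipped because /etc/zfs/zpool.cache is empty, so one option (untested here, sketched from the standard ZFS and systemd tooling; "mypool" is a placeholder pool name) is to populate the cache file once and override just the settle dependency instead of masking the whole service:

```shell
# Record the pool in the cache file so zfs-import-cache.service's
# ConditionFileNotEmpty=/etc/zfs/zpool.cache check is satisfied
sudo zpool set cachefile=/etc/zfs/zpool.cache mypool

# Create a drop-in for zfs-import-cache.service instead of masking settle
sudo systemctl edit zfs-import-cache.service
# In the drop-in, empty assignments clear the inherited lists:
#   [Unit]
#   Requires=
#   After=
# Note: this clears ALL Requires=/After= entries, not only the settle one,
# so it may drop ordering the unit needs; check `systemctl cat
# zfs-import-cache.service` first and re-add any other entries you need.
```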