TemplateVMs and StandaloneVMs on secondary pool cannot connect to qrexec agent

Using the user docs for creating secondary storage, I created an additional encrypted storage pool on an internal hard drive to offload the majority of my appVMs and preserve my ssd space. This works fine: using the lsblk command I can see that appVMs of any type go correctly to the assigned hdd-disk pool, and the qvm-pool list command lists both pools as well.
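For reference, a rough sketch of the documented dom0 steps I followed (the mapper name luks-hd0 is illustrative, and the exact qvm-pool syntax may vary by release):

sudo cryptsetup luksFormat /dev/sdb               # defaults, no explicit sector size
sudo cryptsetup open /dev/sdb luks-hd0
sudo pvcreate /dev/mapper/luks-hd0
sudo vgcreate qubes_hd0 /dev/mapper/luks-hd0
sudo lvcreate -l 90%FREE -T qubes_hd0/poolhd0     # create the thin pool
qvm-pool add poolhd0_qubes lvm_thin -o volume_group=qubes_hd0,thin_pool=poolhd0
qvm-pool list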

However, when I create a templateVM or standaloneVM on the hdd-disk pool, it is unable to start and presents the error: cannot connect to qrexec agent. Cloning had the same result in what I tested. I also tried giving those VMs more memory, but they failed with the same result. If I duplicate the VM creation but place it on the default vm-pool, it starts without issue, and with default settings (memory etc.), so memory doesn't appear to be the cause of that qrexec agent error, at least in these circumstances.
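For context, the CLI equivalent of what I was doing in the GUI would be something like the following (VM and template names are illustrative; -P picks the storage pool for all of the new qube's volumes):

qvm-clone -P poolhd0_qubes fedora-41 test-templatevm-fed41        # clone a template onto the hdd pool
qvm-create --class StandaloneVM --template fedora-41 --label red -P poolhd0_qubes test-standalone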

It appears the only appVMs located on the hdd-pool that start without error are of the type appVM (persistent home, volatile root). While the majority of my appVMs are of that type, I would like to create as many VMs as possible on the hard drive, which has oodles of space available, while my ssd is limited.

I checked out the logs of the qubes hitting the failure, and there appears to be an attempt to chroot to a NEW_ROOT. Shortly after is an attempt to reopen stdio to DEV, and then a Kernel panic - not syncing error. Game over. The log also states switch_root Not tainted 6.6.63-1.qubes.fc37.x86_64 and then throws a trace. I would guess that is when the qrexec agent error appears.

I do not know enough about how this works, as I'm fairly new to Linux and totally green with Qubes OS. It seems Qubes is either missing some configuration needed to run templateVMs and standaloneVMs in this setup, or is still coded somewhere to use the default pool for templates and standalones regardless of the selected disk pool. The GUI lists both pools as available when creating a VM, but something that actually allows it seems to be missing. Again, the appVM type (persistent home, volatile root) works fine from the hdd-pool with no issues. And yes, I am aware of the Caution note under the advanced tab about changed settings preventing startup, but if it works without issue for one appVM type, why not all of them?

Anybody have an idea why the template and standalone VMs have issues versus the (persistent home, volatile root) type? I mean, it's just disk space, and if it works for one, then why not the other? Also, seeing the chroot in the logs makes it seem like the volatile root would be the type with the issue, not the VMs with persistent roots. Other than this, Qubes OS has been working just fine. Thanks, anybody!

What storage driver are you using for the secondary pool? If it's the deprecated 'file' driver, I'd retry with 'file-reflink' or 'lvm_thin' before debugging this any further.

Hi, thanks for the reply, rustybird! I am using lvm_thin on both the vm-pool and the one I added (poolhd0_qubes).


Here is some info on the pool on the hard drive:

name poolhd0_qubes
driver lvm_thin
ephemeral_volatile False
revisions_to_keep 2
size 3000386977792
thin_pool poolhd0
usage 81910564493
volume_group qubes_hd0

Can you attach the logs?

Sure, here is the log from attempting to start a templateVM on the hd pool. The standaloneVM log is the same, but I can add it if needed. I have also previously tried Debian templates with the same results, if I recall correctly:

guest-test-templateVM-fed41.log (40.9 KB)

Smells like a 4k storage problem: Support 4k storage · Issue #4974 · QubesOS/qubes-issues · GitHub

Could you post the output of sudo blockdev --getpbsz --getss DEVICE in dom0 for your secondary hard disk device, and then (assuming that you've set up encryption) also for the corresponding /dev/mapper/ LUKS device?

Yep, after glancing through the GitHub thread it does seem possibly related… sorry it took a bit to check things out.

Secondary storage (hard drive):
sudo blockdev --getpbsz --getss /dev/sdb
4096
512

Result of the same command on the LUKS device:
4096
4096

Primary drive - ssd:
The physical and logical sector sizes on the primary drive (ssd) are the same as on the secondary (hd):
sudo blockdev --getpbsz --getss /dev/sda
4096
512

but the sector sizes on the LUKS volume on the primary drive match those of the primary drive itself:
4096
512

I'm not knowledgeable about cryptsetup and just used the defaults and what was listed in the 'Secondary storage' forum notes. Or is this an lvm thing and not a cryptsetup config concern?
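From what I've been reading, cryptsetup can apparently also show the sector size straight from the LUKS2 header; a small sketch, assuming the container is on /dev/sdb:

sudo cryptsetup luksDump /dev/sdb | grep sector   # LUKS2 lists the data segment's sector size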

I didn't catch the resolution of the 4k sector thread topic. Since Qubes OS set up the encrypted partition on my ssd, is a possible solution then to do a backup and recreate the hdd layout to match the ssd?

Like I mentioned earlier, I'm still learning… and your assistance is greatly appreciated. Thanks!


cryptsetup luksFormat --sector-size=512 should make your secondary drive work with lvm_thin. That's how the installer does it. (Luckily your drives have 512 byte logical sectors and only their physical sector size is 4096 bytes; otherwise even that cryptsetup option wouldn't help, and you'd have to use 'file-reflink' instead of 'lvm_thin' at the moment.)
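For a drive with nothing important on it yet, the whole sequence would be roughly the following in dom0 (mapper name illustrative; luksFormat erases everything on the disk):

sudo cryptsetup close luks-hd0                          # unmap it first, if currently mapped
sudo cryptsetup luksFormat --sector-size=512 /dev/sdb   # destructive!
sudo cryptsetup open /dev/sdb luks-hd0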

Ok, I can try that. Since luksFormat is needed, I looked into it and found the following:

You can only call luksFormat on a LUKS device that is not mapped.

Assuming the note is just referring to actually formatting the device, not just changing the sector size:

  1. Is this done in 'plain' mode using the /dev/mapper/luks name, or do I need to do a luksClose first and then run luksFormat on /dev/sdb?
  2. Is this doable with existing data, without data loss? I'm assuming no data loss, since in 'plain' mode the device is mapped?

luksFormat would indeed destroy the existing data! Sorry, I had assumed that there was no important data yet and that you were just trying out the secondary pool. :grimacing:

do I need to do a luksClose first and then run luksFormat on /dev/sdb?

That's right, but:

is this doable with existing data, without data loss?

In that case, you want cryptsetup reencrypt --sector-size=512 instead. It will take a while though because it has to rewrite the whole disk, even unused space.
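A minimal sketch of that, assuming the LUKS container sits directly on /dev/sdb and the pool's volume group is qubes_hd0 (mapper name illustrative):

sudo vgchange -an qubes_hd0                            # deactivate the volume group on the drive
sudo cryptsetup close luks-hd0                         # unmap the LUKS device
sudo cryptsetup reencrypt --sector-size=512 /dev/sdb   # rewrites the whole disk in place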

Awesome, glad I clarified!! :slightly_smiling_face:

Yes, I have 5 or 6 'daily use' vms out there, because most of them are of the type appVM (persistent home, volatile root) and they worked great. I was slowly converting things to use Qubes OS as my daily driver. Since those were working without issue, I started adding different types, and that's when I hit the snag.

Please let me know if you see possible issues with my test plan.

My plan will be:

  1. I'll do another backup, just in case.
  2. then maybe clone the most important appVMs over to the default vm-pool temporarily (and leave them off)
  3. do a luksClose on the hdd
  4. run cryptsetup reencrypt --sector-size=512 /dev/sdb
  5. do a luksOpen on the hdd
  6. check the result with the blockdev command (see the command sketch after this list)
  7. if no issues, test whether the previously created appVMs still start on the hdd-pool as they did before
  8. if all ok with appVMs on hdd, delete the cloned VMs I made on the default vm-pool
  9. test creating a templateVM and a standaloneVM on the hdd-pool to see if the problem is resolved
  10. report back results
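For steps 5 and 6, roughly the following (mapper name illustrative, matching the sketch above):

sudo cryptsetup open /dev/sdb luks-hd0
sudo vgchange -ay qubes_hd0                              # reactivate the volume group
sudo blockdev --getpbsz --getss /dev/mapper/luks-hd0     # should now print 4096, then 512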

Sounds good!

Ok! It might be a bit before I get back with results.

Thanks again for your assistance and direction!!


:smiley: Hey @rustybird, you were 100% correct! The 4k issue appears to have been the problem. I am now able to create both templateVMs and standaloneVMs on my secondary hdd, and they start up fine with no qrexec agent error!!

I cannot say 'thanks' enough; that was really bugging :beetle: me.

Side note: during my test procedure I may have inadvertently discovered a :beetle: bug in the gui when doing clones (or at least it does not work how one would think). My attempt to temporarily back up the VMs on the hdd, in case of issues with cryptsetup, by cloning them over to the ssd vm-pool did not actually work. Even though I had selected the ssd default vm-pool under the advanced tab as the clone target, it cloned them right back onto the hd0-pool, exactly where the originals were. :roll_eyes: lsblk verified this for me.
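A CLI alternative that lets you name the target pool explicitly might have worked around this; a sketch, with the VM name and default pool name assumed:

qvm-clone -P vm-pool my-appvm my-appvm-ssd-copy          # clone onto the default ssd pool
qvm-volume info my-appvm-ssd-copy:root | grep -i pool    # verify which pool the clone landed in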

Since I also had a backup (though I probably should have tested the backup :thinking:), I continued on with the reencrypt. I might also add that if the clones had actually gone over to the ssd, I could optionally have just used luksFormat and then recloned them back to the hdd after the format with the new sector size. Might have saved a little time. (I know you already know all this, but thought the extra info might help someone else down the line.)

Perhaps someone might put a note near the secondary storage documentation (Secondary Storage info) regarding the 4k sector-size issue? That way readers could handle the sector size right away rather than taking the path I ended up on. Or maybe this is addressed in a future OS release. I am running r4.2.3.

Also, if someone mentions the gui clone glitch to the Qubes OS developers, they might have a look at it for a future update?

Thanks Again!! :smiley:


Thanks, should be resolved soon:

In R4.3, a VM's early boot environment adjusts the 'root' volume's partition table for the correct sector size, so in most cases it's no longer a problem if a storage driver exposes 4k sectors. (Not sure if this feature is going to be backported to R4.2.)