About btrfs with compression in Qubes OS

logoerthiner · July 8, 2023, 10:08am

I am trying Btrfs with Qubes OS R4.2, but I think the questions here do not depend on the Qubes release.

First question. I tried to enable compression for the Btrfs partition by editing /etc/fstab and appending ,compress=zstd to the mount options of the Btrfs entry, but I have not rebooted the machine yet. Is this the suggested method, and are there any pitfalls?
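For illustration, the fstab edit described above could be sketched like this. It is only a sketch: add_zstd_to_fstab is a hypothetical helper name, and it prints the changed fstab instead of editing the file in place, so the result can be reviewed first.

```shell
# Print an fstab with ",compress=zstd" appended to the mount options of every
# btrfs entry that does not already set compress= or compress-force=.
# $1 is the fstab to read. The result goes to stdout; the file is not modified.
# (add_zstd_to_fstab is a hypothetical helper name, not a Qubes tool.)
add_zstd_to_fstab() {
    awk '
        # Pass comments and incomplete lines through unchanged.
        /^[[:space:]]*#/ || NF < 4 { print; next }
        # Field 3 is the filesystem type, field 4 the mount options.
        # Assigning to $4 rebuilds the line with single spaces between fields.
        $3 == "btrfs" && $4 !~ /(^|,)compress(-force)?(=|,|$)/ {
            $4 = $4 ",compress=zstd"
        }
        { print }
    ' "$1"
}
```

Usage would be something like add_zstd_to_fstab /etc/fstab > /tmp/fstab.new, followed by a manual review before copying it over /etc/fstab. The option takes effect on the next mount (a reboot, or sudo mount -o remount,compress=zstd / for the running system) and only applies to data written afterwards. Existing files stay as they are until they are rewritten or defragmented with -czstd.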

Second question. I want to know the best practice for defragmenting a template VM’s root volume.

In btrfs setup of Qubes OS, the root images and private images are merely sparse files in the btrfs partition.

And for each template VM, its root image seems to be cp --reflinked into every appvm using the template.

According to the Defragmentation page of the BTRFS documentation, reflinks are broken by defragmentation. Also, defragmenting at the Btrfs level without defragmenting the inner ext4 filesystem seems unwise.

I am thinking about a procedure to defragment one template VM’s root filesystem:
(1) power off every VM other than dom0
(2) change the template of every AppVM so that no VM points to this template
(3) use some method to generate a defragmented version of its rootfs
(4) btrfs filesystem defrag -czstd filename
(5) then set the template of those AppVMs back to this template

Is this viable? Are there any suggestions for better methods?

Third question. What are the suggestions about deduplication in Qubes OS? I see no deduplication tools installed in dom0 by default.

Fourth question. What are the suggestions for VMs with large volumes, most of which hold Docker containers?

rustybird · July 8, 2023, 11:49am

That should work, but I haven’t tested it. Let us know

No need. Shutting down an AppVM automatically deletes its root volume snapshot.

I wouldn’t go as far as calling it unwise. You can definitely get a nice performance improvement just by defragmenting the outer Btrfs layer.

But if you want to do the inner filesystem first, ext4 supports online defrag: It’s sudo e4defrag / in the TemplateVM for the root volume, or sudo e4defrag /rw in the AppVM for the private volume. If you add -c to the e4defrag command it will (only) check if the inner defragmentation is even necessary. It will usually show very very low fragmentation for the root volume, because from the inside view it’s a filesystem containing a bunch of small files and there’s hopefully plenty of free space.

(Never run e4defrag or any other inner-filesystem utilities on the VM volumes in dom0)
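Put together, the inner check and the outer defrag might look roughly like the sketch below. It only prints the commands so they can be reviewed before anything is run. print_template_defrag_plan is a hypothetical helper name, the image path assumes the default file-reflink pool under /var/lib/qubes/vm-templates, and shutting the template down before the outer defrag follows step (1) of the original procedure rather than any hard requirement.

```shell
# Print (do not run) the commands for defragmenting one TemplateVM root volume.
# $1 is the template name. The path below is an assumption (default pool).
# print_template_defrag_plan is a hypothetical helper name.
print_template_defrag_plan() {
    template=$1
    img="/var/lib/qubes/vm-templates/$template/root.img"
    cat <<EOF
# Inside the running TemplateVM (e4defrag runs in the VM, never in dom0):
qvm-run -p -u root $template 'e4defrag -c /'
# Optional, only if the check reports noticeable fragmentation:
qvm-run -p -u root $template 'e4defrag /'
# Outer Btrfs layer, in dom0, after shutting the template down:
qvm-shutdown --wait $template
sudo btrfs filesystem defragment -v -czstd $img
EOF
}
```

Calling it with a template name prints the four commands in order. The qvm-run lines are started from dom0, but e4defrag itself executes inside the TemplateVM, which is what the warning above requires.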

Yes. In the context of a TemplateVM: If you have a clone of it, or a StandaloneVM created from it, defragmenting the root volume of the original TemplateVM (or of its clone, or of the StandaloneVM) will waste disk space.

These instructions also apply, but with root instead of private:

For the same reason, if you’ve made any whole-filesystem snapshots in dom0 (containing your VM volume image files) you may want to delete them before defragmentation.

Personally I never bother defragging my TemplateVMs: Just recreating them whenever a new version appears in qvm-template (instead of upgrading the distribution in-place) keeps fragmentation in check.

logoerthiner · July 8, 2023, 12:33pm

Happy to hear your advice!

I was just noticing that the AppVMs had removed all their root.img files, and that the template folder has 3 different root images (root.img, root-precache.img, root@abcd.img), when I saw your reply. This is useful information!

I think I need template defragmentation rather than recreating the templates, since I will install many custom applications in the default template.

I have not rebooted the machine yet, but I have experimented a little. Just now I ran btrfs filesystem defrag -czstd -v -r dir on /etc and /usr, and on the files /var/lib/qubes/vm-kernels/*/modules.img. Here is my experience so far:

/var/lib/qubes/vm-kernels/*/modules.img is by default mistaken by the Btrfs heuristics for incompressible data. I need to set the compression flag manually with chattr +c modules.img. After compression the file takes around 30% of its original space. What a pity the file is not compressed by default.
The results for /etc and /usr are good: the compression ratio is 26% for /etc and 43% for /usr.
I find it difficult, if not impossible, to check whether compression has taken place without manually installing compsize.
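As a small worked example of the percentages above: the ratio is just disk usage divided by uncompressed size. The sketch below assumes the two byte counts are read off compsize output by hand (compsize -b prints raw byte counts). ratio_percent is a hypothetical helper name.

```shell
# Measuring, as described above (run in dom0, compsize must be installed):
#   sudo compsize -x /usr
# Forcing compression on a file the heuristic skips, then recompressing it:
#   sudo chattr +c modules.img
#   sudo btrfs filesystem defragment -czstd modules.img

# Print disk usage as a rounded percentage of the uncompressed size.
# $1 = bytes used on disk, $2 = uncompressed bytes (both integers, $2 > 0).
# (ratio_percent is a hypothetical helper name.)
ratio_percent() {
    disk=$1
    uncompressed=$2
    echo $(( (disk * 100 + uncompressed / 2) / uncompressed ))
}
```

For instance, ratio_percent 430 1000 prints 43, which matches the way the /usr figure is quoted: the compressed data takes 43% of the space the uncompressed data would.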

I still have no idea how to deduplicate though.

logoerthiner · July 8, 2023, 2:40pm

Actually, when I read this sentence for the 5th time, I thought of this: running compression on VM storage is itself a security hole, because VM storage is “untrusted”, and so is compress=zstd in dom0’s /etc/fstab, which does exactly the same thing. However, if I must choose, I choose SSD lifespan and boot speed over some hypothetical attack vector.

Any ideas for more secure ways to compress “untrusted” VM data?

rustybird · July 8, 2023, 3:05pm

It should be fine - I can’t imagine (it would be horrifying!) that Btrfs’s compression would treat untrusted file data as anything but a meaningless blob of bytes. With compress= (compared to compress-force=) it’s simply trying to compress the first few of those bytes to see if the resulting compression ratio is good enough that it makes sense to turn on compression for the file. It’s not doing the equivalent of running the file utility to interpret the file’s content.

That sentence was about running e.g. e2fsck or mount in dom0 on a .img file containing a VM’s filesystem, which is dangerous because a mountain of C code would be parsing/interpreting the complex ext4 data structures contained within the file. (I just noticed that e4defrag, like btrfs filesystem defragment, is only for mounted filesystems anyway.)

Similarly, it should be fine to run btrfs filesystem defragment foo (with or without a -c compression argument) even if foo is a malicious file, because the command also doesn’t care about the specific content of foo at all. (Although just for the record - since it does care about how exactly foo is fragmented, in case of a VM’s image file the VM can influence that to, theoretically(!), exploit bugs in the defragmentation code.)

logoerthiner · July 8, 2023, 5:02pm

I was thinking that untrusted data might exploit compressor bugs; however, in that case VM backups would be affected as well.

rustybird · July 8, 2023, 7:33pm

The compressor side seems like it would be harder to get wrong in an exploitable way, cause it inherently has to deal with arbitrary data all the time in the normal course of operation, as opposed to the decompressor side dealing with more structured (but in this case thankfully, trusted) data that could be tampered with to make it weird and break some assumption in the code.

zstd did have CVE-2019-11922 related to compression (I don’t know if it affected Btrfs at all).

logoerthiner · July 10, 2023, 4:57am

Actually I am confused about the -precache.img files. When I grepped for this in the source code, I saw your name at the top of /usr/lib/python3.8/site-packages/qubes/storage/reflink.py, so I believe you are the best person to ask.

When I have powered off every VM other than dom0 itself, and I do not need the revisions, is it safe in Btrfs setups to remove every root* file (revisions + precache) other than root.img in the vm-templates/* folders? I am running a deduplication experiment and I am unsure about this.
I also want to confirm my impression that all the .img files under every VM are just normal sparse files inside the Btrfs filesystem and can be handled with the various Btrfs optimizations such as deduplication and compression (so if I produce a file by some other means, for example by receiving it into dom0 with qvm-run, and move it into the folder, it should work without problems). Or do such files have some special attribute that needs fixing up?
With a Btrfs setup, when a VM reads from its virtual disk, is the dom0 filesystem cache in dom0 memory used? If so, is it wise to give dom0 more RAM so that more disk blocks are cached and access is faster? Is it wiser to rely on the dom0 filesystem cache or on the per-VM filesystem cache?

rustybird · July 10, 2023, 9:08am

The -precache.img file is safe to remove at any point in time. Almost the same is true for @ revision .img files, just don’t remove those during the VM’s shutdown or other operations that deal with revisions for that particular volume (like importing data when a backup is being restored).
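A minimal sketch for finding those files in one VM directory before deciding what to delete. list_removable_images is a hypothetical helper name, and the name patterns are taken from the file names mentioned earlier in this thread (root-precache.img, root@abcd.img).

```shell
# List precache and revision image files directly inside one VM directory.
# $1 is the directory, e.g. a folder under /var/lib/qubes/vm-templates.
# It only lists. Nothing is deleted, and root.img / private.img never match.
# (list_removable_images is a hypothetical helper name.)
list_removable_images() {
    find "$1" -maxdepth 1 -type f \
        \( -name '*-precache.img' -o -name '*@*.img' \) | LC_ALL=C sort
}
```

The output can be reviewed and then passed to rm by hand, with the VM shut down and no backup restore running, per the caveat about revision files above.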

They’re normal files, but if you replace or modify the content of e.g. private.img make sure you do remove the corresponding private-precache.img if it exists, ideally as the first step so that at no point in time a stale precache file exists. (There is some code that would catch that if the files have a different modification time but don’t rely on it, it’s not really meant to handle user error.)
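The ordering described above (precache first, then the image) can be sketched as follows. replace_volume_image is a hypothetical helper name, and the sketch assumes the VM is shut down while it runs.

```shell
# Replace a volume image, removing the matching precache file FIRST so that
# a stale precache never exists next to the new content.
# $1 = new image file, $2 = target such as .../private.img
# (replace_volume_image is a hypothetical helper name.)
replace_volume_image() {
    new=$1
    target=$2
    # private.img -> private-precache.img, root.img -> root-precache.img
    precache="${target%.img}-precache.img"
    rm -f -- "$precache" || return 1
    mv -- "$new" "$target"
}
```

If the new file is on a different filesystem, mv falls back to copying, so it is probably better to stage it inside the pool directory first.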

Since R4.2 the loop devices that connect the image files to the VMs are set up with direct I/O to bypass the dom0 page cache.
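One way to check this on a running system is the DIO column of losetup, which shows 1 for loop devices using direct I/O and 0 otherwise. A small sketch of a filter for that output (loops_without_dio is a hypothetical helper name):

```shell
# Read "NAME DIO BACK-FILE" lines on stdin and print the names of loop
# devices that are NOT using direct I/O (DIO column is 0).
# Intended input, in dom0:
#   losetup --list --noheadings --output NAME,DIO,BACK-FILE | loops_without_dio
# (loops_without_dio is a hypothetical helper name.)
loops_without_dio() {
    awk '$2 == 0 { print $1 }'
}
```

Empty output would mean every active loop device bypasses the dom0 page cache.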

Insurgo · July 23, 2023, 5:40am

Hey @rustybird ! Didn’t know you were file-reflink maintainer. Good to know!

If the present thread is about compression (whose end goal is lower space consumption, amongst other things like reduced writes and faster reads): I have been interested in pool deduplication for a really long time and finally found bees, but I have not found it compiled for a recent version, even on copr for Fedora (dom0…); it is built only for openSUSE…

Pulled qubes-builder, tried to use it to build bees for Q4.2, didn’t reach make qubes. Gave up on the pet project.
Would be nice if someone could do the magic for bees to land in the unstable repos to be tested on Q4.2. I would gladly open a thread in the testing section if that happens and restore all my qubes, which are mostly build caches for the Heads project on different OSes (Debian 10, 11, 12 and nix templates specialized for different things, from minimal-debian for builds up to fully blown, full-of-junk templates), which would show massive deduplication as a proof of concept, proving the worth of going down that path, I’m pretty sure!
Interested?

I would really love to have your input on that!

If compression is interesting, I continue to believe that deduplication at the pool level makes so much sense for Qubes. Just look at my wyng-backup deduplicated archive stats (Whonix templates are 60% similar) and my clones of everything, which cost nothing but their diffs until I use wyng to restore previously deleted local volumes. Unless I manually clone a similar qube and restore with wyng --sparse-write directly into the thin LVM volume, so that only changes are written and duplication stays low, other full-volume restores or clones grow big after they are cloned and currently take up all of my pool, which I have to swap qubes in and out of. Dedup on write at the pool level would save my days, because I wouldn’t need to do that anymore!

Recommendation from @brendanhoar at Pool level deduplication? - #21 by brendanhoar
(Where i invite you to participate in that thread with your knowledge, of course!!!)

One last time.
Please please please push a spec file for the builder so that bees gets dropped into the Q4.2 unstable repo?

Btrfs deduplication docs: Deduplication (BTRFS documentation)

Most recent copr build (8 months old, 7.2 release; not good enough for Fedora 37):
https://copr.fedorainfracloud.org/coprs/proletarius101/bees/builds/

Edit: found src rpm
https://rpmfind.net/linux/RPM/opensuse/tumbleweed/x86_64/bees-0.9.3-1.2.x86_64.html
Will try to rpmbuild it in a Fedora 37 template, I guess.

Spec file: Show openSUSE:Factory / bees - openSUSE Build Service (4 months old)

If that happens, I will follow the guidance from tasket and restore my VMs on Q4.2 on Btrfs: Restore lvm backups on brtfs? · Issue #166 · tasket/wyng-backup · GitHub (thanks @tasket !)

tasket · July 23, 2023, 12:19pm

Filesystem deduplication vs defragmentation is a balancing act, when you think about it: They usually do opposite things with the data & metadata. IIRC many defragging and dedup utilities have some notion of a granularity threshold below which they won’t make small adjustments. Even discard/trim code tries to take this into account, as it resembles deduplication and can increase fragmentation.

Adding compression changes the picture somewhat, because in an fs like Btrfs this causes the metadata usage to become quite high and more granular. In that case a fully compressed filesystem benefits more from deduplication than defragmentation.

I’m inclined to say that tuning the balance between them is not something a typical user should be very concerned about. They may become aware at some point that their storage could benefit from compression, and also that some of their VMs have a lot of data in common. Giving users a way to list VMs in ‘affinity groups’ (which are assigned to storage pools) would probably help, in addition to having a way to indicate that compression is preferred for those groups (or individual VMs). With two VMs listed in an affinity group, an OS could periodically deduplicate their volumes together, or a backup program could automatically use a snapshot from one VM as the starting point for a restore on the other VM.

brendanhoar · December 28, 2023, 8:53am

FWIW, I think it is important to recognize the primary area in Qubes that could benefit the most from deduplication is the backup/restore process.

In particular, for those who set up customized templates based off a single distro (e.g. cloned and diverged), it’s pretty easy to end up with a backup/restore circuit where it is no longer possible to restore everything you backed up onto a freshly re-installed system because the original data was implicitly deduped due to template cloning, but the restore process cannot do that and the restore process will run out of space. You end up having to either discard and recreate some diverged templates or install new storage hardware with higher capacity.
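To make the space problem concrete, here is a small worked example with made-up numbers. It assumes every clone stays about as large as the base template and that a restore writes each volume out in full. restore_sizes is a hypothetical helper name.

```shell
# Compare space used by reflinked clones with space needed to restore them
# without deduplication. All arguments are integers in the same unit (e.g. GiB).
# $1 = size of the base template
# $2 = number of clones made from it
# $3 = data each clone has changed since cloning (its "diff")
# Prints: "<on-disk size with sharing> <size after a full restore>"
# (restore_sizes is a hypothetical helper name.)
restore_sizes() {
    base=$1
    clones=$2
    diverged=$3
    on_disk=$(( base + clones * diverged ))
    restored=$(( (clones + 1) * base ))
    echo "$on_disk $restored"
}
```

With a 10 GiB template and 4 clones that each diverged by 2 GiB, restore_sizes 10 4 2 prints "18 50": the original system needed about 18 GiB, while the restore needs about 50 GiB.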

B

logoerthiner · February 9, 2024, 2:20am

In Btrfs, when I specify the “DUP” profile, the copy that gets read is determined by PID parity (BTRFS raid-1: which device gets the reads? - Stack Overflow). When a VM issues a disk read request, which PID does the request come from on the Btrfs side, and what parity will it have? Can I control this reliably if I have an SSD and an HDD?

rustybird · February 11, 2024, 2:01pm

Not sure, but I’ve replied in your other thread: