ZFS and Qubes OS in 2026

Some commentary, some open questions.

dom0 impact of ZFS

$ sudo qubes-dom0-update zfs works. But it installs 218 packages with an installation footprint of 291 MB, and these packages must stay installed because the kernel module must be rebuilt for every kernel upgrade. That’s tough to swallow re: security hygiene. I don’t think there’s any getting around this. The module build and some of the package footprint could be offloaded to an assistant VM, but ultimately you’re going to be running the product of that build in your dom0.

Swap memory under ZFS

ZFS swap datasets are unsound under high memory contention: ZFS must allocate memory for metadata as part of any write, including writes to swap storage, so there is a risk of OOM → swap offloading → ZFS allocation → but OOM! → doom.

It is not complicated to create a regular swap partition outside of the zpool and assign that for dom0’s use. But what about /dev/xvdc1 swap, in every running qube? Is that similarly risky if backed by ZFS? It is difficult to think through the possible memory contention scenarios, through the layers of abstraction/virtualization and my own partial understanding. Perhaps the risk is naturally mitigated by qube maxmem thresholds and qmemman, and by assigning extra slack memory to dom0 for ZFS’s use; or perhaps it’s more complicated.

  • Plausible workaround for domU: qubes-prefs default_pool_volatile should be directed to a different ‘regular’ Qubes pool outside of ZFS? Is it as simple as that? How large should the pool be – n * 1024M, where n is the max number of VMs you expect to run concurrently?
    • Drawback: loss of the data safety/integrity guarantees provided by ZFS vs ext4/xfs/etc.

For my present install I elected to YOLO dom0 swap and /dev/xvdc domU swap on ZFS, mostly because it was simplest to do, and rationalized with an intention to experiment with zram for less dependence on swap.

/boot on ZFS

It’s marginal, but I would prefer to move /boot to ZFS too, for checksumming and mirroring; I’m skeptical this will Just Work with coreboot+Heads, though.

Separate ZFS pools or separate datasets within a single ZFS pool

Under LVM, the dom0 root and the domU vm-pool are separate logical volumes. Under ZFS, the topology could be two partitions (perhaps under LVM) for two ZFS pools – one for dom0 root and another for the vm-pool – or it could be a single partition of a single zpool holding separate datasets for dom0 root and vm-pool. I think the single zpool approach is more efficient from a ZFS perspective, and the datasets can be tuned and quota’d individually just the same as separate zpools can. I’m not sure if there is more to consider here.

  • Are there security implications to including dom0 and domU storage within a single ZFS pool?

SSDs and free space

SSDs perform better and last longer with some amount of free storage. In a default partitioning, Qubes OS reserves 10% of the LUKS partition (so, a little less than 10% of the whole disk) as free space for this reason. ZFS too performs better and is more reliable with some amount of free storage, and keeps its own reservation for this (IIUC, 1/(2^5) = ~3.1% of the pool by default).

  • Is it redundant or is it helpful to have two separate reservoirs of free space? Could I forgo or reduce Qubes’s 10%?
  • Then change ZFS’s spa_slop_shift to 3 so that 1/(2^3) = 12.5% is reserved? (or 4, 1 / 2^4 = 6.25%?)
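The reserved fraction implied by a given spa_slop_shift is just 1 / 2^shift (modern ZFS also caps the slop space at an absolute maximum and minimum, so this is only the simple formula):

```python
def slop_fraction(shift: int) -> float:
    """Fraction of the pool ZFS reserves as slop space: 1 / 2**shift."""
    return 1 / 2 ** shift

for shift in (5, 4, 3):
    print(f"spa_slop_shift={shift}: {slop_fraction(shift):.2%} reserved")
```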

ZFS ARC and dom0 assigned memory

ZFS caches filesystem reads in its in-memory ARC, which by default can/will grow to >90% of system memory in contemporary ZFS. Under free this shows as used rather than as buff/cache, but I am reassured that the cache memory is nonetheless reclaimable.

The ARC lives in dom0. Usually dom0 is assigned a pretty small slice of system memory, with the rest going to VMs. But to host the ARC, dom0 needs more memory. It is also probably best to put a tighter limit on the max size of the ARC → less risk of OOM in dom0.

  • Larger dom0_mem=min:####M,max:####M on GRUB_CMDLINE_XEN_DEFAULT. How much makes sense, I wonder? I am assigning a lot for now while I experiment with ARC sizes.
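For reference, a hedged sketch of what that looks like in /etc/default/grub in dom0. The 6144M figure is a hypothetical placeholder, not a recommendation; setting min equal to max pins dom0’s allocation so Xen won’t balloon it down:

```shell
# /etc/default/grub (dom0) -- 6144M is a hypothetical placeholder
GRUB_CMDLINE_XEN_DEFAULT="... dom0_mem=min:6144M,max:6144M"
```

Then regenerate the GRUB config afterwards (the output path differs between BIOS and EFI installs).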

I am clamping the size of the ARC with a module config file:

/etc/modprobe.d/zfs.conf:

# ARC clamped to 3GB-4GB of memory
options zfs zfs_arc_max=4294967296
options zfs zfs_arc_min=3221225472
  • Note: specify the max before the min, or it seems neither will take effect
  • Still tuning the size. My first experiment clamped it to 4GB-6GB, and after a couple days of use metrics showed an ARC hit rate of 99.5%, which seems to me “too high”, hence the decreased range.
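The byte values in that snippet are plain GiB multiples, which a couple of lines of Python makes obvious:

```python
# Where the magic numbers in /etc/modprobe.d/zfs.conf come from:
# zfs_arc_max and zfs_arc_min are specified in bytes.
GiB = 1024 ** 3

zfs_arc_max = 4 * GiB
zfs_arc_min = 3 * GiB

print(f"options zfs zfs_arc_max={zfs_arc_max}")
print(f"options zfs zfs_arc_min={zfs_arc_min}")
```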

More:

Defaults

The qubes pool created by qvm-pool add -o container=<zpool name>/<dataset name> <qubes pool name> zfs has revisions_to_keep of 1, while the default LVM Qubes vm-pool has revisions_to_keep of 2. Not sure if this is an intended difference.

The dataset that qvm-pool creates has these properties (on a zpool created with all defaults / no property overrides):

[user@dom0 ~]$ zfs get all <zpool>/<dataset>
# ... most lines snipped ...
# recordsize  128K
# compression on     (implies lz4)
# atime       on
# xattr       sa
# copies      1
# dedup       off
# acltype     off
# relatime    on
# encryption  off
# direct      standard
# org.qubes-os:part-of-qvm-pool true

Compression

On a system with between 100 and 200 VMs:

[user@dom0 ~]$ zfs get compressratio laptop laptop/ROOT/os laptop/dom0-swap laptop/vm-pool
NAME              PROPERTY       VALUE  SOURCE
laptop            compressratio  1.69x  -
laptop/ROOT/os    compressratio  1.68x  -
laptop/dom0-swap  compressratio  3.38x  -
laptop/vm-pool    compressratio  1.69x  -

An overall compression ratio of 1.69x, I think meaning the data would have been 1.69x larger with no compression; or equivalently, the compressed size is 1 / 1.69 ≈ 59% of the uncompressed size, a storage savings of 41%. Pretty good!
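A quick check of that arithmetic:

```python
# Sanity-checking the compressratio arithmetic: a ratio of 1.69x
# means the compressed size is 1/1.69 of the logical size.
ratio = 1.69
compressed_fraction = 1 / ratio
savings = 1 - compressed_fraction

print(f"compressed size: {compressed_fraction:.0%} of original")
print(f"space saved:     {savings:.0%}")
```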

ZFS tuning

The zpool:

  • I see ashift=12 = 4k pool sector size recommended often, even for disks that report 512 byte sector size. But I also see it said that for many NVMe SSDs that (honestly) report 512B, ashift=9 (512B pool sector size) is most performant. It’s not clear to me if the potential performance gain/loss is worth the testing and tweaking, so I have left this at ashift=0 (default, autodetect) which for me results in ashift=9, which aligns with what my SSDs self-report.
  • ZFS encryption is not well maintained and no longer recommended by ZFS gurus, so encryption is best handled by LUKS
  • autotrim?
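Since ashift is the base-2 logarithm of the pool’s sector size, the sector sizes implied by the values above are easy to compute:

```python
def sector_size(ashift: int) -> int:
    """Pool sector size in bytes implied by an ashift value."""
    return 2 ** ashift

for a in (9, 12, 13):
    print(f"ashift={a} -> {sector_size(a)} byte sectors")
```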

The vm-pool dataset:

  • Common case IO is characterized by random reads and writes within large VM image files. So a recordsize of 16k or 128k would make more general sense than 1M, I think, except for the bulk copy that happens when a VM is cloned. However, individual VMs have different purposes and can have very different IO characteristics. What’s best for a vault qube is not what’s best for a torrenting qube or a server qube. I think this property can be tweaked per-VM within the same vm-pool, as each VM is itself a (sub)dataset.
  • atime=off and relatime=off? Nothing I work with cares about file access timestamps, and this saves some writes, specifically the kind-of dubious IO path of a write induced by a read.
  • Compression is on by default, compress=lz4. As shown above, this compression provides major space saving for VMs, with minimal processing cost.
  • Deduplication through dedup, though better than it used to be, is still probably not worth the RAM/metadata and processing cost? Maybe it would have value in a templates-only vm-pool/dataset? (see below)
  • direct is a newer IO feature said to be beneficial for NVMe and for virtualization, but maybe only for a carefully tuned workload? There is not much information on this yet. I am leaving it at its default (direct=standard)

The dom0 root dataset:

  • In hour-to-hour use dom0 is mostly read-only, with trickles of writes for logging. Occasionally, package upgrades. Rarely, a bulk copy during template installation. If this is a good summary, there’s little to gain (and I guess little to lose) in tuning.

Template deduplication?

It’s a refrain in ZFS literature that dataset deduplication doesn’t provide the tradeoffs the admin hopes it will, outside pretty specific workloads. If dedup is on and the recordsize is small then as the data grows so does the memory+processing cost of maintaining the metadata - the tracking for every single small block. The costs grow supralinearly, I understand. Major space savings, but not for free.

I am curious if it could have useful application specifically for a TemplateVM dataset in the zpool, for users who have many custom templates. I have very many templates, most of which are dupes of each other modulo a few packages. If all of these templates, and only templates, and perhaps only the root images of the templates, were placed in a dataset with deduplication enabled and a large recordsize (1 MB or greater) then perhaps the costs would be workable.

  • Large recordsize means fewer total blocks and less metadata tracking
  • VM images are big single files, so no storage waste due to small files + large recordsize
  • Template root images are mostly duplicate data, so few(er) unique blocks overall
  • Templates are only written to during Qubes Update, so write performance isn’t that important, and the write amplification impact to SSDs from a large recordsize would be mitigated by the hopefully much smaller total on-disk footprint.
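To put rough numbers on the “fewer blocks, less metadata” point, a back-of-the-envelope sketch. The 50 GiB of template images is a hypothetical figure, and the ~320 bytes per dedup-table entry is a commonly cited rough estimate (real overhead varies by ZFS version and pool layout):

```python
# Very rough dedup-table (DDT) memory estimate, assuming the commonly
# cited figure of ~320 bytes of RAM per unique block tracked.
BYTES_PER_DDT_ENTRY = 320  # assumption, not a guarantee

def ddt_ram_bytes(data_bytes: int, recordsize: int) -> int:
    """Estimate DDT RAM needed to track data_bytes at a given recordsize."""
    blocks = data_bytes // recordsize
    return blocks * BYTES_PER_DDT_ENTRY

templates = 50 * 2**30  # hypothetical: 50 GiB of template root images
for rs in (16 * 2**10, 128 * 2**10, 2**20):
    ram = ddt_ram_bytes(templates, rs)
    print(f"recordsize {rs // 1024:>5} KiB -> ~{ram / 2**20:.0f} MiB of DDT")
```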

On the other hand: like most users I have plenty of storage, and anyway compression is already providing good savings.

Misc notes

  • Sometimes qvm-remove will leave an empty directory/dataset for the VM in the zpool. zfs list shows several laptop/vm-pool/disp####, for example. Also laptop/vm-pool/.importing and laptop/vm-pool/.tmp - perhaps leftovers from a qvm-backup-restore run?

Admittedly I am doing something a bit different than using ZFS on the host.

I sometimes format a thumbdrive with ZFS (obviously it’s just a plain old one-disk pool) so I install zfs on my VMs (though the VM itself does not use ZFS…seems pointless for a disposable).


The size of ZFS’s codebase is comparable to that of btrfs, which is found in the Linux kernel. To me the scope and size of the project is not the main concern wrt security. What concerns me is that the code is signed by the ZFS authors, not any of the parties that are normally given ultimate trust on a Qubes system. This is not in any way a comment on the trustworthiness of the ZFS authors-- and I would say the same no matter who it was-- the thing I’m getting at is only that it represents an additional potential point of failure in your chain of trust.

Separately from that concern, I noticed a year or two ago that one of the ZFS dev signing keys was using an obsolete algorithm (not sure if this has been fixed).

I have to conclude that it’s not. A guest filling swap because the guest is low on memory has no relation to the host system’s free memory, and it is the host system wherein the ZFS overhead for /dev/xvdc1 happens.

As for dom0 swap, I don’t know if swap offers any benefit. Resource usage by dom0 is incredibly stable, so if you are hitting swap it probably means you should just increase dom0 memory. If you do use swap though, it ought to be its own dedicated partition on the main partition table for more-or-less reasons you mentioned.

I see no reason to keep dom0 and domUs in separate pools, for security or otherwise.

Whether that’s an intended use of spa_slop_shift, I have no idea. I’ve heard 15% free is optimal for ZFS performance, and I’ve accomplished that with quotas.

Depending on where exactly the free space is reserved, it might be a good idea to double-check that trim commands can propagate to the actual SSD.

I hope I am not making things up here but… as far as I am aware, ARC cannot be reclaimed by the kernel the same way the Linux page cache can (instantly) when programs request more memory from the kernel. ARC will try to shrink when it recognizes the system is low on memory, or when the kernel announces memory pressure, but it may be too late for some memory requests. In practice this would mean that, yes, programs can grow their memory usage, but only if they do so no faster than the ARC relinquishes it, and they will fail at requests that ask for a very large amount of memory all at once.

I won’t say it could never be an acceptable risk, however I’m never going to recommend anyone use compression on a Qubes pool, because it breaks isolation between dom0 and your qubes. By using compression you are tasking dom0 with running compression/decompression algorithms on data that is provided by guests. Those are algorithms which are not immune to security vulnerabilities. In fact I would recommend anyone currently using it to disable compression at least until they’ve seriously considered the risk.

idk about the state of ZFS encryption, but I use LUKS because, when it comes to encryption, why would I want to use anything but the gold standard?

recordsize isn’t applicable to qubes, because qubes use zfs volumes not filesystems. recordsize is for filesystems. I think the analog for volumes would be volblocksize.

Redundant data across templates can be reduced by cloning them from a common template. The disk usage cost of cloned volumes is almost free, since they share all their data blocks. The limitation is that fewer and fewer blocks remain shared as you perform updates on the templates and old blocks are replaced.

Yeah, the Qubes ZFS driver certainly has some room for improvement.


Thanks for your insight. I will reply to some of what you wrote and maybe come back with more later.

I think you are probably right. As long as dom0 always has free memory, and is the ZFS management VM so to speak, the swap OOM doom scenario should not happen, on dom0 or domU. At the least, ZFS swap is probably safer on Xen/Qubes than in a regular OS where a misbehaving app could wipe out all the system’s available memory. But need to keep the ARC size constrained and admin daemons memory leak-free.

I think your understanding is right.

I never considered that so I’ve been thinking about it. I do not run compression/decompression tools in dom0… yet I feel comfortable running implementations of these same algorithms, on my whole local storage array, in dom0? Some cognitive dissonance there.

And extrapolating that line of thought, any processing of domU storage carries some risk. The hashing done to create metadata to enable snapshotting and scrubbing-- suspect!

A security flaw of this kind in the ZFS compression code would be a catastrophe, as it’s such a widely used filesystem, and compression a touted feature. But, by virtue of its long existence, continued development, and wide use, it is a very well-tested and well-audited filesystem. As far as that goes for software of any kind, and especially that written in a non-memory safe language.

I think I will continue to use compression. I admit though there is some motivated reasoning on my part here, because I like the feature and I want to be using it. I won’t be as comfortable using it as I was before.

Thanks for this, I overlooked it entirely. I chose a recordsize of 128k, but never set volblocksize, which defaulted to 16k. I guess that is still ok, but it’s not what I intended.

Cool :+1: poor man’s dedup, for free. I suppose one could “rebase” their custom templates by periodically recreating them from the base template, resetting the block divergence.


Glad you found it helpful :slight_smile:

Content hashing poses no risk, in my unqualified opinion.

Relevant to the potential for security vulnerabilities, there are some crucial qualitative differences separating cryptographic algorithms and compression algorithms:

  1. Cryptographic primitives are pure math and therefore contain essentially no conditional statements / branching (beyond fixed loop iteration). Compression, on the other hand, has to perform analysis on the data, identifying patterns and making decisions-- and therefore has many conditionals. This is a danger because the inputs can influence code execution paths, which ultimately gives them influence over the process executing them.

  2. Cryptographic primitives don’t have invalid values for inputs. Decompression functions do, and those inputs can be invalid in many ways. Improperly handled invalid inputs are of course a major source of security vulnerabilities.

  3. Cryptographic primitives in some sense operate on fixed-length data, and so avoid needing to do memory management. Even cryptographic hashes that digest variable-length input do so by iterating over one fixed-size block at a time; they are not internally allocating buffers. Compression, on the other hand, builds dictionaries based on the entire input and for that has to perform memory management, which opens it up to the worst categories of vulnerabilities, including buffer overruns and use-after-free.

I think the consequences of these differences would be borne out if you researched any cryptographic or compression algorithm in a public vulnerability database.
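Point 2 can be seen in miniature with Python’s standard library: a hash function happily digests arbitrary bytes, while a decompressor must validate its input and can reject it:

```python
import hashlib
import zlib

# Arbitrary bytes that are not a valid zlib stream.
junk = b"\xff\x00 definitely not a zlib stream"

# A cryptographic hash has no invalid inputs: this always succeeds.
digest = hashlib.sha256(junk).hexdigest()
print(f"sha256 ok: {digest[:16]}...")

# A decompressor must parse and validate, and rejects garbage.
try:
    zlib.decompress(junk)
except zlib.error as e:
    print(f"decompress rejected input: {e}")
```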

If you still want to use zfs compression, consider at least disabling it on your higher-risk qubes, such as the one you use for browsing random websites (browser cache wouldn’t benefit much from compression anyway). By the way, a safe alternative would be to perform the compression inside the qube itself.
