Some commentary, some open questions.
dom0 impact of ZFS
`sudo qubes-dom0-update zfs` works. But it installs 218 packages with an installation footprint of 291 MB, and these packages must stay installed because the kernel module must be rebuilt for every kernel upgrade. That’s tough to swallow re: security hygiene. I don’t think there’s any getting around this. The module build and some of the package footprint could be offloaded to an assistant VM, but ultimately you’re going to be running the product of that build in your dom0.
Swap memory under ZFS
ZFS swap datasets are unsound under high memory contention: ZFS must allocate memory for metadata as part of every write, including swap writes, so there is a risk of OOM → swap offloading → ZFS allocation → but OOM! → doom.
- Related issue: Swap deadlock in 0.7.9 · Issue #7734 · openzfs/zfs · GitHub
- Summary: this appears to be an architectural problem in ZFS, perhaps solvable by pre-allocation of whatever ZFS metadata memory could be needed for managing swap. In any case, ZFS swap reliability requires upstream work.
It is not complicated to create a regular swap partition outside of the zpool and assign that for dom0’s use. But what about /dev/xvdc1 swap, in every running qube? Is that similarly risky if backed by ZFS? It is difficult to think through the possible memory contention scenarios, through the layers of abstraction/virtualization and my own partial understanding. Perhaps the risk is naturally mitigated by qube maxmem thresholds and qmemman, and by assigning extra slack memory to dom0 for ZFS’s use; or perhaps it’s more complicated.
- Plausible workaround for domU: `qubes-prefs default_pool_volatile` could be directed to a different ‘regular’ Qubes pool outside of ZFS? Is it as simple as that? How large should the pool be – `n * 1024M`, where `n` is the max number of VMs you expect to run concurrently?
- Drawback: loss of the data safety/integrity guarantees provided by ZFS vs ext4/xfs/etc.
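A hedged sketch of that workaround, assuming the `file-reflink` driver; the pool name and directory path are illustrative, not from my actual setup:

```shell
# Assumption: /var/lib/qubes-volatile is a non-ZFS (e.g. ext4/xfs) filesystem
# outside the zpool. Create a regular pool there and make it the default
# for volatile volumes.
qvm-pool add volatile-pool file-reflink -o dir_path=/var/lib/qubes-volatile
qubes-prefs default_pool_volatile volatile-pool
```

Presumably this affects newly created qubes; existing qubes would keep their current volatile pool until changed per-VM.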
For my present install I elected to YOLO dom0 swap and /dev/xvdc domU swap on ZFS, mostly because it was simplest to do, and rationalized with an intention to experiment with zram for less dependence on swap.
/boot on ZFS
It’s marginal, but I would prefer to move /boot to ZFS too, for checksumming and mirroring; I’m skeptical this will Just Work with coreboot+Heads, though.
- Related issue: Include ZFS · Issue #187 · linuxboot/heads · GitHub
- btrfs instead? but maybe not: Add BRTFS under Heads (recovery shell purpose) · Issue #1202 · linuxboot/heads · GitHub
Separate ZFS pools or separate datasets within a single ZFS pool
Under LVM, the dom0 root and the domU vm-pool are separate logical volumes. Under ZFS, the topology could be two partitions (perhaps under LVM) for two ZFS pools – one for dom0 root and another for the vm-pool – or it could be a single partition of a single zpool holding separate datasets for dom0 root and vm-pool. I think the single zpool approach is more efficient from a ZFS perspective, and the datasets can be tuned and quota’d individually just the same as separate zpools can. I’m not sure if there is more to consider here.
- Are there security implications to including dom0 and domU storage within a single ZFS pool?
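For concreteness, the single-zpool topology could be sketched like this (dataset names and quota values are illustrative):

```shell
# One zpool ("laptop"), with separate datasets for dom0 root and domU
# storage, each individually tunable and quota'd. Names/quotas illustrative.
zfs create -p -o quota=40G laptop/ROOT/os    # dom0 root dataset
zfs create -o quota=600G laptop/vm-pool      # domU storage dataset
qvm-pool add -o container=laptop/vm-pool vm-pool zfs
```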
SSDs and free space
SSDs perform better and last longer with some amount of free storage. In its default partitioning, QubesOS reserves 10% of the LUKS partition (so, a little less than 10% of the whole disk) as free space for this purpose. ZFS too performs better and is more reliable with some amount of free storage, and keeps its own reservation for this (IIUC, 1/(2^5) = ~3.1% of the pool by default).
- Is it redundant or is it helpful to have two separate reservoirs of free space? Could I forgo or reduce Qubes’s 10%?
- Then change ZFS’s `spa_slop_shift` to 3 so that `1/(2^3) = 12.5%` is reserved? (or 4, `1/(2^4) = 6.25%`?)
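The slop-space arithmetic above, as a quick sanity check (the fraction is simply `1 / 2**spa_slop_shift`; OpenZFS also applies absolute caps not modeled here):

```python
def slop_fraction(spa_slop_shift: int) -> float:
    # ZFS reserves roughly 1/2**spa_slop_shift of the pool as slop space
    return 1 / (2 ** spa_slop_shift)

for shift in (5, 4, 3):
    print(f"spa_slop_shift={shift}: {slop_fraction(shift):.3%} reserved")
# spa_slop_shift=5: 3.125% reserved  (the default)
# spa_slop_shift=4: 6.250% reserved
# spa_slop_shift=3: 12.500% reserved
```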
ZFS ARC and dom0 assigned memory
ZFS caches filesystem reads in its in-memory ARC, which by default can/will grow to >90% of system memory in contemporary ZFS. Under `free` this shows up as used rather than as buff/cache, but I am reassured that the cache memory is nonetheless reclaimable.
The ARC lives in dom0. Usually dom0 is assigned a fairly small slice of system memory, with the rest left for VMs. But to host the ARC, dom0 needs more memory. It’s also probably best to set a tighter limit on the max size of the ARC → less risk of OOM in dom0.
- Larger `dom0_mem=min:####M dom0_mem=max:####M` in `GRUB_CMDLINE_XEN_DEFAULT`. How much makes sense, I wonder? I am assigning a lot for now while I experiment with ARC sizes.
I am clamping the size of the ARC with a module config file:
/etc/modprobe.d/zfs.conf:
# ARC clamped to 3GB-4GB of memory
options zfs zfs_arc_max=4294967296
options zfs zfs_arc_min=3221225472
- Note: specify the max before the min, or it seems neither will take effect
- Still tuning the size. My first experiment clamped it to 4GB-6GB, and after a couple days of use metrics showed an ARC hit rate of 99.5%, which seems to me “too high”, hence the decreased range.
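The byte values in zfs.conf are easy to fat-finger, so here is a small helper that generates them from GiB figures (a convenience sketch of my own, not part of any Qubes/ZFS tooling):

```python
GiB = 2 ** 30

def zfs_conf_lines(arc_min_gib: int, arc_max_gib: int) -> list[str]:
    # Emit max before min, since ordering seems to matter (see note above).
    return [
        f"options zfs zfs_arc_max={arc_max_gib * GiB}",
        f"options zfs zfs_arc_min={arc_min_gib * GiB}",
    ]

print("\n".join(zfs_conf_lines(3, 4)))
# options zfs zfs_arc_max=4294967296
# options zfs zfs_arc_min=3221225472
```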
More:
- ZFS ate my RAM: Understanding the ARC cache | ~/git/blog
- What is a good value for zfs_arc_max? · openzfs/zfs · Discussion #14064 · GitHub
Defaults
The qubes pool created by `qvm-pool add -o container=<zpool name>/<dataset name> <qubes pool name> zfs` has `revisions_to_keep` of 1, while the default LVM Qubes vm-pool has `revisions_to_keep` of 2. Not sure if this is an intended difference.
The dataset that qvm-pool creates has these properties (on a zpool created with all defaults / no property overrides):
[user@dom0 ~]$ zfs get all <zpool>/<dataset>
# ... most lines snipped ...
# recordsize 128K
# compression on (implies lz4)
# atime on
# xattr sa
# copies 1
# dedup off
# acltype off
# relatime on
# encryption off
# direct standard
# org.qubes-os:part-of-qvm-pool true
Compression
On a system with between 100 and 200 VMs:
[user@dom0 ~]$ zfs get compressratio laptop laptop/ROOT/os laptop/dom0-swap laptop/vm-pool
NAME PROPERTY VALUE SOURCE
laptop compressratio 1.69x -
laptop/ROOT/os compressratio 1.68x -
laptop/dom0-swap compressratio 3.38x -
laptop/vm-pool compressratio 1.69x -
An overall compression ratio of 1.69x, meaning (I think) the data would have been 1.69x larger with no compression; equivalently, the compressed size is 1/1.69 ≈ 59% of the uncompressed size, a storage savings of 41%. Pretty good!
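The savings arithmetic, spelled out (just the identity savings = 1 − 1/compressratio):

```python
def savings_from_compressratio(ratio: float) -> float:
    # compressratio = uncompressed / compressed, so the on-disk size is
    # 1/ratio of the original and the savings are the remainder.
    return 1 - 1 / ratio

print(f"{savings_from_compressratio(1.69):.0%}")  # → 41%
```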
ZFS tuning
The zpool:
- I see `ashift=12` (4k pool sector size) recommended often, even for disks that report a 512-byte sector size. But I also see it said that for many NVMe SSDs that (honestly) report 512B, `ashift=9` (512B pool sector size) is most performant. It’s not clear to me if the potential performance gain/loss is worth the testing and tweaking, so I have left this at `ashift=0` (default, autodetect), which for me results in `ashift=9`, aligning with what my SSDs self-report.
- ZFS `encryption` is not well maintained and no longer recommended by ZFS gurus, so encryption is best handled by LUKS.
- `autotrim`?
The vm-pool dataset:
- Common-case IO is characterized by random reads and writes within large VM image files. So a `recordsize` of 16k or 128k would make more general sense than 1M, I think, except for the bulk copy that happens when a VM is cloned. However, individual VMs have different purposes and can have very different IO characteristics. What’s best for a vault qube is not what’s best for a torrenting qube or a server qube. I think this property can be tweaked per-VM within the same vm-pool, as each VM is itself a (sub)dataset.
- `atime=off` and `relatime=off`? Nothing I work with cares about file access timestamps, and this saves some writes, specifically the kind-of dubious IO path of a write induced by a read.
- Compression is on by default, `compress=lz4`. As shown above, this compression provides major space savings for VMs, with minimal processing cost.
- Deduplication through `dedup`, though better than it used to be, is still probably not worth the RAM/metadata and processing cost? Maybe it would have value in a templates-only vm-pool/dataset? (see below)
- `direct` is a newer IO feature said to be beneficial for NVMe and for virtualization, but maybe only for a carefully tuned workload? There is not much information on this yet. I am leaving it at its default (`direct=standard`).
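If each VM really is its own (sub)dataset, per-VM tuning might look like the following; the dataset names are illustrative, and I’d verify the actual layout with `zfs list -r` first:

```shell
# Illustrative names; assumes each VM has a child dataset under vm-pool.
zfs set recordsize=16K laptop/vm-pool/torrent-qube   # random-IO-heavy VM
zfs set recordsize=128K laptop/vm-pool/vault         # general-purpose VM
zfs get -r recordsize laptop/vm-pool                 # inspect inheritance
```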
The dom0 root dataset:
- In hour-to-hour use dom0 is mostly read-only, with trickles of writes for logging. Occasionally, package upgrades. Rarely, a bulk copy during template installation. If this is a good summary, there’s little to gain (and I guess little to lose) in tuning.
Template deduplication?
It’s a refrain in ZFS literature that dataset deduplication doesn’t provide the tradeoffs the admin hopes it will, outside pretty specific workloads. If dedup is on and the recordsize is small then as the data grows so does the memory+processing cost of maintaining the metadata - the tracking for every single small block. The costs grow supralinearly, I understand. Major space savings, but not for free.
I am curious if it could have useful application specifically for a TemplateVM dataset in the zpool, for users who have many custom templates. I have very many templates, most of which are dupes of each other modulo a few packages. If all of these templates, and only templates, and perhaps only the root images of the templates, were placed in a dataset with deduplication enabled and a large recordsize (1 MB or greater) then perhaps the costs would be workable.
- Large `recordsize` means fewer total blocks and less metadata tracking
- VM images are big single files, so no storage waste due to small files + large `recordsize`
- Template root images are mostly duplicate data, so few(er) unique blocks overall
- Templates are only written to during Qubes Update, so write performance isn’t that important, and the write amplification impact to SSDs from large `recordsize` would be mitigated by the hopefully much smaller total on-disk footprint.
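A sketch of that hypothetical templates-only pool, reusing the `qvm-pool` syntax from the Defaults section above (names are illustrative):

```shell
# Hypothetical dedup'd, large-recordsize dataset for templates only.
zfs create -o dedup=on -o recordsize=1M laptop/template-pool
qvm-pool add -o container=laptop/template-pool template-pool zfs
# Then move/clone templates into it, e.g. (template names illustrative):
# qvm-clone -P template-pool fedora-41 fedora-41-t
```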
On the other hand: like most users I have plenty of storage, and anyway compression is already providing good savings.
Misc notes
- Sometimes `qvm-remove` will leave an empty directory/dataset for the VM in the zpool. `zfs list` shows several `laptop/vm-pool/disp####`, for example. Also `laptop/vm-pool/.importing` and `laptop/vm-pool/.tmp` – perhaps leftovers from a `qvm-backup-restore` run?
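For cleaning these up, something like the following, with the obvious caution that `zfs destroy` is irreversible and the orphan’s name here is illustrative:

```shell
# List everything under the vm-pool and eyeball for orphans.
zfs list -r -o name laptop/vm-pool
# Destroy only after confirming the qube no longer exists in qvm-ls.
zfs destroy -r laptop/vm-pool/disp1234   # illustrative orphan
```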