@brendanhoar your opinion differs?
I’m not yet familiar with the Qubes btrfs pool implementation so I don’t know.
E.g. is it stacked such that VM volumes exist as reflinked files in an outer btrfs volume (i.e. using the file-reflink driver)? If so, perhaps offline dedupe is possible in that outer volume.
But even then, alignment and chunk-size issues might keep it from giving the expected results on diverging templates.
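To make the alignment concern concrete, here is a minimal sketch of fixed-size block matching. It is an illustration only, not how any particular deduplicator is implemented: the 4 KiB `BLOCK_SIZE`, the helper names `block_hashes` and `shared_fraction`, and the random stand-in "image" are all assumptions made up for this example.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> set:
    """Return the set of SHA-256 digests of each fixed-size block."""
    return {
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    }


def shared_fraction(a: bytes, b: bytes) -> float:
    """Fraction of b's distinct blocks that also occur in a."""
    ha, hb = block_hashes(a), block_hashes(b)
    return len(ha & hb) / len(hb)


# Random bytes stand in for a template image (hypothetical data).
base = os.urandom(64 * BLOCK_SIZE)

# Case 1: one block is overwritten in place, everything else stays aligned.
aligned_edit = bytearray(base)
aligned_edit[0:BLOCK_SIZE] = os.urandom(BLOCK_SIZE)

# Case 2: identical content, but shifted by a single byte.
shifted = b"\x00" + base

print("aligned edit:", shared_fraction(base, bytes(aligned_edit)))
print("shifted copy:", shared_fraction(base, shifted))
```

In the first case nearly every block still matches. In the second, the same bytes sit at different block boundaries, so a fixed-size block comparison finds essentially nothing to share.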
Any advice? I linked https://github.com/QubesOS/qubes-issues/issues/7009 to here and asked to have a bounty tag on the issue you opened.
@brendanhoar any insights/opinions/references on ZFS being unfit? Everything I read about it seems to say that it would fit QubesOS use case of multiple templates clones and specialisation, with no penalties if RAM is available when dedup is enabled on the pool.
And deduplication is applied pool-wide, and live (no need to apply dedup on offline volumes, since dedup is applied on writes, hence the memory costs).
[quote=“Insurgo, post:7, topic:12654”]
any insights/opinions/references on ZFS being unfit?[/quote]
Caveat: I have no direct experience.
ZFS deduplication has always been very memory intensive. Qubes has always been sensitive to memory constraints. Difficult mix to resolve.
Last I was aware, ZFS on Linux has to run in user space, which makes it very non-performant (read: very slow). This is mostly due to incompatible licensing, so ZFS had to be implemented as a userspace driver, which leads to a lot of extra security context switching overhead.
There may also be a third party group that has recently integrated it with the Linux kernel but that’s probably a license violation.
ZFS licensing and Linux is complicated. The general consensus is that distributing ZFS using DKMS (which builds the modules on the user’s machine) is legal. The question is whether distributing binary ZFS kernel modules is legal. That is a question for the ITL legal team to deal with, not me.
That said, ZFS deduplication requires so much memory that it may well not be an option in practice. It also seems to have horrible performance problems unless something has changed since that post was written.
VDO seems to have much better performance, and one could implement VDO on top of some LVM layers and below others. Unfortunately, VDO is currently only supported on RHEL-compatible kernels, not the ones used by Qubes OS. Furthermore, VDO has out-of-space handling problems.
I offered a bounty a long time ago, but I was told that project donations would be redirected to what the project considered appropriate rather than the feature I was going to fund.
Weird claims I see on this post, none of them sourced. Will correct the record on some now.
Re licensing issue: I don’t much care about the licensing problem — I don’t distribute compiled ZFS modules, so for me that’s a nonissue. I am also unaware of anyone actively violating the CDDL or the GPL by doing this.
Re performance: ZFS on Linux / OpenZFS is NOT a user space daemon. That would be ZFS-FUSE, another project, to which I contributed code.
Re deduplication: dedup requires very little memory vis a vis the amount of memory Qubes OS users have and storage size they have.
I still want the ZFS adapter for Qubes pools to be shipped with Qubes OS, whether most users use it or not. ZFS is the absolute best file system I have ever used (I’ve used many) and I don’t plan to move away from it anytime soon. It’s simple to manage (no “layers” shit, like LVM or VDO or LUKS), it’s performant enough, it has compression / snapshots / send+receive, it has been tested way more than any other file system on Earth… what’s not to like about these things?
@demi do you have a source? As stated in the OP, here are the sources I’ve found (not exhaustive, but I am trying to improve them here):
I agree with @Rudd-O:
If we are talking about 1 TB of non-deduplicated storage space consuming 1 GB of RAM, I would use that. If backing up and restoring also took only the deduplicated space, I would use that too.
On VDO and what I understood of it, it is deduplicating on block (filesystem) layer, not pool-wide:
From own quoted source:
WHAT IS THE DEDUP TABLE (DDT)?
AND… IS IT STORED IN RAM, OR ON DISK?
This is a common point of confusion.
When deduplication is used, the dedup table is part of the way that data is stored in the pool. ZFS uses a hashed list of blocks, to allow easy identification of duplicate blocks. In simple terms, to find an actual block of data on disk, ZFS uses the DDT as an extra step.
The DDT is a fundamental pool structure used by ZFS to track what blocks make up what files, when dedup is used. It’s as much a part of the pool as the dataset layout, the snapshot info, pointers to files, or the file date/time metadata. If you lose it, your pool is dead. If ZFS needs it, it reads it from the pool on demand, and uses the data contained to identify not just duplicate blocks, but also to find data on disk. So the dedup table is not a cache or an extra (like ZIL or L2ARC) that gets stored in RAM and, if we lose it, too bad. If DDT data is in RAM or L2ARC, it’s only there temporarily, like any other in-use pool data.
In other words, for all practical purposes you can think about ZFS handling dedup metadata identically to any of that sort of stuff, if that helps. It’s integral to the pool. And the pool won’t work well if it can’t access the DDT fast, when needed.
Modifying the OP accordingly, along with the posts saying that 1 TB of deduplicated storage needs 1 GB of RAM. Basically, as of now, @Rudd-O knows best, being a ZFS user.
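For anyone who wants to sanity-check RAM figures like the one above, here is a back-of-the-envelope sketch. It assumes one DDT entry per unique block and uses the commonly cited rule of thumb of roughly 320 bytes per in-memory entry. Both the `bytes_per_entry` default and the block sizes tried are assumptions for illustration, not measurements from a Qubes system, and the helper name `ddt_size_bytes` is made up.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def ddt_size_bytes(unique_data_bytes: int,
                   avg_block_bytes: int,
                   bytes_per_entry: int = 320) -> int:
    """Estimate dedup table size, assuming one entry per unique block.

    bytes_per_entry=320 is a rule-of-thumb assumption, not a guarantee.
    """
    unique_blocks = -(-unique_data_bytes // avg_block_bytes)  # ceiling division
    return unique_blocks * bytes_per_entry


# How the estimate moves with the average block size (values are examples).
for block_kib in (8, 64, 128):
    size = ddt_size_bytes(1 * TIB, block_kib * 1024)
    print(f"{block_kib:>3} KiB blocks: ~{size / GIB:.2f} GiB of DDT per TiB of unique data")
```

The point of the sketch is that the estimate scales inversely with block size, so any "X GB of RAM per TB" figure only makes sense together with the block size it assumes.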
Would you mind sharing your experience and use cases? How does it behave for the intended Qubes use case (deduped templates or specialized clones)?
Also, @Rudd-O, any comment on the performance hit for combined reads and writes described in the referred article, where the writer says that special SSDs are required (and not to go for Samsung EVO/Pro devices)?
Author recommends optane SSDs:
Optane and pure battery backed RAM cards only.
I should clarify: That’s nothing to do with SSDs having too-small DRAM or SLC cache. It’s inherent in the SSD NVRAM chips themselves. Because it’s nothing to do with the device cache type or size, a “better” SSD or one with “better” or no cache, won’t help much.
That is a very valid trade-off (to each their own) and would definitely justify supporting it, especially on high-end systems with 32GiB or more of RAM.
VDO is block-layer deduplication. One puts it below filesystems or LVM thin provisioning, and above encryption and/or RAID. Therefore, it works across volumes. One would need to patch LVM to allow VDO to be used as a thin-pool data volume, however, as LVM does not allow this because of out-of-space recovery problems. Recovery would be very difficult and would at a minimum require adding more storage to the VDO volume, which can never be removed.
If this statement is accurate, then I do not believe the Qubes team can recommend deduplication. The Optane SSDs used by the author of that post have been discontinued, and the only pure-X-Point storage available is the enterprise drives. Those are extremely expensive, so I doubt they will be an option for the majority of people.
@Demi the articles contradict themselves. The article you referred to also discussed old OpenZFS implementations and SSDs from before 2020 (which also explains the recommendation of hardware that no longer exists). Mixed random reads and writes are more common, but again, I’m no expert here, though I could experiment. And again, the author’s use case involves multiple terabytes of data, which is not the Qubes use case at all.
At the current state of knowledge gathering, OpenZFS still seems to be the best candidate for the live, pool-level deduplication use case.
That is, cloning templates and specializing the clones without growing the required disk space would be covered. Let’s say that the pool for Templates+Standalones could even be separated from vm-pool, if that idea eases your mind. As a starter, that would mean that, at installation, whonix-workstation and whonix-gateway would be mostly deduplicated instead of duplicated+differences.
Then extend that concept to having minimal-debian and minimal-fedora deployed, and then specialized. As I see it, we could reduce template costs by a huge ratio just on that level alone.
As for qube clones, I make a lot of them for development, with a lot of built artifacts that I can reuse and also specialize. Cloning costs nothing today, but costs explode afterwards as the origin and its clones diverge, especially since the origin also changes over time. Above all, I can no longer use the Qubes backup tool except for specific needs, for the same reason: restoring backups explodes the consumed space. I now use wyng-backup instead, with scripts that clone from something close to the target and then inject the changes from wyng-backup, which also dedups on send, saving space on both sides: in the backups and in the restored pool.
I would be really curious to experiment with OpenZFS and live dedup for Qubes OS. I would probably use it myself, since it would let me keep clones on disk longer without needing to move them away with wyng-backup.
That means enabling this at the pool level (which can be done after deployment) would save space at the cost of CPU and memory, and only when enabled.
I don’t use dedup in Qubes. No need to:
- Storage is aplenty.
- In a Qubes OS use case, both the file pool driver and a hypothetical ZFS pool driver clone disk images of VMs, meaning the only additional storage you’d use is always reduced to a delta between the base image and anything you’ve written to it.
That said, if you were to e.g. store the same ISO image in the /rw partition of, say, five VMs — I donno why you’d do that, but you could — dedup would ensure you’d use 1x the ISO size on disk plus the minimal space to account for the 5X block references (tiny).
Also, you absolutely don’t need fancy SSDs or HDDs to use ZFS. The only use case where fancy SSDs will be of benefit, is in extremely intensive read+write workloads, or perhaps forced sync workloads. Your data is safe (up to the last 30 seconds) under normal workloads, even if your SSD is not battery-backed, thanks to ZFS transactions properly committing dirty buffers to the ZIL every 30 seconds. Your data is safe in all sync workloads because it’s all force-written immediately to the ZIL. The only case your data is not safe (with any file system) is when the disk lies to the OS after the OS has told the disk “commit this immediately”. Don’t buy shitty disks!
I have over 20G of usable disk space across all of my machines, all running with ZFS, some with HDDs, some with SSDs, some with a combination thereof (actual physical disk space is much bigger because all of my workloads are redundant). Never ever have I experienced any data loss since I started using ZFS.
VDO sucks (primarily but not only) because it’s yet another layer in a stack of multiple layers you have to manage. It might or might not actually deduplicate (I tend to believe it does), but the whole thing of piling this and that layer atop the previous layers (disk + lvm pv + lvm vg + lvm lv + vdo + ssd cache + fs)… sorry, makes me want to puke. I much prefer “zpool create” then “zfs create” — bing bang boom you got yourself a file system that can snapshot, deduplicate, checksum, compress, write transactional logs to NVDIMM/SSD, even cache on them.
A note on deduplication:
Deduplication is almost always going to be inferior to e.g. cp --reflink or zfs clone. Why? Because with deduplication, to dedup X blocks, the computer has to read all X blocks, checksum them, then write all X blocks (all with a roundtrip to userspace), after which the file system / device driver can then say “oh, I’ve seen dis b4, lemme jest create a ref to those blocks”. That is a hyoooooge waste of compute, which is only justified if your workload cannot be organized in any other way! If you can get away with cp --reflink or zfs clone (this in particular I expect in a hypothetical ZFS pool driver for qubes), you should almost certainly use that.
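The difference in work between the two approaches can be sketched with a toy in-memory block store. This is a simplified model, not how ZFS, btrfs, or any Qubes pool driver is implemented. The class `ToyBlockStore`, its method names, and the 4 KiB block size are all hypothetical.

```python
import hashlib


class ToyBlockStore:
    """Toy content-addressed store, for illustration only."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks = {}        # digest -> block bytes (each stored once)
        self.volumes = {}       # volume name -> ordered list of digests
        self.hashed_blocks = 0  # counts how many blocks had to be hashed

    def write_dedup(self, name: str, data: bytes) -> None:
        """Deduplicating write: every incoming block is hashed and looked up."""
        refs = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).digest()
            self.hashed_blocks += 1
            self.blocks.setdefault(digest, block)  # store only if unseen
            refs.append(digest)
        self.volumes[name] = refs

    def clone(self, src: str, dst: str) -> None:
        """Clone: copy the reference list only; no data is read or hashed."""
        self.volumes[dst] = list(self.volumes[src])

    def read(self, name: str) -> bytes:
        return b"".join(self.blocks[d] for d in self.volumes[name])


store = ToyBlockStore()
# Four distinct 4 KiB blocks stand in for a VM image (made-up data).
image = b"".join(bytes([i]) * 4096 for i in range(4))

store.write_dedup("template", image)   # first write: hashes 4 blocks
store.clone("template", "clone-a")     # clone: hashes nothing
store.write_dedup("copy-b", image)     # full copy: hashes 4 blocks again

print("blocks hashed:", store.hashed_blocks)
print("unique blocks stored:", len(store.blocks))
```

Both the clone and the deduplicated copy end up sharing the same stored blocks, but only the deduplicated copy had to process every block again to find that out.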
This gets to the crux of the issue: the amount of storage “lost” due primarily to updates to diverging clones of the same Qubes template is rather small. If this concern is the primary driver for wanting deduplication, I think it’s misguided.
Also, with R4.1 of Qubes the new qvm-template-gui tool even lets you know when a more recent build (baseline) of a template is available.
That’s a good opportunity to replace the template and rebuild your forks [via salt (advanced), bash script (intermediate), manual (beginner)]. If you do so and remove your older forks you’ll net gain some space back, plus have a cleaner template set.
Which reminds me, today would be a good time to set up a script to customize a print-template fork for using our laser printer via USB, so that I don’t have to do it manually next time. I’m getting a lot of requests to print out Wall-E coloring pages this week…and my client is quite demanding on timelines (he’s four years old).
Maybe my usage is not representative (clones of qubes with lotsa build artifacts, to be specialized to economize building time, plus specialized templates).
Here again, it seems that the recommended way is again to:
And have those qubes and unused (bit-rotting) templates backed up with wyng-backup (deduped on send), where the user needs to be careful at restore (since the restored data is not deduped).
diverging clones of the same Qubes template
I have the base Fedora X template that Qubes ships, and a full clone of the template which is my main and only template, because I don’t need more. I could at any point in time erase the template that Qubes OS ships, and I would reclaim what, 5 GB? My actually-used template is so far updated that pretty much nothing is in common with the base Fedora X template, so dedup would gain me exactly zero bytes, and waste some RAM for no good reason.
Currently, dedup is probably not worth using (in particular with a proper ZFS pool driver) except in niche cases. I could see how dedup would be useful after Qubes 4.1 when the file driver is deleted and the only thing remaining is the file-reflink driver (which under Qubes would make full copies of VM images, which then would benefit from dedup), but using the file-reflink driver with ZFS will be unbearable simply because it would take forever just to start a VM (ZFS master does not support reflink copies right now, and support for reflink is currently under a lengthy review process).
What I understand as of today is that deduplication would address our diverging use cases, although there is no consensus on that. demi would benefit from it (clones of qubes, clones of templates, and restoring templates and qubes from wyng-backup). Some of us are not cloning templates. There is not even consensus on the benefits of deduplication vs reflinks in this thread.
I understand that OpenZFS pool support would be a matter of simplicity vs current LUKS+VG+LVM to obtain similar results without current complexity.
I understand as well, from my reading, that VDO would add even more complexity to the current setup and would not be directly usable from the Qubes perspective either (and from my reading VDO was unfit in the documented use cases where it was applied in LVM, not at pool level, but I might have missed something).
So the conclusion as of today of this thread is that it is better to see Templates as being short-lived, with or without specializations. And if specialization is desired, salting them is the best avenue.
The same applies to qubes and qube clones for specific use cases. I tend to agree here as well: most of my qubes are cloned for short-lived scenarios. And if I really need to keep track of a specific case, wyng-backup comes to the rescue. I personally prefer to have a limited number of qubes and work scenarios in the pool, while my tracked states are deduped in my backups, which keep a really long trace of states and deduplications.
I see no benefit in keeping my fedora-34 template in the pool and trying to dedup it. Of course, all of its packages now differ from those in fedora-36. But I keep them all, deduped, in my wyng-backup. Yes, deduplication on send is expensive (all hashes must be kept and compared before writing, so that an already existing block is referenced instead of written again), but those backups are done offline, when needed. If I wanted quick backups, my usage of wyng-backup would change as well, and I would prune old states.
So basically, from this thread, as of today, my own interest in deduplication has lowered a bit. The idea of deploying it by default has just vanished, and even having it enabled by users with similar use cases seems questionable, if the choice is between learning the requirements for benefiting from deduplication and learning salt and salted deployments.
Trying to resolve Qubes traditional use cases seems to result, most of the time, in the realization that there is no such traditional use case.
Still, I thought, from the documentation guidelines, that cloning and specializing templates was what everyone was doing. Like I said before, just deploying whonix-wk and whonix-gw would save 1 GB at install of the initial templates. And installing fedora-36 today, even salting the deployment to specialize templates (communication, proprietary, basic, etc.), will consume redundant (and deduplicable) space until fedora-37. Then applying salt recipes to specialize will consume lots of space until the next release, and so on.
I still think that btrfs with bees ( https://github.com/Zygo/bees ) is the direction one should look into if you really need dedupe on Qubes today.
btrfs on Qubes uses the file-reflink driver so the above will likely work.
I can’t speak to whether you might need to modify the dom0 kernel to complete the solution but if you do you might be able to get Qubes development team to take a PR.
@brendanhoar Really interesting. Will have to test. Modified OP to reflect this as the best solution for users cloning qubes and specializing templates. (Not sure how to mark a post as solution, but above reply would be the most practical one (if working) @deeplow )