Pool level deduplication?

EDIT: See the bottom of this post for pointers to discussion points.
Conclusion: deduplication does not seem to be desired as a default option to resolve the space consumed by clones over the lifetime of a Qubes installation.

What is recommended here, again, is to salt template specialization if it is required, or to minimize template specialization so redundant space consumption does not become a problem over time. Keep in mind that installing fresh would eliminate the problem altogether.

Cloning templates should again be seen as a place for experiments. Bringing the result of a successful experiment back into the origin, then deleting the clone, is the desired outcome for space effectiveness. Best of all: that salt recipe should probably be shared with the community. That seems to be the conclusion of this thread: deduplication might be costly for its benefit, while its benefits would mostly be felt by people not managing their qubes/templates/clones effectively over time. Restarting fresh is still the best advice here.


This post will be edited multiple times. I am trying to wrap my head around what kind of optimizations, or filesystem/pool choices, could land under Qubes for users following the actual Qubes best practice of cloning templates to specialize their usage, without incurring ever-growing storage costs when those clones are long-lived, receive the same package updates across all clones, and naturally diverge over time since each fills a different use case.

I thought this post was a relevant place to lay out the problem of doing so:

But I seem to have confused the OP, and the answer I got is basically that LVM is unfit and requires the user to discard specialized templates and restart from scratch using scripts/salt:

Yes, a caching proxy (apt-cacher-ng) can be used to download the same package update once and install it multiple times, but the result is that thin LVs are not really “thin” anymore: across the thin LVM pool, the volumes (the specialized clones) consume the same space multiple times.

Following that guideline in the traditional, default installation results in clones that have no cost at all at moment 0. But from the moment the origin of the clone and the clones of the clones receive updates, each volume stores the same updated data separately, and the combined space consumed multiplies with every clone.
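A quick way to see this happening in dom0 is to watch each thin volume’s allocated data before and after updating the clones (rough sketch; qubes_dom0 is the default VG name and the vm-*-root naming is the default layout, so adjust to your install):

    # dom0: show thin volumes with their size and allocated data percentage
    sudo lvs -o lv_name,lv_size,data_percent,pool_lv qubes_dom0 | grep -- -root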

From my current understanding (this is why I opened the discussion), LVM has no native inter-LV (pool-level) deduplication.

So the question here is: are there more effective, and desirable, alternatives that do not require end users to scrap their templates and start from scratch all the time?

Here are some notes on current research. @brendanhoar, maybe this is a better place to share your filesystem/pool opinion (@Demi, did you get answers in that area in your quest?).

Do other solutions exist?


Notes from the current threads:


The general advice to mitigate the need for pool-level, automatic and live deduplication of redundant data between a pool’s volumes is still:

  • Note down manual package deployments on top of freshly downloaded templates, script those additions, or best: salt those customizations so that starting fresh is always an option (a minimal sketch follows this list). I tend to agree as well… Since the Q4.1 release, we went from fedora-34 to fedora-35 to fedora-36 in just a couple of months. This is tiresome, and I am also personally learning salt now and pushing to have salt repositories. I think this might be a better investment for the whole community in the long run than trusting that dedup might help, while it seems it might not across different Qubes use cases.
  • The big warning in all deduplication documentation I have read so far (all the bold warnings in this old article still seem valid) is: make sure your use case will benefit from deduplication before activating it. Otherwise its costs are far bigger than its benefits, and going back also requires recreating the pool.
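As a minimal sketch of what “starting fresh from salt” can look like in dom0 (the clone name and the my-customizations state are hypothetical; the state would live under /srv/user_salt):

    # dom0: clone the freshly installed template, then reapply the salted customizations
    qvm-clone fedora-36 fedora-36-custom
    sudo qubesctl --skip-dom0 --targets=fedora-36-custom state.apply my-customizations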

Current conclusion: btrfs with on-demand dedup seems to be the best fit for users cloning qubes and specializing templates over the lifetime of a Qubes release.


A handful of comments, with the assumption that general space usage discussion is useful feedback (let me know if you feel it is off-topic @Insurgo).

  1. The amount of additional pool space “unnecessarily” used due to updates in templates tends to be very small for the vast majority of default packages. The one exception I can think of is the in-template kernel updates which are currently entirely unused by most Qubes users, due to the prevalence of the default of using dom0-provided kernels.

Locking dnf/apt to the default in-template kernel package might be the best way to save space on updates. There might be side effects though, perhaps requiring an unlock when a dependency needs a newer kernel (rough sketch after this list).

  2. Cleaning the local dnf/apt cache immediately after a successful update may help (at the cost of a longer update afterward; mitigate with a caching proxy).

  3. Keeping in-template logs clean seems like a way to reduce template bloat. I do a sudo journalctl --flush --rotate --vacuum-files=1 && sudo fstrim -av after each successful update, before shutdown.

  4. As you mentioned, the biggest exposure to bloat is restoring from a Qubes Backup tool backup, which re-duplicates everything that was shared (pre-deduped) at thin volume creation.
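Rough sketch of the kernel-lock idea from point 1 (package names are illustrative, and the dnf versionlock plugin package name may differ per Fedora release):

    # Debian-based template: keep apt from pulling newer in-template kernels
    sudo apt-mark hold linux-image-amd64
    # Fedora-based template: pin the currently installed kernel packages
    sudo dnf install python3-dnf-plugin-versionlock
    sudo dnf versionlock add 'kernel*'
    # Undo later if a dependency genuinely needs a newer kernel
    sudo apt-mark unhold linux-image-amd64
    sudo dnf versionlock clear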

As of today there’s no Qubes-friendly way to dedupe LVM, but there may be some offline ways in the (near?) future. VDO has multiple issues that make it less than ideal for Qubes. ZFS’s dedupe capabilities are very dependent on installed memory, which is already a premium resource in Qubes. btrfs might be a direction to look in?

B


Seems like btrfs falls into the VDO category, where dedup is within a single filesystem and not pool-wide? @brendanhoar?

@brendanhoar, does your opinion differ?

I’m not yet familiar with the Qubes btrfs pool implementation so I don’t know.

E.g. is it stacked such that VM volumes exist as reflinked files in an outer btrfs volume (i.e. using the file-reflink driver)? If so, perhaps offline dedupe is possible in that outer volume.

But even so, there are alignment and chunk-size issues at play that might not give the expected results on diverging templates.

B

@Rudd-O ?

Any advice? I linked https://github.com/QubesOS/qubes-issues/issues/7009 here and asked for a bounty tag to be added to the issue you opened.

@brendanhoar any insights/opinions/references on ZFS being unfit? Everything I read about it seems to say that it would fit the Qubes OS use case of multiple template clones and specialization, with no penalties if RAM is available when dedup is enabled on the pool.

And deduplication is applied pool-wide, and live (no need to apply dedup on offline volumes, since dedup is applied on writes, hence the memory costs).
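For reference, on ZFS that would be a single property toggle on an existing pool or dataset (names below are illustrative), and only blocks written afterwards go through the dedup table:

    # Enable dedup for everything under an assumed rpool/qubes dataset
    sudo zfs set dedup=on rpool/qubes
    # Check the resulting pool-wide dedup ratio
    sudo zpool list -o name,size,allocated,free,dedupratio rpool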

Quoting @Insurgo: “any insights/opinions/references on ZFS being unfit?”

Caveat: I have no direct experience.

  1. ZFS deduplication has always been very memory intensive. Qubes has always been sensitive to memory constraints. Difficult mix to resolve.

  2. Last I am aware, ZFS on Linux has to run in user space, which makes it very non-performant (read: very slow). This is mostly due to incompatible licensing: ZFS had to be implemented as a userspace driver, which leads to a lot of extra context-switching overhead.

  3. There may also be a third-party group that has recently integrated it with the Linux kernel, but that’s probably a license violation.

B

ZFS licensing and Linux is complicated. The general consensus is that distributing ZFS using DKMS (which builds the modules on the user’s machine) is legal. The question is whether distributing binary ZFS kernel modules is legal. That is a question for the ITL legal team to deal with, not me.

That said, ZFS deduplication requires so much memory that it may well not be an option in practice. It also seems to have horrible performance problems unless something has changed since that post was written.

VDO seems to have much better performance, and one could implement VDO on top of some LVM layers and below others. Unfortunately, VDO is currently only supported on RHEL-compatible kernels, not the ones used by Qubes OS. Furthermore, VDO has out-of-space handling problems.

I offered a bounty a long time ago, but I was told that project donations would be redirected to what the project considered appropriate rather than the feature I was going to fund.

I see weird claims in this post, none of them sourced. I will correct the record on some now.

Re licensing issue: I don’t much care about the licensing problem — I don’t distribute compiled ZFS modules, so for me that’s a nonissue. I am also unaware of anyone actively violating the CDDL or the GPL by doing this.

Re performance: ZFS on Linux / OpenZFS is NOT a user space daemon. That would be ZFS-FUSE, another project, to which I contributed code.

Re deduplication: dedup requires very little memory vis-à-vis the amount of memory Qubes OS users have and the storage sizes they have.
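If you want to check that on your own data before enabling anything, ZFS can simulate the dedup table on an existing pool (pool name below is just an example) and report the expected ratio and DDT footprint:

    # Simulate dedup over existing pool data without enabling it (can take a while)
    sudo zdb -S rpool
    # Once dedup is actually enabled, inspect the live DDT statistics
    sudo zpool status -D rpool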

I still want the ZFS adapter for Qubes pools to be shipped with Qubes OS, whether most users use it or not. ZFS is the absolute best file system I have ever used (I’ve used many) and I don’t plan to move away from it anytime soon. It’s simple to manage (no “layers” shit, like LVM or VDO or LUKS), it’s performant enough, it has compression / snapshots / send+receive, it has been tested way more than any other file system on Earth… what’s not to like about these things?


@Demi do you have a source? As stated in the OP, here are the sources I’ve found (not exhaustive, but I’m trying to enrich them here):

I agree with @Rudd-O:

If we are talking about 1 TB of non-deduplicated storage space consuming 1 GB of RAM, I would use that. If backing up/restoring would also take only deduped space, I would use that too.

As for VDO, from what I understood of it, it deduplicates at the block (filesystem) layer, not pool-wide:

From my own quoted source:

WHAT IS THE DEDUP TABLE (DDT)?
AND… IS IT STORED IN RAM, OR ON DISK?

This is a common point of confusion.

When deduplication is used, the dedup table is part of the way that data is stored in the pool. ZFS uses a hashed list of blocks, to allow easy identification of duplicate blocks. In simple terms, to find an actual block of data on disk, ZFS uses the DDT as an extra step.

The DDT is a fundamental pool structure used by ZFS to track what blocks make up what files, when dedup is used. It’s as much a part of the pool as the dataset layout, the snapshot info, pointers to files, or the file date/time metadata. If you lose it, your pool is dead. If ZFS needs it, it reads it from the pool on demand, and uses the data contained to identify not just duplicate blocks, but also to find data on disk. So the dedup table is not a cache or an extra (like ZIL or L2ARC) that gets stored in RAM and, if we lose it, too bad. If DDT data is in RAM or L2ARC, it’s only there temporarily, like any other in-use pool data.

In other words, for all practical purposes you can think about ZFS handling dedup metadata identically to any of that sort of stuff, if that helps. It’s integral to the pool. And the pool won’t work well if it can’t access the DDT fast, when needed.

Modifying the OP accordingly, pointing to the posts saying 1 TB of deduped storage data needs 1 GB of RAM. Basically, as of now, @Rudd-O knows best, being a user of ZFS.

Mind sharing your experience and use cases? How does it behave in the Qubes intended use case (deduped templates, or specialized clones)?

Also, @Rudd-O, any comment on the performance hit for combined reads and writes from the referenced article, where the writer says that special SSDs are required (and not to go for Samsung EVO/Pro devices)?

The author recommends Optane SSDs:

Optane and pure battery-backed RAM cards only.
I should clarify: That’s nothing to do with SSDs having too-small DRAM or SLC cache. It’s inherent in the SSD NVRAM chips themselves. Because it’s nothing to do with the device cache type or size, a “better” SSD or one with “better” or no cache, won’t help much.

That is a very valid trade-off (to each their own :slight_smile:) and would definitely justify supporting it, especially on high-end systems with 32GiB or more of RAM.

VDO is block-layer deduplication. One puts it below filesystems or LVM thin provisioning, and above encryption and/or RAID. Therefore, it works across volumes. One would need to patch LVM to allow VDO to be used as a thin-pool data volume, however, as LVM does not allow this because of out-of-space recovery problems. Recovery would be very difficult and would at a minimum require adding more storage to the VDO volume, which can never be removed.
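For context, the supported standalone way to stack VDO under LVM looks roughly like this (VG, LV names, and sizes are illustrative, and the vdo tools package is assumed to be installed); this is distinct from the thin-pool stacking described above, which LVM refuses:

    # Create a VDO pool LV plus an overcommitted logical volume on top of it
    sudo lvcreate --type vdo --name vdo_lv --size 100G --virtualsize 300G vg_qubes/vdopool0
    # Report physical space used vs. logical space saved by dedup/compression
    sudo vdostats --human-readable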

If this statement is accurate, then I do not believe the Qubes team can recommend deduplication. The Optane SSDs used by the author of that post have been discontinued, and the only pure-X-Point storage available is the enterprise drives. Those are extremely expensive, so I doubt they will be an option for the majority of people.


@Demi, the articles contradict themselves. The article you referred to also talked about old OpenZFS implementations and referred to pre-2020 SSDs (which explains the recommendation for hardware that no longer exists). Mixed random reads and writes are more common, but then again, I’m no expert here, though I could experiment. And again, his use case is multiple terabytes of data, which is not the Qubes use case at all.

@Rudd-O I see you restated your desire to fund a pool driver for OpenZFS under Qubes. Looking forward to reading your opinion here, and glad you replied on the opened PR and related issue.

@Demi @Rudd-O @brendanhoar do we know other filesystem/pool-knowledgeable people who should be tagged here to enrich the discussion?

At the current state of knowledge gathering, it still seems that OpenZFS would be the best candidate for the live, pool-level deduplication use case.

That is, taking template clones and their specializations without growing the required disk space would be covered. That pool could even be separate, keeping Templates+Standalones apart from the vm-pool, if that idea eases your mind. As a starter, that would mean, at installation, that whonix-workstation and whonix-gateway would be mostly deduplicated instead of duplicated+differences.
Then extend that concept to having minimal-debian and minimal-fedora deployed and then specialized. As I see it, we could reduce template costs by a huge ratio at that level alone.

As for qube clones, I do that a lot for development, with a lot of built artifacts I can reuse and also specialize. There is no cost at cloning today, but costs explode after cloning, from the divergence between origin and clones, where the origin also changes over time: that is where costs explode. Plus, and foremost, I cannot go back to using the Qubes backup tool for specific needs anymore for the same reasons: restoring backups explodes the consumed space. I now use wyng-backup instead, with scripts to clone from something close to the target, then inject changes from wyng-backup, which also dedups on send, economizing on both sides: backup space and restored (pool) space.
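A rough sketch of that wyng-backup flow from memory (the verbs and the volume name here are assumptions on my side; check the wyng documentation for the exact invocation and archive setup):

    # dom0: register a qube's private volume with an existing wyng archive, then send a deduplicated snapshot
    sudo wyng add vm-devel-private
    sudo wyng send vm-devel-private
    # Later, restore that volume back into the thin pool
    sudo wyng receive vm-devel-private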

I would be really curious to experiment with OpenZFS and live dedup for Qubes OS, and would probably use it myself, being able to keep clones on disk longer without needing to move them away with wyng-backup.

That means that having this activated at the pool level (which can be done on an existing pool after deployment) would show space gains, with a tradeoff in CPU/memory only while activated.

I don’t use dedup in Qubes. No need to:

  1. Storage is aplenty.
  2. In a Qubes OS use case, both the file pool driver and a hypothetical ZFS pool driver clone disk images of VMs, meaning the only additional storage you’d use is always reduced to a delta between the base image and anything you’ve written to it.

That said, if you were to e.g. store the same ISO image in the /rw partition of, say, five VMs — I donno why you’d do that, but you could — dedup would ensure you’d use 1x the ISO size on disk plus the minimal space to account for the 5X block references (tiny).

Also, you absolutely don’t need fancy SSDs or HDDs to use ZFS. The only use case where fancy SSDs will be of benefit, is in extremely intensive read+write workloads, or perhaps forced sync workloads. Your data is safe (up to the last 30 seconds) under normal workloads, even if your SSD is not battery-backed, thanks to ZFS transactions properly committing dirty buffers to the ZIL every 30 seconds. Your data is safe in all sync workloads because it’s all force-written immediately to the ZIL. The only case your data is not safe (with any file system) is when the disk lies to the OS after the OS has told the disk “commit this immediately”. Don’t buy shitty disks!

I have over 20G of usable disk space across all of my machines, all running with ZFS, some with HDDs, some with SSDs, some with a combination thereof (actual physical disk space is much bigger because all of my workloads are redundant). Never ever have I experienced any data loss since I started using ZFS.

VDO sucks (primarily but not only) because it’s yet another layer in a stack of multiple layers you have to manage. It might or might not actually deduplicate (I tend to believe it does), but the whole thing of piling this and that layer atop the previous layers (disk + lvm pv + lvm vg + lvm lv + vdo + ssd cache + fs)… sorry, makes me want to puke. I much prefer “zpool create” then “zfs create” — bing bang boom you got yourself a file system that can snapshot, deduplicate, checksum, compress, write transactional logs to NVDIMM/SSD, even cache on them.
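To illustrate the comparison (device and pool names are made up, and this is a sketch rather than a recommended Qubes layout):

    # One command to make the pool, one to make a compressed dataset inside it
    sudo zpool create rpool /dev/nvme0n1p3
    sudo zfs create -o compression=lz4 rpool/qubes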

A note on deduplication:

Deduplication is almost always going to be inferior to e.g. cp --reflink or zfs clone. Why? Because with deduplication, to dedup X blocks, the computer has to read all X blocks, checksum them, then write all X blocks — all with a roundtrip to userspace — after which the file system / device driver can then say “oh, I’ve seen dis b4, lemme jest create a ref to those blocks”. That is a hyoooooge waste of compute, which is only justified if your workload cannot be organized in any other way! If you can get away with cp --reflink or zfs clone (this in particular I expect in a hypothetical ZFS pool driver for qubes), you should almost certainly use that.
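For completeness, those two cheap alternatives look like this (file paths and dataset names are illustrative):

    # Reflink copy on btrfs/XFS: the new file shares all blocks until either side is modified
    cp --reflink=always fedora-36-root.img fedora-36-custom-root.img
    # ZFS equivalent: snapshot the origin and clone it; the clone stores only its deltas
    sudo zfs snapshot rpool/qubes/fedora-36-root@base
    sudo zfs clone rpool/qubes/fedora-36-root@base rpool/qubes/fedora-36-custom-root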

This gets to the crux of the issue: the amount of storage “lost” due primarily to updates to diverging clones of the same Qubes template is rather small. If this concern is the primary driver for wanting deduplication, I think it’s misguided.

Also, with R4.1 of Qubes the new qvm-template-gui tool even lets you know when a more recent build (baseline) of a template is available.

That’s a good opportunity to replace the template and rebuild your forks [via salt (advanced), bash script (intermediate), manual (beginner)]. If you do so and remove your older forks you’ll net gain some space back, plus have a cleaner template set.

Which reminds me, today would be a good time to set up a script to customize a print-template fork for using our laser printer via USB, so that I don’t have to do it manually next time. I’m getting a lot of requests to print out Wall-E coloring pages this week…and my client is quite demanding on timelines (he’s four years old).

B

Maybe my usage is not representative (clones of qubes with lotsa build artifacts, to be specialized to economize building time, plus specialized templates).

Here again, it seems that the recommended way is to:

And have those qubes and unused (bit-rotting) templates backed up with wyng-backup (deduped on send), where the user needs to be conscious at restore time (since the restore is not deduped).

diverging clones of the same Qubes template

Precisely.

I have the base Fedora X template that Qubes ships, and a full clone of that template which is my main and only template, because I don’t need more. I could at any point in time erase the template that Qubes OS ships, and I would reclaim what, 5 GB? My actually-used template is so far updated that pretty much nothing is in common with the base Fedora X template, so dedup would gain me exactly zero bytes, and waste some RAM for no good reason.

Currently, dedup is probably not worth using — in particular with a proper ZFS pool driver — except in niche cases. I could see how dedup would be useful after Qubes 4.1, when the file driver is deleted and the only thing remaining is the file-reflink driver (which, on a filesystem without reflink support such as ZFS today, would fall back to full copies of VM images, and those would then benefit from dedup), but using the file-reflink driver with ZFS will be unbearable simply because it would take forever just to start a VM (ZFS master does not support reflink copies right now, and support for reflink is currently under a lengthy review process).


@Demi @Rudd-O @brendanhoar: I have modified the OP to point to the comments that happened afterwards. If you disagree with my short summaries of what you said, please correct me where I went wrong.

What I understand as of today is that deduplication would resolve (there is no consensus here) our diverging use cases. @Demi would benefit from it (clones of qubes, clones of templates, restores of templates and qubes via wyng-backup restoration). Some of us are not cloning templates. There is not even consensus on the benefits of deduplication vs. reflinks in this thread.

I understand that OpenZFS pool support would be a matter of simplicity vs. the current LUKS+VG+LVM stack, obtaining similar results without the current complexity.

I understand as well, from my reading, that VDO would add even more complexity to the current setup and would not be directly usable from the Qubes perspective either (and from my reading, VDO was unfit in the documented use cases, where it was applied per LVM volume rather than at the pool level, but I might have missed something).

So the conclusion of this thread as of today is that it is better to see templates as short-lived, with or without specializations. And if specialization is desired, salting them is the best avenue.

The same applies to qubes and qube clones for specific use cases. I tend to agree here as well; most of my qubes are cloned for short-lived scenarios. And if I really need to keep track of a specific case, wyng-backup comes to the rescue: I personally prefer to have a limited number of qubes and work scenarios in the pool, with my tracked states deduped in my backups, keeping a really long trace of states.

I would not see benefits in keeping my fedora-34 template in the pool and trying to dedup it. Of course, its packages would all be completely different from fedora-36 by now. But I keep them all, deduped, in my wyng-backup. Yes, deduplication on send is expensive (keeping all the hashes to compare before writing, instead of just pointing to an already-existing block), but those backups are done offline, when needed. If I wanted quick backups, my usage of wyng-backup would change as well, and I would prune old states.

So basically, from this thread, as of today, my own interest in deduplication has lowered a bit. The idea of it being deployed by default has just vanished, and even having it activated by users with similar use cases seems questionable, if the choice is between becoming knowledgeable about the requirements for deduplication to actually pay off, versus becoming knowledgeable about salt and salting deployments.

Trying to resolve the traditional Qubes use case seems to result, most of the time, in the realization that there is no such traditional use case :slight_smile:

Still, I thought, from the documentation’s guidelines, that cloning and specializing templates was what everyone was doing. Like I said before, just deploying whonix-ws and whonix-gw deduplicated would save about 1 GB at initial template installation. And installing fedora-36 today, even salting the deployment to specialize templates (communication, proprietary, basic, etc.), will consume redundant (and deduplicable) space until fedora-37; then applying salt recipes to specialize again will consume lots of space until the next release, and so on.