EDIT: See the bottom of this post for pointers to the discussion points.
Conclusion: deduplication does not seem to be desired as a default option for reclaiming the space consumed by clones over the lifetime of a Qubes installation.
What is recommended here, again, is to salt the specialization of templates where it is required, or to minimize template specialization so that redundant space consumption does not become a problem over time. Remember that reinstalling templates fresh removes the problem altogether.
Cloning templates should again be seen as a place for experiments. Bringing the result of a successful experiment back into the origin template, and then deleting the clone, is the desired outcome for space effectiveness. Best of all: the salt recipe for that experiment should probably be shared with the community. That seems to be the conclusion of this thread: deduplication might be costly for its benefit, while its benefits would mostly be felt by people not managing their qubes/templates/clones effectively over time. Restarting fresh is still the best advice here.
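To make that concrete, here is a minimal sketch of that workflow as I understand it. The state name `my-specialization`, its package list and the template names are made-up placeholders, and the exact `qubesctl` invocation may differ on your setup:

```
# In dom0: clone the current template only to experiment.
qvm-clone fedora-36 fedora-36-experiment

# ... test the changes inside fedora-36-experiment ...

# Capture the successful result as a salt state instead of keeping the clone.
# Hypothetical /srv/salt/my-specialization.sls that just installs packages:
#   my-specialization-packages:
#     pkg.installed:
#       - pkgs:
#         - vim
#         - git

# Apply it to the origin template (the usual qubesctl pattern; details may vary):
sudo qubesctl --skip-dom0 --targets=fedora-36 state.apply my-specialization

# The clone is no longer needed, so reclaim its space:
qvm-remove fedora-36-experiment
```

That way the clone stays short-lived, the customization is reproducible on the next fresh template, and the space cost of a diverging clone never accumulates.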
This post will be edited multiple times. I am trying to wrap my head around what kind of optimizations, or filesystem/pool choices, could land in Qubes for users following the actual Qubes best practice of cloning templates to specialize them for their use cases, without storage costs multiplying when those clones are long-lived, receive the same package updates across all clones, and naturally diverge over time since each fills a different use case.
I thought this post was a relevant place to expose the problem of doing so:
But I seem to have confused the OP, and the answer I got is basically that LVM is unfit for this and that the user is expected to discard specialized templates and rebuild them from scratch with scripts/salt:
Yes, a package caching proxy (cacher) can be used to download the same package update once and install it multiple times, but the result is that thin LVs are not really “thin” anymore: across the thin LVM pool, the LVs (the specialized clones) consume the same data multiple times.
Following that guideline on the traditional, default installation means clones cost nothing at the moment they are created. But from the moment the origin of the clone and the clones themselves receive updates, all of those volumes grow independently, and the total space consumed multiplies with the number of clones.
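One way to observe this on a default install (a sketch; the volume group is typically `qubes_dom0` and the thin pool `vm-pool`, but the names can differ on your system) is to watch how much data each volume actually holds in the thin pool:

```
# List thin volumes with their virtual size, the percentage of data actually
# written, and their snapshot origin (standard lvs report fields):
sudo lvs -o lv_name,lv_size,data_percent,origin qubes_dom0

# Right after a clone, the clone shares its blocks with its origin at
# near-zero extra cost; after a round of updates in every clone, each
# volume's data_percent grows on its own and the shared blocks shrink.
```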
From my current understanding (this is why I am opening the discussion), LVM is not made to provide native inter-LV (pool-level) deduplication.
So the question here is: are there any more effective, and desirable, alternatives that do not require end-users to scrap their templates and start from scratch all the time?
Here are some notes from actual research. @brendanhoar, maybe this is a better place to share your filesystem/pool opinion (@Demi, did you get answers in that area in your quest?)
- LVM
  - VDO on LVM: unfit for templates
    - This technology is aimed at removing duplicates inside one and the same LV. Multiple copies of an ISO inside that LV will not consume multiple times its size, but identical data spread across separate LVs is not deduplicated.
    - Worse, if one backs up an unused template and then restores it later, an unaware user might blow up their vm-pool when restoring multiple template backups.
      - wyng-backup can limit that, but its application is manual right now. A user cloning the most up-to-date template and “receiving” a backup volume with `--sparse-write` will only inject into the thin-provisioned clone the blocks that differ from its origin.
      - The thin-provisioning-tools maintainer is implementing a backup/restore mechanism based on a similar idea.
  - Conclusion: nothing currently exists at the pool level to handle inter-LV (pool-level) deduplication.
- ZFS: native pool-level deduplication exists and is applied on live systems (see the sketch after this list)
  - Deduplication Table (DDT)
  - seems really effective
  - Cost for deduplication is between 1 and 2 GB of RAM per 1 TB of storage
    - NOT CONSENSUAL. Might be way lower (or higher); it depends heavily on record size and on how much of the data is actually unique.
- Tivoli (IBM, proprietary?): native pool-level deduplication exists
- BTRFS: not fit for live (inline) deduplication. Dedup is applicable offline/out-of-band.
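For the ZFS entry above, here is a rough sketch of what enabling and inspecting pool-level dedup looks like. The pool/dataset name `tank/vm-pool` is a placeholder, and the RAM figure is only the usual back-of-the-envelope estimate, not a measurement:

```
# Enable inline deduplication on a dataset (per-dataset ZFS property):
sudo zfs set dedup=on tank/vm-pool

# Pool-wide dedup ratio achieved so far:
sudo zpool list -o name,size,allocated,dedupratio tank

# Deduplication table (DDT) histogram:
sudo zpool status -D tank

# Rough RAM estimate: each unique block costs on the order of a few hundred
# bytes of DDT. With 128 KiB records, 1 TiB of unique data is ~8 million
# blocks, i.e. roughly 2-3 GiB of DDT, which is why the per-TB numbers quoted
# in the wild vary so much with record size and data uniqueness.
```

For BTRFS, the offline/out-of-band route mentioned above is typically done with batch tools such as `duperemove`, run manually against the mounted filesystem.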
Do other solutions exist?
Notes from the current threads:
- There is no consensus that deduplication is needed for Qubes usage.
- Nor is there consensus on how heavily cloned qubes are used in practice.
- Where the achievable dedup ratio is low, the memory used for deduplication grows without being useful.
- Qubes upstream decisions tend to limit memory consumption for good reasons, and deduplication will probably never be enabled by default, even if the default installation’s LVM pool switched to ZFS.
- @Rudd-O restated his desire to have a ZFS pool implementation, since file-based pools are to be deprecated soon.
- Reflink support in OpenZFS is not yet a thing (a short illustration of what reflinks would provide is sketched after this list).
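For context on what reflink support would buy (a sketch that works today on BTRFS and XFS, not on ZFS; the file names are placeholders):

```
# On a filesystem with reflink support (BTRFS, XFS), this copy shares all of
# its blocks with the source until either file is modified (copy-on-write):
cp --reflink=always fedora-36-root.img fedora-36-clone-root.img

# On OpenZFS (at the time of writing) there is no reflink/block cloning, so a
# file-level copy like this duplicates the data.
```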
The general advice to mitigate the need for pool-level, automatic and live deduplication of redundant data between a pool’s volumes is still:
- Noting down manual package deployments on top of freshly downloaded templates, scripting those additions or, best of all, salting those customizations so that starting fresh is always an option (I tend to agree as well. Since the Q4.1 release we went from fedora-34 to fedora-35 to fedora-36 in just a couple of months. This is tiresome, and I am personally learning salt now and pushing for salt repositories. I think this is a better investment for the whole community in the long run than trusting that dedup might help, while it seems it might not for many Qubes use cases).
- The big warning in all the deduplication documentation I have read so far (all the bold warnings in this old article still seem valid) is: make sure your use case will benefit from deduplication before activating it. Otherwise the costs are far bigger than the benefits, and going back also requires recreating the pool.
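On ZFS there is at least a way to make that check before committing: `zdb` can simulate deduplication on an existing pool (the pool name `tank` is a placeholder):

```
# Simulate deduplication without enabling it: walks the pool and prints the
# DDT histogram plus the dedup ratio the current data would achieve.
sudo zdb -S tank
```

If the simulated ratio comes out close to 1.0x, the DDT memory cost would buy essentially nothing.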