Ext4 vs. Btrfs performance on Qubes OS installs

As of now, the whole story is in the thread SSD maximal performance : native sector size, partition alignment - #30 by rustybird, including the changes made by @rustybird in initramfs: sector-size agnostic partitioning of volatile volume by rustybird · Pull Request #85 · QubesOS/qubes-linux-utils · GitHub

To keep it really high level, and from my basic understanding as of now: when installing the system, three modes are proposed.

LVM (thick) creates fixed-size volumes, where the full volume size is allocated up front. Those volumes are created based on assumptions about what the installer, and the available tools, can learn from the hardware.

Thin LVM creates volumes at no up-front cost. This is really interesting because clones in thin LVM are free: when you clone qubes, the clones cost nothing until the volumes diverge. The only cost is the space actually consumed by their content; clones refer back to their original volumes and are copy-on-write, so they only diverge as writes land on the thin volumes themselves.
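To illustrate (a rough sketch with made-up pool/volume names, not the ones the Qubes installer actually creates), a thin snapshot allocates nothing until writes land on it:

```
# Create a thin pool and a thin volume inside it
sudo lvcreate --type thin-pool -L 100G -n pool00 vg0
sudo lvcreate -V 20G -T vg0/pool00 -n vm-root

# "Cloning" is a thin snapshot: free at creation time
sudo lvcreate -s -n vm-root-clone vg0/vm-root

# Data% of the clone stays near 0 until its content diverges from the origin
sudo lvs -o lv_name,origin,data_percent vg0
```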

XFS/Btrfs/ZFS all have similar mechanisms, but since the volumes are reflinked files on the filesystem, the kernel drivers and the pool implementation are the ones deciding how clones are handled, and LVM mechanisms are not used there. Different implementations, different optimizations.
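The file-level analogue of that is a reflink copy, which shares extents with the original until either side is modified. A minimal sketch (file names are just examples):

```
# On Btrfs or XFS (reflink=1): the copy consumes no extra space
# until one of the two files is written to
cp --reflink=always root.img root-clone.img
```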

On filesystem creation:
For LUKS creation at install time, if the sector size is not hardcoded or properly detected (cryptsetup 2.4 if I recall correctly, which is not part of dom0's current Fedora), the logical sector size is used, which is still 512 bytes instead of 4k. This is problematic because other tools reuse that assumption from the LUKS block layer to create the LVM pools. Then the scripts either reuse those logical sizes or hardcode a sector size, depending on the type of volume passed to the qubes. So rustybird patched volatile volume creation so that qubes keep the illusion of having a read+write root filesystem.

But a problem persists in being able to replicate and test optimized results. When templates are installed at install time, the root volume is not 4k. When service qubes and default appvms are created, private volumes are not created with 4k sectors. Some of the volumes passed into qubes (/dev/xvd*) require a partition table, and if that table is misconfigured, the installed system will simply refuse to launch.
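For reference, here is roughly how the LUKS sector size can be inspected, and pinned to 4k on a fresh format (device and mapper names are examples only, and luksFormat destroys data):

```
# What the existing LUKS mapping uses (look for "sector size:")
sudo cryptsetup status luks-xxxx

# The LUKS2 header also records the sector size of the data segment
sudo cryptsetup luksDump /dev/nvme0n1p3

# A fresh format with an explicit 4k sector size (destroys data!)
sudo cryptsetup luksFormat --type luks2 --sector-size 4096 /dev/nvme0n1p3
```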

This is where the discussion stalled under https://forum.qubes-os.org/t/ssd-maximal-performance-native-sector-size-partition-alignment. @rustybird figured out where the problems lie, proposed a fix for volatile volume creation, and said it would be more complicated to fix private and root volume creation. Consequently, I do not know as of now how to patch a live ISO at runtime (I can invest time there, but not now) to patch the code used for private and root volume creation at install time (phase 1 of the installer), so that templates are decompressed on top of a correctly configured LUKS partition. Nor do I know how to fix the code for private volume creation, which happens through Salt scripts on top of other scripts and Xen block-related code, to actually create the service qubes and default qubes prior to booting into the system. Last time I checked, no qube would launch at boot.

That is the shortest version I can give of the state of that long thread at https://forum.qubes-os.org/t/ssd-maximal-performance-native-sector-size-partition-alignment


Thank you @Insurgo, but you give me too much credit. I need to go and read about what those things are and what they do … not only in reference to our topic, but in general. :wink: Yes, I can use a search engine. Just asking if there is a particular introduction you found helpful.

Hmm. I am not sure where I would start.
Fedora explains why they switched to Btrfs: Choose between Btrfs and LVM-ext4 - Fedora Magazine


Sorry, I've been away for a week.

For your device, yes, it's possible to set it up like that.

Actually, using blake2b makes performance slower; the default is the crc32c algorithm. You can use xxhash64 for the best speed, but I don't know whether your CPU supports it.
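A quick sketch of how to check and pick the checksum (device path is an example; note that mkfs.btrfs calls the xxhash64 algorithm just xxhash):

```
# Is there hardware CRC32C (SSE4.2) support on this CPU?
grep -m1 -o sse4_2 /proc/cpuinfo

# Which checksum implementations does the kernel expose?
grep -E 'name\s+: (crc32c|xxhash64|blake2b)' /proc/crypto

# Format with an explicit checksum (valid values: crc32c, xxhash, sha256, blake2)
sudo mkfs.btrfs --csum xxhash /dev/mapper/example
```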

Further details: check here.

I have tested that using a 4Kn drive + a 4Kn template boosts overall performance, as per the benchmarks I did in the thread @Insurgo mentions.

The problems you may face if you use a 4Kn drive with the official ISO (512e template):

  1. With LVM + XFS/ext4 you won't be able to finish the installation; you need to set up everything manually.
  2. Btrfs doesn't have a problem with it.

And if you build a custom ISO with a 4Kn template, there'll be no problem.
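To check which kind of drive you have before picking a layout (device name is an example):

```
# A native 4K (4Kn) drive reports 4096 for both logical and physical sector size;
# a 512e drive reports logical 512 / physical 4096
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/nvme0n1
sudo blockdev --getss --getpbsz /dev/nvme0n1

# NVMe drives also list their supported LBA formats (requires nvme-cli);
# some can be switched between 512 and 4096, which erases the drive
sudo nvme id-ns -H /dev/nvme0n1 | grep 'LBA Format'
```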


Let us know what you find. I really hope 3-second VM startups become the norm someday.

Do you have thin partitions when using btrfs?

I tried reinstalling with btrfs, and now I’m seeing much higher disk usage in qube manager and when doing backup.

When a qube is on, the disk usage seems to be the size of the template + the size of the appvm, and when it’s off the disk usage is just the size of the appvm.

This has increased the backup size by 300-400% when doing a full system backup.

Are you seeing the same numbers, or did I do something wrong?

This is due to a difference in what the storage drivers (lvm_thin vs. file-reflink) consider to be a volume's disk usage, which then leads to weird-looking results when Qube Manager unconditionally sums up all volumes of a VM. But it's “only” cosmetic.

If you mean the size prediction in the GUI backup tool's VM selection screen, that's a different cosmetic bug. It shouldn't affect the actual backup size.
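If you want to see where the numbers come from, the per-volume view makes the driver difference visible (VM name is an example):

```
# Per-volume properties as reported by the storage driver,
# including the "usage" figure that Qube Manager sums up
qvm-volume info work:root
qvm-volume info work:private
qvm-volume info work:volatile
```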


Okay, I had all VMs added in the backup tool, and the total size got me a little nervous.

Thanks for the explanation.

Sorry to wade into this a bit late, but you’re quite right about the default LUKS sector size… seems sub-optimal.

However, the Thin LVM chunk size has a minimum of 64KB and is usually larger, depending on the pool LV size at the time of creation. My main system uses 64KB despite having a large pool; I assume this enhances random write performance but haven't tested it. #write_amplification
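For reference, roughly how to inspect the chunk size of an existing pool, and how it could be pinned when creating one (pool/VG names are examples, not what the installer creates):

```
# Chunk size of an existing thin pool
sudo lvs -o lv_name,chunk_size,data_percent vg0/pool00

# Pinning the chunk size at pool creation time
sudo lvcreate --type thin-pool -L 500G --chunksize 64K -n pool00 vg0
```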

On the 'cost' of Thin LVM snapshots: making snapshots is essentially free, but deleting (and, oddly enough, renaming) snapshots takes a significant amount of time. Those operations are processed by the kernel in a single-threaded fashion, and I usually see 80-100% CPU for >5s when Qubes or Wyng deletes a large snapshot.

Btrfs: my understanding is that it is extent-based but has a settable sector size via mkfs.btrfs, with a default of 4096. I think a good basis for comparison would have LUKS set to 4096, Btrfs at the default 4096, and the Thin LVM pool at 64KB.
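A quick way to see or set it (device path is an example; 4096 is already the default on x86_64):

```
# Sector size recorded in the superblock of an existing filesystem
sudo btrfs inspect-internal dump-super /dev/mapper/example | grep sectorsize

# Setting it explicitly at creation time
sudo mkfs.btrfs --sectorsize 4096 /dev/mapper/example
```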


I agree. Also, be sure you are not using the deprecated file driver. That will have terrible performance no matter what, and is going away as it does not have feature parity with the others.

One possible reason that deleting snapshots is so expensive is that Qubes always does a blkdiscard before an lvremove. Thin pools do not handle discards well at all.
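In other words, the removal path boils down to something like the following (the LV name is hypothetical):

```
# Discard the thin volume's blocks first, then remove the LV;
# on thin pools the discard pass is what takes the time
sudo blkdiscard /dev/vg0/vm-work-private-1234-back
sudo lvremove -f vg0/vm-work-private-1234-back
```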

@demi What is the state of the loop device PR merge, so that benchmarking would make sense under Qubes at some point?

Merged already, will be in the next vmm-xen release.


@Demi It would be helpful to link the PR and the qubes-testing URL if the goal is to have those fixes known and tested under the testing section of the website…

Otherwise who tests what, really?

@Demi Don't get me wrong on the tone here, but there were a lot of regressions in 4.1 compared with the 4.0 stability experience.

My point here is that:

Is not enough. I'm following GitHub - QubesOS/updates-status: Track packages in testing repository as closely as I can, and I see no vmm-xen to be tested, nor fixes for suspend/resume to be tested, with PRs taking way too long to land even in the unstable repo. I would expect things to be much more verbose under the testing section of this forum, and my guess is that there is a lot of confusion among even willing testers about what is to be tested and whether those things ever reach them.

How we can improve that should be discussed under the testing section, not here, but this subject will be a good reference to justify those testing discussions, which is why I'm writing it here. No blame here or anything, but I see a lot of room for improvement through better communication and appropriate pointers.

@demi: I see that the vmm-xen PR has been approved.
I created a new post under What to test? Where to get what to test? Where to report testing results? - #3 by Insurgo so that this important package update is properly tested by the testing community.

Please let's continue the “testing” process discussions over there.

Can you run tests in these configurations?

  1. LVM thin provisioning + XFS, with the lvm_thin storage driver.
  2. LVM thick provisioning + XFS, with the reflink storage driver and --direct-io=on passed to losetup in /etc/xen/scripts/block.
  3. LVM thick provisioning + XFS, with the reflink storage driver and --direct-io=off passed to losetup in /etc/xen/scripts/block.
  4. BTRFS + Blake2b, with the reflink storage driver and --direct-io=on passed to losetup in /etc/xen/scripts/block.
  5. BTRFS + Blake2b, with the reflink storage driver and --direct-io=off passed to losetup in /etc/xen/scripts/block.

If you are using a 4Kn disk, skip 2 and 4 as they won’t work (you won’t be able to boot any VMs).
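To set up and check configurations 2-5, the relevant losetup call lives in the script mentioned above; a rough way to find it and to confirm what a running VM's loop device ended up with:

```
# Locate the losetup invocation used by the reflink driver
grep -n losetup /etc/xen/scripts/block

# After adding --direct-io=on (or off) there and starting a VM:
# DIO shows whether direct I/O is active, LOG-SEC the loop device's sector size
losetup -l -O NAME,DIO,LOG-SEC,BACK-FILE
```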

Sorry, @Demi was right: a 4Kn disk is generally problematic even without 4K dm-crypt.

Configuration 4 definitely works, at least in the sense that it doesn’t produce an error, and losetup -l shows the intended result of direct I/O + 512B sectors even though the underlying dm-crypt block device is 4K. I’ve been using this configuration (except with the default checksum function) for months.

It’s XFS (configuration 2) and ext4 that are not so flexible.

I think it’s explained by…

So it’s in line with @Insurgo’s observations.

I decided to do a Btrfs install with the recommendations from this and the “SSD maximal performance” threads, and settled on the idea of formatting a two-device Btrfs fs with the options -O no-holes --csum xxhash for better efficiency. All on top of a 4K-aligned LUKS partition, using GPT/gdisk.

The Btrfs part turned out to be a fool's errand, as neither anaconda nor kickstart seems to support passing custom options to mkfs, and anaconda/blivet insists on not installing into an existing fs.

So instead of doing a full dom0 root+everything Btrfs setup, I installed Qubes with a 25GB XFS partition and am now configuring custom Btrfs partitions to hold all the domU stuff. If qvm-pool cooperates, I'll be sitting pretty on my test system, also with Linux kernel 6.1 or 6.2, which have Btrfs optimizations that should greatly impact large-file access.
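For the record, attaching a custom Btrfs mount as a file-reflink pool should look roughly like this (pool name and mount point are examples, and the qvm-pool syntax may differ slightly between releases):

```
# Register the Btrfs mount point as a new file-reflink pool
qvm-pool add btrfs-domU file-reflink -o dir_path=/var/lib/qubes-btrfs
qvm-pool info btrfs-domU

# New qubes can then be placed in that pool
qvm-create -P btrfs-domU --label red test-vm
```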

Can you be more specific? After manually configuring via tty, do a drive rescan and blivet will read it.