SSD maximal performance: native sector size, partition alignment

4096-byte sector benchmark:

CPU: i7-10750H
Storage: WD SN730 512GB
File system: LVM + XFS

Linux dom0 5.10.90-1.fc32.qubes.x86_64 #1 SMP Thu Jan 13 20:46:58 CET 2022 x86_64 x86_64 x86_64 GNU/Linux

Startup finished in 8.740s (firmware) + 2.472s (loader) + 2.947s (kernel) + 7.879s (initrd) + 3.588s (userspace) = 25.628s
Startup finished in 8.720s (firmware) + 2.469s (loader) + 2.947s (kernel) + 7.896s (initrd) + 3.679s (userspace) = 25.713s
Startup finished in 5.331s (firmware) + 2.479s (loader) + 2.947s (kernel) + 8.438s (initrd) + 3.619s (userspace) = 22.816s 

5.75
4.60
4.59
4.61
4.59

# directio
Finished, time 13:27.041, 486745 MiB written, speed 603.1 MiB/s

# no directio
Finished, time 12:29.073, 486745 MiB written, speed 649.8 MiB/s

There seems to be an error when running reencrypt without direct I/O at 512-byte sector size, which shows only 50 MiB; probably a hardware issue, but I'm clearly seeing a 500+ hour ETA, hmm…
But as you can see in the new benchmark at 4096 without direct I/O, it looks better.
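
For reference, the legacy cryptsetup-reencrypt tool is the one that exposes an explicit direct I/O switch; this is only a hedged sketch of the two variants benchmarked above (the device path is an example, whether it matches the exact invocation used here is an assumption, the volume must be closed, and a backup is advisable):

    # with direct I/O
    cryptsetup-reencrypt --use-directio /dev/nvme0n1p3
    # without direct I/O
    cryptsetup-reencrypt /dev/nvme0n1p3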

I have tried browsing, watching movies, etc., and everything seems fine.

FYI, loop devices are still using a 512-byte sector size. I have a workaround for that, but I think it's complicated for a non-technical person to apply. If you want to open an issue, kindly open one for this too.
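
For illustration, a loop device can be forced to present a 4K logical sector size with a recent util-linux; this is only a sketch (the image path is an example, and it is not necessarily the exact workaround referred to above):

    # loop devices default to 512-byte logical sectors
    losetup --sector-size 4096 --find --show /path/to/volume.img
    blockdev --getss /dev/loop0   # should now report 4096 for the newly attached loop device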

And let's see about part 2 on the FireCuda NVMe; maybe I will make a guide for this.

1 Like

Please file an issue for this.

Qubes OS should definitely default to 4096-byte sectors unless it has reason to believe a different sector size is better. My understanding is that 512-byte sectors are almost always emulated nowadays, with the actual sector size being 4096 bytes.
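
One way to check whether a drive is doing 512-byte emulation, and to switch an NVMe drive to its native 4096-byte LBA format, is roughly the following (a sketch assuming nvme-cli; the LBA format index for 4096-byte sectors varies per drive, and nvme format erases all data):

    cat /sys/block/nvme0n1/queue/logical_block_size    # 512 on a 512e drive
    cat /sys/block/nvme0n1/queue/physical_block_size   # 4096 on a 512e drive
    nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"     # list the supported LBA formats
    nvme format /dev/nvme0n1 --lbaf=1                  # DESTROYS ALL DATA: switch to the 4096-byte format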

2 Likes

Will do.

Yes, an fs reformat would automatically use 4096 if the sector size is already 4096. I have at least 4 SSDs with 4K support, but none of them use 4096 as the default; perhaps it's because of compatibility issues, and that's why many vendors don't use it as the default…

The current Qubes installation uses cryptsetup version < 2.4, which doesn't automatically use a 4096-byte sector size for LUKS: RFC: Default sector size (!135) · Merge requests · cryptsetup / cryptsetup · GitLab.
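
To see what an existing install actually ended up with, one can check the cryptsetup version and the sector size recorded in the LUKS2 header (the device path is an example):

    cryptsetup --version
    sudo cryptsetup luksDump /dev/nvme0n1p3 | grep -i sector   # typically "sector: 512 [bytes]" on a default install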

Using 4K on LUKS should improve performance too, as reported here: https://www.reddit.com/r/Fedora/comments/rzvhyg/default_luks_encryption_settings_on_fedora_can_be/

But using 4K LUKS would lead to another error like I mentioned above; manual configuration is needed. I'll report and open an issue after ensuring that 4K loop devices work fine.

4 Likes

VM boot benchmark:

#full-4096
3.77
3.89
4.03
3.86
3.90

#minimal-4096
3.68
3.67
3.62
3.79
3.68


#full-512b
3.62
3.70
3.82
3.58
3.62

kdisk benchmark:

I don't see any improvement with the 4Kn drive in the VM, but I might be wrong, since I haven't done other tests.

1 Like

I would suggest increasing the sample size. Those SSD drives have big memory caches (RAM on board), which will hide the real results until that cache is filled/missed.

Your results already show some differences, though they are difficult to interpret. Most of your write tests show improvements, while most of your read tests show the opposite.

1 Like

This doesn't seem to be a problem with Btrfs. I've successfully converted my LUKS2 device to 4K sectors.
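
For anyone wanting to try the same conversion, a hedged sketch (assuming a cryptsetup version recent enough to change the sector size during re-encryption; the device path is an example, the volume must be unmounted/closed, its data size must be a multiple of 4096, and a backup is strongly advised):

    # offline conversion of an existing LUKS2 device to 4096-byte sectors
    sudo cryptsetup reencrypt --sector-size 4096 /dev/sdX2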

2 Likes

Tried to give this some time; I'm getting errors where cryptsetup complains that alignment is impossible when formatting…

Under /sys/block/sda/queue/:

  1. hw_sector_size: 512
  2. logical_block_size: 512
  3. max_segment_size: 65536
  4. minimum_io_size: 4096
  5. optimal_io_size: 0
  6. physical_block_size: 4096

This is a Crucial MX500 on a SATA2 controller.
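
For completeness, those queue limits can be dumped in one go:

    grep . /sys/block/sda/queue/{hw_sector_size,logical_block_size,physical_block_size,minimum_io_size,optimal_io_size}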

Steps:

  1. wipefs -a /dev/sda
  2. gdisk /dev/sda
  3. x (expert menu)
  4. l 4096 (change alignment to 4096 sectors)
  5. m (return to main menu)
  6. o (create GPT table)
  7. y (accept)
  8. n (new partition)
  9. 1
  10. Enter (selects 4096 as first sector, good)
  11. +1GB (for boot partition)
  12. Enter (8300 Linux filesystem partition type)
  13. n (new partition)
  14. 2 (second partition)
  15. Enter (choose next sector)
  16. Enter (choose last sector)
  17. Enter (8300 Linux filesystem partition type)
  18. w (write)
  19. cryptsetup -c aes-xts-plain64 -h sha512 -s 512 --sector-size 4096 luksFormat /dev/sda2

Still no luck whatever I do. cryptsetup luksFormat with --sector-size 4096 gives "Device size is not aligned to requested sector size".
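
That error usually means the size in bytes of the device being formatted is not a multiple of 4096 (the partition was created on 512-byte boundaries), which can be checked like this:

    blockdev --getsize64 /dev/sda2                           # partition size in bytes
    echo $(( $(blockdev --getsize64 /dev/sda2) % 4096 ))     # must print 0 for --sector-size 4096 to work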

  1. wipefs -a /dev/sda
  2. fdisk /dev/sda -b 4096
  3. g (create GPT partition table)
  4. n (new partition)
  5. 1
  6. Enter (first sector 256)
  7. +1GB
  8. n
  9. 2
  10. Enter
  11. w
    The disk is not refreshed even if calling partprobe /dev/sda… Rebooting into ISO
  12. cryptsetup -c aes-xts-plain64 -h sha512 -s 512 --sector-size 4096 luksFormat /dev/sda2
    Damn. fdisk never syncs the changes?! damnit.

Redoing gdisk:

  • wipefs -a /dev/sda
  • gdisk /dev/sda
  • o (create GPT table)
  • y (accept)
  • n (new partition)
  • 1
  • Enter (selects 2048 as first sector?)
  • +1GB (for boot partition)
  • Enter (8300 type linux partition)
  • n (new partition)
  • 2 (second primary partition)
  • Enter (choose next sector)
  • Enter (choose last sector)
  • Enter (chooses 8300 Linux filesystem)
  • w (write)
    Still not aligned.
  1. wipefs -a /dev/sda
  2. cryptsetup -c aes-xts-plain64 -h sha512 -s 512 --sector-size 4096 luksFormat /dev/sda
    Works… Alignments are wrong when partitioning with sfdisk and gdisk.

Some notes:

https://linux-blog.anracom.com/2018/12/03/linux-ssd-partition-alignment-problems-with-external-usb-to-sata-controllers-i/


---- Edit of what worked
partprobe without drive specification worked… Weird but nice.
CTRL-ALT-F2 (console)

  1. wipefs -a /dev/sda
  2. fdisk /dev/sda
  3. n (new partition)
  4. 1
  5. p (primary)
  6. Enter (first sector 2048)
  7. +1GB
  8. n
  9. 2
  10. p (primary)
  11. Enter
  12. w
  13. partprobe
  14. cryptsetup -c aes-xts-plain64 -h sha512 -s 512 --sector-size 4096 luksFormat /dev/sda2
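
At this point it is worth confirming that the kernel really picked up the new partition table and that the LUKS header was created with 4 KiB sectors (a quick check, not part of the original steps):

    lsblk -o NAME,SIZE,LOG-SEC,PHY-SEC /dev/sda
    cryptsetup luksDump /dev/sda2 | grep -i sector   # should report 4096 [bytes]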

Then, resuming your instructions:

  1. cryptsetup luksOpen /dev/device luks
  2. pvcreate /dev/mapper/luks
  3. vgcreate qubes_dom0 /dev/mapper/luks
  4. lvcreate -n swap -L xxG qubes_dom0 ( ex : 8G / 16G )
  5. lvcreate -T -L 20G qubes_dom0/root-pool
  6. lvcreate -T -l 90%FREE qubes_dom0/vm-pool
  7. lvs (to check your vm-pool size)
  8. lvcreate -V20G -T qubes_dom0/root-pool -n root (Why isn't 20G enough for simple recovery, if multiple templates aren't being installed at the same time from dom0? I am not ready to reserve 40G for dom0 and preferred it when it was in the same vm-pool, growing dynamically with better warnings, but I agree with it being in a distinct pool now.)
  9. lvcreate -VxxxG -T qubes_dom0/vm-pool -n vm ( ex : -V60G / -V360G )
  10. mkfs.ext4 /dev/qubes_dom0/vm (no need to specify the sector size if your disk already uses 4096)
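
Before returning to the installer, one can sanity-check the sector sizes across the whole stack (a hedged example; device and volume names depend on your system):

    lsblk -o NAME,TYPE,FSTYPE,LOG-SEC,PHY-SEC
    # the luks mapping and the LVs on top of it should all show LOG-SEC 4096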

ctrl + alt + f6 (return to installer)
enter the disk screen and rescan, choose the drive, select custom (not blivet), click unknown, and set:

  • 1 GiB > format > ext4 > mount point /boot > update
  • 40 GiB > format > xfs / ext4 > mount point / > update
  • (swap) > format > swap > update
  • leave qubes_dom0/vm alone.
    click done. Accept changes, Begin installation.
    Reboot.

Second stage install: went ahead and configured system as wanted.
Templates install, but the install fails at configuring sys-firewall. On reboot, all VMs are there but none starts properly.
When looking at /dev/mapper: the sys-net, sys-firewall, and sys-usb related filesystems were not created, which of course makes starting those VMs fail.

@51lieal

After boot: don't configure anything, click done, and log in.

Any insights on why continuing the second-stage install doesn't work at that point?

1 Like

I'll try to answer based on the questions I found there.

  • fdisk definitely works if you're installing using BIOS, and gdisk for UEFI.

Partition tables can expect a 1 MiB offset for the beginning of the first partition, which means 2048 sectors on 512-byte disks or 256 sectors on 4K disks (2048 × 512 B = 256 × 4096 B = 1 MiB).

So when you run fdisk there, it should show 256 as the default first sector, which is fine and good to use.

  • Actually, 20 GB is enough if we can manage what data goes on the root partition. For example, sometimes we install 2-3 templates at once, which can cause template installation to fail because there is not enough space in dom0; some users here have experienced it (back when everyone was failing to install the Kali template), and I think you might fail too, since 20 GB is not enough for installing the 4 default templates, unless you install them one by one and delete the previous data.

If you want to give Btrfs a try, it's also good; everything works out of the box. You can find the layout here; just ignore 1-2 things in the drive section.

1 Like

As far as I understand, the reason why 4K dm-crypt breaks some VM volumes on LVM Thin but not on Btrfs is a combination of two things.

  1. LVM Thin uses the same logical sector size as the underlying (dm-crypt) block device. And then a 4K LVM Thin block device in dom0 results in a 4K xen-blkfront block device in the VM, because Xen automatically passes through the logical sector size.

    Whereas file-reflink layouts like Btrfs use loop devices, which are currently always configured (by /etc/xen/scripts/block from Xen upstream) with 512-byte logical sectors - again passed through to the VM.

  2. The "root" xvda and "volatile" xvdc volumes don't properly work with 4K sectors because they are disk images containing a GPT/MBR partition table, which can only specify sizes and locations in sector units:

    • The VM initramfs script formatting "volatile" on every VM start currently assumes that a sector is 512 bytes, which should be straightforward to fix (WIP)

    • It's going to be more difficult to somehow make the "root" volume sector-size agnostic…

    (The "private" xvdb and "kernel" xvdd volumes seem to work fine if /etc/xen/scripts/block is patched to configure them with 4K sectors. They're just ext4/ext3 filesystem images without a partition table.)
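
To see what a VM actually gets for each of these volumes, the logical sector size can be read from inside the VM (an illustrative check, not a fix):

    grep . /sys/block/xvd*/queue/logical_block_size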

2 Likes
  1. Then why is vm-pool fine while varlibqubes is not? Both are using the same driver. It could be because of my 4Kn template, but I don't think so; I haven't rechecked since then.
  2. If you see this screenshot below, it already uses 4K sectors.

I don't get the question. What does "fine" mean? And aren't you benchmarking 512-byte sectors on XFS/file-reflink varlibqubes, vs. 4K sectors on LVM Thin vm-pool (using IIUC a custom partitioned TemplateVM root volume) - which would be two very different storage drivers? Oh, your vm-pool is XFS/file-reflink on top of LVM Thin? Okay, that would be the same Qubes storage driver then, but it's still a different (and unusual) storage stack.

1 Like

@rustybird: This is really interesting!!! Please poke me with updates on this. It won't land in Qubes before the next release for sure, but this is a really pertinent advancement in my opinion, even more so if it applies to the default partition scheme (thin LVM, separated root/vm pools).

1 Like

I recently installed Qubes exactly as you have described: creating partitions, specifying qvm-pool, manually installing templates, etc. It all worked well, and I'm grateful for your instructions.

However, I couldā€™t get any of VM to start. In their log, I saw that they complained about the filesystem of /xvdc, as you have described in that GitHub issue. I think that line of qvm-pool command was intentioned to avoid this ( by using lvm thin pool, as you said on GitHub), but unluckily it didnā€™t work for me.

Should I reinstall Qubes, or should I build 4Kn templates and find a way to transfer them into dom0 without any VM running? Thank you!

Btw, my self-built 4Kn template also fails to start for the same reason, on a Qubes install with a 512e SSD.

1 Like

Well, I need more detail on the steps you have done, but let's see; next week I'll try to create a guide covering everything from changing the LBA format to using a 4K template.

@rustybird Any conclusion/updates/findings?

Only this pull request:

1 Like

@rustybird Sorry I was not more specific: I meant the root and private volume creation: was that tested and working?

So if I understand correctly, I could apply your patch and have the volatile volume fixed. But for creating root and private volumes, I would need to build an ISO, or patch the stage 1 and stage 2 install so that when templates are decompressed those are fixed, to create a working system and be able to compare performance properly with/without the fixes.

I was looking for next steps to get the main devs' attention on the actual performance losses/differences shown in this thread.

Otherwise, people are trying to get away from the LVM thin provisioning model at install as of now. Some want ZFS, XFS, or Btrfs, since the speed differences are quite significant.

One example of that is from @Sven at https://forum.qubes-os.org/t/ext4-vs-btrfs-performance-on-qubes-os-installs, showing gains of ~300 MB/s write speed by choosing Btrfs at install vs. the default thin provisioning:

Fixing LUKS+LVM thin provisioning would be great. Otherwise LVM gets blamed for performance losses, when other implementations simply don't suffer from the same implementation flaws that LVM thin provisioning does, per Qubes' implementation of volatile, private and root volume creation.

@Demi maybe? I think @rustybird showed where love is needed here: SSD maximal performance : native sector size, partition alignment - #30 by rustybird

2 Likes

Not sure that I understand your question, but standard (i.e. not in, say, a standalone HVM) private volumes are already sector-size agnostic in their content, so compatibility-wise it doesn't matter whether they are presented to the VM as 512 B or 4 KiB block devices.

Standard root volumes have sector-size specific content, and I don't think it's feasible to dynamically patch that volume content (specifically, the partition table) in dom0, because it contains untrusted and potentially malicious VM controlled data.

Backward compatibility is a real headache here. It seems like the existing root and private volumes should simply be presented to the VM as 512B devices by default for now. In the case of an LVM installation layout, that might even entail forcing 512B sectors for the whole LUKS device - unless there's a good way to set an independent sector size for the LVM pool or ideally per LVM volume.

1 Like

Cross-referencing an important post by @tasket (a filesystem-knowledgeable person with a lot of hands-on experience, behind wyng):