SSD maximal performance : native sector size, partition alignment

Insurgo · March 14, 2022, 2:36pm

Hello everyone.

SSDs are not a new technology, but even today some tweaking still seems to be needed to set them up for best performance, and that tuning is not done by default. Reported performance gains are massive: some report random-read gains of between 100 MiB/s and a gigantic 500 MiB/s. I have not been able to replicate this so far, but I wonder whether it should be investigated by willing testers, me included.

I’m writing this post for numerous reasons, the first being to have knowledgeable people jump aboard and correct the facts, maybe even from upstream linked documentation.

The reason why this is important, at least for Insurgo, is that a deployed OEM disk image is expected to be re-encrypted by the OEM (so that the disk encryption key is unique) and then again by the end user on reception of the hardware. That way, even if the OEM kept a copy of the LUKS header and was under a gag order, the OEM could not disclose a header that authorities could restore and then unlock by typing the OEM-provisioned Disk Recovery Key passphrase. Re-encrypting drives is probably the only case where massive gains would be directly observable, since it exercises every important level at once: the reads are not random, the SSD cache is stretched to its limit reading chunks in advance, the data is re-encrypted in RAM and then written back to disk, and the firmware-specific physical block sizes come into play (erase blocks, “sector” sizes and partition alignment), so that what is read and written does not require the firmware to touch other blocks for each read and write operation.

With newer basic testing on Q4.1 installations, it came to my attention that older SSDs (unknown Chinese OEMs included), old refurbished drives that came with received laptops (not EVO PRO 860 or other really well-performing SSDs), were actually performing way better than before at re-encrypting the disk through cryptsetup-reencrypt. And by a big margin: 80 MiB/s became 145 MiB/s! But why?

It reopened an old, forgotten can of worms for me. In the past, I made a proof of concept to network-clone an OEM disk image across multiple devices using Clonezilla in BT (BitTorrent) cloning mode. The cloning operation worked great in all cases, speed being limited mainly by Ethernet cable quality and switch performance. But at the cryptsetup-reencrypt step, run on clients that had just received a perfect copy of the OEM disk image (in dd mode, since nothing inside the LUKS container can be interpreted, so the input is written to the destination unmodified), performance was really poor unless the source and destination drives were of the same brand. The Clonezilla tool stack was thought to be to blame, since a physical clone (a black box, unfortunately) did not show the same result. I stopped investigating there since I had a working solution, even though it did not scale as well as Clonezilla would have: limited only by network switch performance, with all nodes in Clonezilla BitTorrent mode validating downloaded chunks and swarming locally, leaving the cloning of the drives completely unattended. The dream was good, but it performed correctly only when identical disks were used, while physically cloning drives with a cloner did not show the same behavior. Clonezilla was thought to be changing something, somewhere.

I banged my head on this for a really long time, opening tickets with Clonezilla and trying to understand what was happening, and simply gave up without understanding why re-encryption performed so poorly when disks were cloned across SSDs from different manufacturers. That was at the time of the Q4.0.x releases, which means different and older tools used by anaconda at install: older parted, cryptsetup and mkfs toolsets. Those tools make decisions based on what is reported to the OS at the moment the partition table and the partitions themselves are created, which seems to explain why disk images cannot be copied between drives that do not report the same sector sizes and require different alignment for best performance. Again, the performance difference showed up dramatically only when re-encrypting the drives. Day-to-day operations did not seem to be affected for random reads and writes, the SSD firmware being able to handle block writes in a delayed manner that did not seem to impact IO. One thing is sure: what happens at the LUKS container level, which determines what is actually read from and written to disk, has a real impact on re-encryption, where the firmware is not dealing with random reads and writes. The old Q4.0 image, with its LUKS alignment to 512 bytes, seems to be the culprit. Note also that, as warned under the dm-crypt section here, it is not possible to simply call cryptsetup reencrypt --sector-size=4096 device on an existing LUKS container: the filesystems inside the encrypted container will indeed be broken. Funnily enough, all templates got corrupted, while private LVM volumes were fine. Another can of worms.

Now the interesting bits following this tuning article rabbit hole:

SSD drives tested (Crucial MX500, Samsung EVO 870, Samsung EVO 860 PRO) report different native sector sizes, which are mostly wrong. 512 may have been true for older drives (the one tested in my quick Q4.1 install) but newer drives should report 4096 to be properly aligned.
cryptsetup 2.4.0 is supposed to be able to detect and take advantage of a properly reported native sector size (the Q4.1 installer has 2.3.3, so this is not a feature we could take advantage of even if the drive reported its block size properly in /sys/block/sdb/queue/physical_block_size). This can be overridden at the cryptsetup luksFormat call to force 4096-byte sectors; otherwise the LUKS partition uses 512 bytes by default.
mkfs calls are also aligned to 512 bytes unless 4096 bytes is specified manually.
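
The sizes a drive advertises can be read straight from sysfs, which is a quick way to compare them with what the tools later assume. What follows is a minimal sketch and was not part of the tests above: the function name report_block_sizes and its optional sysfs-root argument are illustrative assumptions, while queue/logical_block_size and queue/physical_block_size are the standard kernel attributes.

```shell
#!/bin/sh
# Print the logical and physical block sizes the kernel reports for a disk.
# $1: device name without /dev/ (e.g. sda); $2: sysfs root (defaults to /sys).
# The function name and the second argument are illustrative assumptions.
report_block_sizes() {
    dev=$1
    sysroot=${2:-/sys}
    queue="$sysroot/block/$dev/queue"
    if [ ! -r "$queue/logical_block_size" ] || [ ! -r "$queue/physical_block_size" ]; then
        echo "$dev: block sizes not available under $queue" >&2
        return 1
    fi
    logical=$(cat "$queue/logical_block_size")
    physical=$(cat "$queue/physical_block_size")
    echo "$dev logical=$logical physical=$physical"
}

# Example: report_block_sizes sda
```

A drive that reports physical=512 is not necessarily telling the truth, so this only shows what the tooling sees, not what the flash does internally.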

The rabbit hole seems to go deeper than that, since some SSDs (based on TLC flash cells) seem to perform better if alignment is based on the erase block size.

Should we invest some time figuring out better installation defaults than 512 bytes (good for really old HDDs), when newer HDDs and SSDs should use 4096 bytes by default for better performance?

renehoj · March 14, 2022, 2:47pm

The 860 and 870 have had some fixes recently

Insurgo · March 14, 2022, 3:10pm

@renehoj Thanks for the link.

I would understand this impacting operations for an unlocked LUKS container from the OS for daily operations, which was formatted and configured to trigger trimming operations down to the disk firmware from the kernel, but not for cryptsetup-reencrypt operations? But maybe I am misinformed/misreading?

renehoj · March 14, 2022, 3:21pm

Trim probably doesn’t matter but what about native command queuing?

Insurgo · March 14, 2022, 4:07pm

@renehoj Will edit this post with results later for tests on EVO devices (EVO PRO 860 and EVO 870) and before/after application of this patch. I was not aware of that bug. Was merged upstream there. Thanks!

Since Heads is responsible for the OEM->User re-encryption of the encrypted device (the old tests were done on an older Clonezilla, for unattended OEM->User disk re-encryption and uniqueness of disk images and keys on shipped laptops), I will check that the kernel they currently use fixes the issue and retest there as well (in my TODOs).

But the question is not tied to specific devices like the EVO PRO 860/EVO 870 or the MX500; it points to a greater problem. It seems that most SSDs lie about their real physical sector sizes, end up with partition tables misaligned for optimal SSD performance, and get partitions created with improper block and sector sizes, because optimal values are currently not taken into consideration when the partition table and partitions are created under Qubes and other OSes. Bigger impacts can be seen in non-random read and write use cases that fill the SSD caches and where writes actually lower performance. Important note here, by the way: in the tested use case cryptsetup-reencrypt enforces direct IO, so I’m not even sure the libata patch would come into play (again, past tests using buffered IO lowered performance and were dismissed): cryptsetup-reencrypt --use-directio -B64 /dev/device --key-slot 0

The improvement noticed is on an unremarkable old Toshiba SSD, which reports a sector size of 512 bytes through smartctl and where calling cryptsetup luksFormat --sector-size=4096 improves re-encryption speed. So it seems that misalignment is a more general performance culprit, and the question is whether we should investigate this more deeply.

The question is pretty general and concerns automatic partitioning by the Qubes installer for SSD drives specifically: how relevant partition alignment is for performance, given that we cannot rely on what those SSDs report, which is what tools use to optimize automatically (and even that only works, if the device is not lying, with cryptsetup 2.4+, as per the ArchLinux article referred to above), while devices still report block sizes of 512 (both logical [legacy] and physical).

Reposting ArchLinux article on SSD partitioning tweaks

Does investigating this make sense?

51lieal · March 15, 2022, 10:15am

It makes sense, and I’m interested in testing 512 and 4096 performance in the VM.

Qubes uses the anaconda installer, much of which comes from upstream. As for your request above, I’ve seen someone propose this and it was accepted in Fedora 35, so we can only hope the Qubes devs will move dom0 to 35+ in 4.2 testing.

Insurgo · March 15, 2022, 6:03pm

@51lieal unfortunately, trimming (discard) and disk performance are not testable in a “vm”, and manual tweaks are required to test performance differences, since VM performance depends on the LVM volumes created (sectors and sizes), the block sizes of the volume group, and then the LUKS sector and block sizes (most important here, from my understanding), all of which need to be coherent with what the firmware does internally. I’m no expert here, but my observations seem to confirm the intuition that disks performing better at re-encryption have a properly reported block size as seen from the OS:

[user@dom0 ~]$ cat /sys/block/sda/queue/physical_block_size 
4096

Proper testing seems to require redoing an aligned partition table and an aligned LUKS partition, and then reinstalling (LVM partitioning also seems to matter).
Yet again, on a faster-performing drive, the LVM volumes were properly created with better alignment:

[user@dom0 ~]$ cat /sys/block/dm-142/queue/physical_block_size 
4096

It also seems to require not reinstalling directly from the installer, but doing some of the actions first from the terminals available in the Qubes installer before going forward, and then having the installer “rescan” the disks so that manual partitioning takes into account what was done outside of it.

Hey! Just saw that you are the user behind the post I was going to refer: 4.1 installer LVM partitioning - hard to customize, missing space - #5 by 51lieal

Basically, apply the following changes to test:

Run cryptsetup-reencrypt from a live CD. Note the total time and the speed in MiB/s from the final output.
Make an aligned partition table in expert mode (just notes): parted -a optimal /dev/sda mklabel gpt, with a block size of 4096 bytes (default alignment, not tuned for specific manufacturers).
cryptsetup -c aes-xts-plain64 -h sha512 -s 512 --use-random -y -i 10000 --sector-size 4096 luksFormat /dev/sda2
Follow through with the rest of the instructions in your referred guide to prepare custom partitions.
Install the system, and make sure the LVM volumes created are good by booting it.
Run cryptsetup-reencrypt again from the live CD, compare the performance and report results.
Adapt the values above, which otherwise seem to be aligned for sector and block sizes of 512 in current observations, as reported in dom0 by: cat /sys/block/sda/queue/physical_block_size
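
Whether a partition actually starts on a 4096-byte boundary can be checked before and after repartitioning. The sketch below is only an illustration of that arithmetic: the function name is_aligned is an assumption, and it relies on the kernel exposing each partition's start offset in 512-byte units in /sys/block/<disk>/<partition>/start, whatever sector size the drive reports.

```shell
#!/bin/sh
# Succeed if a partition start offset is a multiple of the wanted alignment.
# $1: start offset in 512-byte units, as exposed by /sys/block/<disk>/<part>/start
# $2: alignment in bytes (e.g. 4096, or an erase block size if it is known)
# The function name is an illustrative assumption.
is_aligned() {
    start_units=$1
    align_bytes=$2
    [ $(( (start_units * 512) % align_bytes )) -eq 0 ]
}

# Example (device names are placeholders):
#   is_aligned "$(cat /sys/block/sda/sda2/start)" 4096 && echo aligned
```

A start of 2048 units (1 MiB) passes for 4096 bytes, while the legacy start of 63 units does not. This only covers partition alignment; the LUKS sector size and the filesystem block size have to be checked separately.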

So as of now, I can already see from observations that some SSDs do not report their block sizes properly, and that the tooling takes the block size reported by the upper layer and applies it down the whole chain during the installer’s automatic partitioning.

It also seems that my particular problem came from suboptimal sector/block sizes in the initial install’s partitioning, which were then cloned from one disk to another.

Those are notes… not truth. Only further experimentation will confirm or refute this hypothesis. Others who changed only sector sizes, block sizes or alignment on the same hardware with the same SSD have witnessed major performance gains (no VM/Qubes usage reported, though), without re-encrypting their disks either (re-encryption obviously exercises SSD hardware differences such as caches, erase block sizes and firmware optimizations). So testing seems to be the only way to validate this, on the same computer with the same SSDs: I will clone disks with different reported block sizes (512 vs 4096) onto each other, and report the variations in observed performance (cryptsetup-reencrypt being my personal meter).

51lieal · March 15, 2022, 6:31pm

I’ve already figured out how to set up the drive. Do you have an idea of what kind of test to run?

Insurgo · March 15, 2022, 6:44pm

In between tests.

Without IO cache from the operating system (DirectIO):
cryptsetup-reencrypt --use-directio -B64 /dev/device --key-slot 0

With IO caching from the operating system:
cryptsetup-reencrypt /dev/device --key-slot 0

51lieal · March 15, 2022, 6:53pm

I think I would run 3 tests:

boot speed
vm benchmark
your re-encrypt test

51lieal · March 16, 2022, 11:39am

Yesterday I tried 6 installations, and 4 of them were unsuccessful. Before each installation I zeroed the drive with dd, ensuring no data remained.

With xfs and a 512 sector size, everything works out of the box.
3 failed attempts with xfs and a 4096 sector size, and 1 with ext4. I did a short investigation but it wasn’t helpful. Dom0 is fine, but I can’t find any working DomU (there’s an error in the initial setup).
btrfs with 4096 is fine, but I haven’t benchmarked it yet.

Everything uses the default configuration.

512 benchmark :

CPU : I7-10750H
Storage : WD SN 730 512GB
File System : LVM-XFS

Linux dom0 5.10.90-1.fc32.qubes.x86_64 #1 SMP Thu Jan 13 20:46:58 CET 2022 x86_64 x86_64 x86_64 GNU/Linux

# Boot speed
Startup finished in 4.897s (firmware) + 2.523s (loader) + 2.946s (kernel) + 8.787s (initrd) + 3.705s (userspace) = 22.861s
Startup finished in 4.868s (firmware) + 2.513s (loader) + 2.938s (kernel) + 8.817s (initrd) + 3.765s (userspace) = 22.902s
Startup finished in 4.874s (firmware) + 2.511s (loader) + 2.945s (kernel) + 8.255s (initrd) + 3.732s (userspace) = 22.318s

# VM Boot
6.24
4.81
4.68
5.14
4.84

# Cryptsetup-reencrypt
Finished, time 15:24.011, 486745 MiB written, speed 526.8 MiB/s

I have found that cryptsetup-reencrypt without --use-directio is horrible: speed is under 50 MiB/s, so I just skipped it. FYI, I use my main Qubes as the host to re-encrypt and to change the NVMe LBAF (dual boot).

51lieal · March 16, 2022, 9:00pm

4096 benchmark :

CPU : I7-10750H
Storage : WD SN 730 512GB
File System : BTRFS+blake2b

Linux dom0 5.10.90-1.fc32.qubes.x86_64 #1 SMP Thu Jan 13 20:46:58 CET 2022 x86_64 x86_64 x86_64 GNU/Linux

# Boot speed
Startup finished in 4.898s (firmware) + 2.499s (loader) + 2.878s (kernel) + 7.922s (initrd) + 3.489s (userspace) = 21.688s
Startup finished in 4.878s (firmware) + 1.405s (loader) + 2.882s (kernel) + 7.936s (initrd) + 3.523s (userspace) = 20.626s
Startup finished in 4.889s (firmware) + 1.405s (loader) + 2.881s (kernel) + 7.817s (initrd) + 3.524s (userspace) = 20.518s

# VM Boot
5.72
4.48
4.62
4.48
4.62

# directio
Finished, time 11:17.770, 486745 MiB written, speed 718.2 MiB/s

# no directio
Finished, time 12:32.743, 486745 MiB written, speed 646.6 MiB/s

I’m surprised by the result without the directio option. I’ll update in another thread after finding out how to configure lvm+xfs / ext4 with a 4096 sector size. The questions are:

Why does performance drop so much with xfs (512 sector size)? 600 to 50 is a huge difference.
How to make a 4096 sector size work with lvm+xfs / ext4?
If Qubes plans to use Fedora 35+ or another distro in 4.2 testing, should the devs promote using a 4096 sector size by default? In Fedora 35+, anaconda automatically uses a 4096 sector size only if the drive is already using a 4096 sector size. (Question for the Qubes team.)

In conclusion, using a 4096 sector size is highly recommended; modern hardware gains a lot of benefit from it.

Insurgo · March 16, 2022, 10:28pm

In my experiments, relaunching blivet advanced partitioning after a failed attempt reports a bunch of wrong partitions, all corresponding to templates. Interestingly, all private volumes were fine.

To reproduce my basic initial test result, I simply created the LUKS container from ctrl-alt-f2 on /dev/sda2 with luksFormat, opened it with luksOpen, then asked the Q4.1 installer over ctrl-alt-f6 to rescan the disk before doing automatic partitioning, reclaiming space.

The hint here is that the template RPM install scripts may be faulty when deploying raw images into the corresponding LVM volumes on the default partitioning scheme? Otherwise, partitions created by the second-stage install, apart from the installed templates, seem fine.

@marmarek some hints?

51lieal · March 16, 2022, 10:55pm

I think not, because my btrfs installation is fine, but I still have a workaround.

And can you give more details about this?

Does this mean you have successfully installed with lvm+xfs / ext4?

Insurgo · March 16, 2022, 11:00pm

@51lieal it is expected that buffered IO should behave better than directio if every assumption the tools make is right. There seems to be something in your successful test case, and I’m confused by the results.

The re-encryption itself doesn’t know anything about the underlying partitions. The speed results should reflect only the sector size applied to the LUKS container, nothing else.

Insurgo · March 16, 2022, 11:12pm

No, the default partition scheme (thin LVM over cryptsetup) failed. In my use case, I am not ready to give up on wyng-backups, which relies on thin LVM.

My tests are really inconclusive for the moment.

I’m stumped: I don’t understand why the installer can scan the 4096-formatted LUKS volume once it is opened with luksOpen, while the resulting automatic partitioning fails to provision proper templates in the root-related LVM volumes. This is why I tagged @marmarek.

I am not aware of the differences in the code that correctly creates the private volumes while failing to deploy the root volumes linked to the template RPMs.

Why are private volumes and dom0 consistent with the underlying GPT partition table and the cryptsetup-created container aligned to a 4096 sector size, while deployment of the root volumes linked to templates fails?

51lieal · March 16, 2022, 11:30pm

Perhaps because I use Qubes to re-encrypt? I haven’t tried with another OS, but I’m satisfied with the result.

I’ve been playing with the LVM configuration and it still fails. Let’s see if bypassing the initial setup and configuring manually works (even though it doesn’t make sense to me, because the btrfs installation is fine).

51lieal · March 17, 2022, 1:52am

I’ve confirmed this: configuring everything manually works. I really don’t know what is causing this. In the initial setup the error is about libxl failing to add a vif device (which device? this is what confuses me). So the steps you need to do are:

Add the vm-pool.
Install the templates.
Configure everything.

Insurgo · March 18, 2022, 3:28pm

Your tests report cryptsetup-reencrypt results on the same hardware and the same disk, with a different partition table, partition alignment and partition block size, going from 526.8 MiB/s to 718.2 MiB/s with direct IO. That is a big difference for a test of LUKS container alignment alone. This is an important report. As said previously, the re-encryption test shouldn’t care about the actual content of the container. In my experience, speed with direct IO stays pretty steady throughout the re-encryption, leading to the hypothesis that all blocks are read sequentially, re-encrypted and written back to disk, without speeding up on unused blocks or slowing down on used ones. The data seems to be simply translated and rewritten as it goes.

Buffered IO improving massively (under 50 MiB/s initially reported vs 646.6 MiB/s) is also interesting data, showing that better alignment leads to better results, while still showing that something is off. Buffered IO should be better than direct IO, meaning something is still not right (what the hardware reports vs. what is real).
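
The speeds quoted in this thread can be cross-checked from the totals in the cryptsetup-reencrypt summary lines. This is a small sketch using awk for the arithmetic; the function names speed_mib_s and gain_percent are illustrative assumptions, and the inputs are the figures 51lieal posted above.

```shell
#!/bin/sh
# Average throughput in MiB/s from a cryptsetup-reencrypt summary line.
# $1: MiB written; $2: elapsed time as MM:SS.mmm (HH:MM:SS.mmm also works)
speed_mib_s() {
    awk -v mib="$1" -v t="$2" 'BEGIN {
        n = split(t, part, ":")
        secs = 0
        for (i = 1; i <= n; i++) secs = secs * 60 + part[i]
        printf "%.1f\n", mib / secs
    }'
}

# Relative change between two speeds, in percent.
gain_percent() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

speed_mib_s 486745 15:24.011   # 512-byte sectors, direct IO
speed_mib_s 486745 11:17.770   # 4096-byte sectors, direct IO
speed_mib_s 486745 12:32.743   # 4096-byte sectors, buffered IO
gain_percent 526.8 718.2       # direct IO, 512 vs 4096
```

The first three calls reproduce the reported 526.8, 718.2 and 646.6 MiB/s, and the last one puts the direct IO improvement at roughly 36%. Note that the two runs also differ in filesystem (LVM-XFS vs BTRFS), although re-encryption should not care about the container's content.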

@51lieal Can you post the recipe of commands that made the thin LVM scenario work for you?

(A list of the commands that worked for you, just like you did in 4.1 installer LVM partitioning - hard to customize, missing space - #5 by 51lieal, would permit exact reproducibility of results, internal validity and possibly external validity. If we come up with proper adjustments, we could open an issue upstream and challenge others.)

51lieal · March 18, 2022, 9:28pm

This is based on UEFI; for MBR just ignore the EFI parts. Here is your quick setup:

ctrl + alt + f2 when you are at the language setup

dd if=/dev/zero of=/dev/device 
gdisk /dev/device
# you need at least 2 partitions for mbr and 3 for uefi
1. +600MiB
2. +1GiB
3. the rest of remaining space.

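# from here on, /dev/device is presumably the last partition created above (the LUKS one), not the whole disk
cryptsetup -c aes-xts-plain64 -h sha512 -s 512 --sector-size 4096 luksFormat /dev/device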
cryptsetup luksOpen /dev/device luks
pvcreate /dev/mapper/luks
vgcreate qubes_dom0 /dev/mapper/luks
lvcreate -n swap -L xxG qubes_dom0 ( ex : 8G / 16G )
lvcreate -T -L 40G qubes_dom0/root-pool
lvcreate -T -l 90%FREE qubes_dom0/vm-pool
lvs (to check your vm-pool size)
lvcreate -V40G -T qubes_dom0/root-pool -n root ( 20G is not enough, use at least 40G or more )
lvcreate -VxxxG -T qubes_dom0/vm-pool -n vm ( ex : -V60G / -V360G )
mkfs.xfs /dev/qubes_dom0/vm (no need to specify the sector size if your disk already uses 4096)

Haven't tried with ext4, but I think it would work too, since the problem is in the initial setup.

ctrl + alt + f6
Enter the disk selection and rescan, choose the drive, select custom (not blivet), click unknown, and set:

600 MiB > format > EFI partition > /boot/efi > update
1 GiB > format > xfs / ext 4 > /boot > update
40 GiB > format > xfs / ext 4 > / > update
(swap) > format > swap > update
---
leave qubes_dom0/vm 
click done

Configure the items marked in red, then install.

After boot:
Don’t configure anything, click done, and log in.

qvm-pool -a vm lvm_thin -o volume_group=qubes_dom0,thin_pool=vm-pool,revisions_to_keep=2
reboot

confirm vm is the default_pool

qubes-prefs | grep pool ( in 3 installations, vm was automatically the default_pool )
# if not : 
qubes-prefs default_pool vm

Set the default kernel in qubes-global-settings.
Set none for all of the Qubes defaults.

template directory = /var/lib/qubes/template-packages/
Install all templates.
Use salt to configure the VMs.

Qubes is ready to use.
I have updated everything and rebooted, and everything is still good.
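
Before comparing results, it may be worth confirming that the container created with luksFormat in the recipe above really uses 4096-byte sectors. The following is a minimal sketch, not something that was run in this thread: the function name luks_sector_size is an illustrative assumption, and it assumes a LUKS2 header whose luksDump output lists the data segment with a line of the form "sector: 4096 [bytes]".

```shell
#!/bin/sh
# Extract the encryption sector size from `cryptsetup luksDump` output (LUKS2),
# read on stdin. Assumes the data segment is printed with a line of the form
# "sector: 4096 [bytes]"; adjust the pattern if your version prints differently.
# The function name is an illustrative assumption.
luks_sector_size() {
    awk '$1 == "sector:" { print $2; exit }'
}

# Example (device path is a placeholder, needs root):
#   cryptsetup luksDump /dev/device | luks_sector_size
# The opened mapping can be cross-checked with:
#   blockdev --getss /dev/mapper/luks
```

If this prints 512 on a container that was meant to be 4096, the re-encryption figures would not be measuring the intended configuration.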