Hello everyone.
SSDs are not a new technology, but even nowadays some tweaking seems to be needed to set them up for best performance, and those tweaks are still not applied by default. Reported performance gains are massive: some report random read gains ranging from 100 MiB/s to a gigantic 500 MiB/s. I have not been able to replicate this so far, but I wonder whether it should be investigated by willing testers, me included.
I’m writing this post for numerous reasons, the first being to have knowledgeable people jump aboard and correct the facts, maybe even from upstream linked documentation.
The reason why this is important, at least for Insurgo, is that a deployed OEM disk image is expected to be re-encrypted by the OEM (so that the disk encryption key is unique) and then again by the end user upon reception of the hardware. That way, even if the OEM kept a copy of the LUKS header, and even under a gag order, the OEM could not disclose a header that authorities could restore and then unlock by typing the OEM-provisioned Disk Recovery Key passphrase. Re-encrypting drives is probably the only case where massive gains would be directly observable, since it exercises every level that matters: the reads are not random, the SSD's cache is stretched to its limit reading chunks in advance, the data is re-encrypted in RAM and then written back to disk, and those writes run into the firmware-specific physical block sizes. Erase Blocks, “sector” sizes and partition alignment all determine whether each read and write operation forces the firmware to touch other blocks.
With newer basic testing on Q4.1 installations, it came to my attention that older SSDs (unknown Chinese OEMs included), meaning the old refurbished drives that came with received laptops (not EVO PRO 860 or other really well-performing SSDs), were actually performing way better than before at re-encrypting the disk through cryptsetup-reencrypt. And by a big margin: 80 MiB/s became 145 MiB/s! But why?
It reopened an old, forgotten can of worms for me. In the past, I made a proof of concept to network clone an OEM disk image across multiple devices using clonezilla in BT cloning mode. The cloning operation worked great in all cases, speed being limited mainly by Ethernet cable quality and switch performance. But at the cryptsetup-reencrypt step, run from the clients that had just received a perfect copy of the OEM disk image (in dd mode, since nothing inside the LUKS container can be interpreted, so the input is written unmodified to the destination), the cryptsetup-reencrypt call had really, really poor performance unless the source and destination drives were of the same brand. The clonezilla tool stack was thought to be to blame, since a physical clone with a hardware cloner (a black box here, unfortunately) did not show the same behavior. Clonezilla was thought to be changing something, somewhere. I stopped investigating there since I had a working solution, even though it did not scale as well as clonezilla would: in clonezilla's BitTorrent mode, all nodes validate downloaded chunks and swarm locally, so the only limit is network switch performance and the cloning of the drives is completely unattended. The dream was good, but it performed correctly only if identical disks were used.
I banged my head on this for a really long time, opening tickets with clonezilla and trying to understand what was happening, and simply gave up without understanding why re-encryption performed so poorly when disks were cloned across SSDs from different manufacturers. That was at the time of the Q4.0.x releases, which means different and older tools used by anaconda at install: older parted, cryptsetup and mkfs toolsets. Those tools make decisions based on what is reported to the OS at the moment the partition table and the partitions themselves are created, which seems to explain why disk images cannot be copied between drives that do not report the same sector sizes and that require different alignment for best performance. Again, the performance difference showed up radically only when re-encrypting the drives. Day-to-day operations did not seem to be affected for random reads and writes, the SSD firmware being able to write blocks in a delayed manner that did not seem to impact IOs. One thing is sure: the layout of the LUKS container, which dictates what is actually read from and written to disk, really impacts re-encryption, where the firmware is not dealing with random reads/writes. The old Q4.0 image having a LUKS alignment of 512 bytes seems to be the culprit. It should also be noted that, as warned under the dm-crypt section here, it is not possible to simply call
cryptsetup reencrypt --sector-size=4096 device
on an existing LUKS container: it is true that the filesystems inside the encrypted container will be broken. Funny enough, all templates got corrupted, while private LVM volumes were fine. Another can of worms.
Now the interesting bits following this tuning article rabbit hole:
- SSD drives tested (Crucial MX500, Samsung EVO 870, Samsung EVO 860 PRO) report different native sector sizes, which are mostly wrong. 512 may have been true for older drives (the ones tested in my quick Q4.1 install), but newer drives should report 4096 to be properly aligned.
- cryptsetup 2.4.0 is supposed to be able to detect and take advantage of a properly reported native sector size (the Q4.1 installer has 2.3.3, so this is not a feature we could take advantage of even if the drive was reporting its block size properly under /sys/block/sdb/queue/physical_block_size). This can be changed at the cryptsetup luksFormat call to force 4096-byte alignment; otherwise the LUKS partition is aligned to 512 bytes by default.
- mkfs calls are also aligned to 512 bytes unless 4096-byte alignment is specified manually.
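For testers who want to see what their drive actually reports, here is a minimal Python sketch that reads the block sizes the kernel exposes in sysfs. The function name read_block_sizes and the sysfs_root parameter are my own choices for illustration (the parameter only exists so the function can be pointed at a fake directory for testing); the two file names are the standard queue attributes.

```python
from pathlib import Path


def read_block_sizes(device, sysfs_root="/sys/block"):
    """Return the logical and physical block sizes (in bytes) reported for a device.

    `device` is a kernel block device name such as "sdb" (no /dev/ prefix).
    `sysfs_root` is a hypothetical override, only useful for testing.
    """
    queue = Path(sysfs_root) / device / "queue"
    sizes = {}
    for name in ("logical_block_size", "physical_block_size"):
        # Each attribute file contains a single decimal number and a newline.
        sizes[name] = int((queue / name).read_text().strip())
    return sizes
```

Calling read_block_sizes("sdb") on a running system returns a dict with both values. Keep in mind that this only shows what the drive reports, which, as noted above, may not match what the flash really uses internally.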
The rabbit hole seems to go deeper than that, since some SSDs (TLC flash cell based) seem to perform better if alignment is based on Erase Block size.
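To make the alignment question concrete, here is a small Python sketch that checks whether a partition start offset falls on a given boundary. It assumes the start value comes from sysfs (for example /sys/block/sdb/sdb1/start), which is expressed in 512-byte units. The function names and the 4 MiB erase block size used in the example are assumptions for illustration only: drives generally do not report their erase block size, so that number has to come from a datasheet or a guess.

```python
SYSFS_UNIT = 512  # sysfs "start" values are expressed in 512-byte units


def is_aligned(start_units, boundary_bytes):
    """True if a start offset (in 512-byte units) falls on a boundary_bytes boundary."""
    return (start_units * SYSFS_UNIT) % boundary_bytes == 0


def alignment_report(start_units, boundaries):
    """Check one start offset against several named boundaries (sizes in bytes)."""
    return {name: is_aligned(start_units, size) for name, size in boundaries.items()}


# Example: a partition starting at unit 2048 (1 MiB), checked against a
# 4096-byte sector and an ASSUMED 4 MiB erase block.
report = alignment_report(2048, {"4K sector": 4096, "4 MiB erase block": 4 * 1024 * 1024})
```

With those inputs the partition is aligned for 4096-byte sectors but not for the assumed 4 MiB erase block, which is the kind of mismatch that would be worth testing on real hardware.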
Should we invest some time figuring out better installation defaults than 512 bytes (good for really old HDDs), given that newer HDDs and SSDs should use 4096 bytes by default for better performance?