Storage configurations that perform best under common Qubes workflows

In 2025, high-speed consumer-level NVMe storage is widely available, and even consumer-level motherboards sometimes offer enough PCIe lanes to put together interesting, non-trivial SSD topologies.

So, we have cheap-ish, high-speed SSD storage. How can we best use it to make a Qubes installation fast?

Common Qubes procedures that are storage IO-bound:

  • qvm-clone
  • qvm-backup
  • qvm-start, to a degree
  • (Others?)

RAID/filesystem options for a set of SSDs:

  • ZFS, e.g. a pool of mirror vdevs (a rough sketch follows this list); but I’ve read that the ZFS performance gain drops off, or even goes negative, when the underlying storage is NVMe SSD rather than SATA HDD, and that the argument for ZFS then shifts away from performance toward its other virtues
  • mdadm, in some configuration (I have no experience here)
  • btrfs, in some configuration (I have no experience here)
  • (Others?)
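
For reference, a rough sketch of the mirror-vdev idea (device and pool names are placeholders; I have not tried this under Qubes):

    # two mirror vdevs striped together; reads and writes are spread across both mirrors
    zpool create -f qubespool \
        mirror /dev/nvme0n1 /dev/nvme1n1 \
        mirror /dev/nvme2n1 /dev/nvme3n1
    # register it with Qubes as a ZFS-backed storage pool
    qvm-pool add --option container=qubespool pool-nvme zfs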

For the common Qubes procedures above, does any complex SSD topology beat the read/write speed of a lone xfs/ext4 PCIe 5.0 x4 NVMe SSD? (Assuming, for the moment, that we’re not too interested in redundancy, compression, snapshotting, or the other features of drive pooling under e.g. ZFS.)

Thanks for sharing your knowledge and opinions.

2 Likes

How fast do you need? There is fast, and fast. It will depend on the person.

3 Likes

First of all, I would like to share some facts (rather than my personal bias).

The following storage scenarios are actively tested via openQA integration tests:

  1. system_tests_basic_vm_qrexec_gui_btrfs
  2. system_tests_basic_vm_qrexec_gui_ext4
  3. system_tests_basic_vm_qrexec_gui_xfs
  4. system_tests_basic_vm_qrexec_gui_zfs

So you have some sort of guarantee that the above configurations should work (there is no such guarantee for other filesystems, for example Bcachefs).

We also have a system_tests_storage_perf test, but it is not performed per individual filesystem (maybe open an issue and suggest that the tests be performed per filesystem). The output could also be more user-friendly, so that users can see for themselves whether there is any noticeable performance gain per filesystem.

Second of all, there are other techniques that will make filesystem-related OS operations noticeably faster. For example, 4.3 will have instantly started disposable VMs: a technique that keeps a few DispVMs preloaded in RAM and ready whenever you need them.

Last but not least, what Solene mentions.

6 Likes

I mean, the fastest would probably be to forgo persistent storage entirely for specific qubes/workflows in favour of a ramdisk. So, less fast than that. :slightly_smiling_face: The fastest solution using persistent storage, I suppose.

I’m working up to building a new workstation with good hardware, splurging a bit, with a goal to minimize the user’s pause/idle/wait-for-completion time to the extent that’s practical. I’ve pinned down my hardware except for storage.

2 Likes

You actually should at least consider some of those features, like ZFS’s ARC. Also, can you imagine ZFS’s dedup being useful!?

Assuming they work with Qubes, of course. Not like I checked, lol.

ARC should help with starting disposables (even if you don’t have enough RAM, you could try using a “lone PCIe 5.0 x4 NVMe SSD” or a RAID 0-like setup for a second-level ARC), while dedup might make cloning pretty much immediate.
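
For the record, attaching a faster device as L2ARC is a one-liner (the pool and device names below are made up):

    # attach a fast NVMe partition as a second-level ARC (L2ARC) cache device
    zpool add qubespool cache /dev/nvme4n1p1
    # detach it again if it turns out not to help
    zpool remove qubespool /dev/nvme4n1p1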

I assume that nothing will help with backups that much, since you should store backups remotely / on a separate drive. Considering cloning and snapshots, do you intend to store some backups locally? If so, why? Just curious.

p.s. Compression is more of a tradeoff of CPU work for I/O speed and storage capacity. Use it if you are willing to trade CPU efficiency for I/O effectiveness (and have spare CPU cycles to spend on it).

p.p.s. IMO mdadm is outdated; you could use LVM directly instead.

Also, I heard that the main benefit of btrfs is its snapshots, but its RAID 5 and RAID 6 support is kind of meh. But please, please check it yourself; I don’t want to ruin its reputation, and I haven’t tested it myself in a while (years).

It also seems like SSD write amplification is caused by an incorrect ashift, but there is no experimental proof in the thread you link to. I spin rust, so there will be no proof from me.

2 Likes

Also, I think I’m wrong about using RAID 0 for L2ARC: you mostly need read speed there, so anything significantly faster than your main storage will do, not specifically RAID 0.

Another benefit of L2ARC is that it can help to improve deduplication performance, but the whole deduplication thing is a questionable decision. Here’s a great read on that:

2 Likes

That’s an interesting point about deduplication. Has anyone tested ZFS with dedup enabled on the volume that hosts VM images? If so, does it indeed speed up e.g. qvm-clone as one would hope, and are the trade-offs tolerable in practice?

(edit: or maybe this isn’t as relevant as I was thinking; my brain momentarily confused deduplication with copy-on-write)

Just trying to frame a question that might get to the heart of what I want to learn. :slightly_smiling_face: But details are relevant too. I will back up remotely to an old NAS (running ZFS).

Regarding secondary ARC, it feels like I only ever see it mentioned, followed by a caveat like “but you probably won’t benefit from this except in specific corner applications”. Would quick-starting disposables be such a case?

I’ve never seen such mentions, but it makes sense: secondary ARC is a compromise; you are better off just having more RAM. But if you don’t, and you have a difference in drive speeds large enough to leverage, why not?

On the other hand, if all your drives are fast NVMe anyway, a significantly faster drive will be expensive. Are you sure that you can’t just get more RAM? RAM is more useful and faster than L2ARC.

1 Like

Thanks for mentioning this. I was unaware that ZFS was in the QA tests.

So I can see that the ZFS test passed. If a user wanted to figure out the topology used (for example, how many redundant disks), is there an easy way to do that?
I see that I can click on the green sphere to get info about the test results, and there is a settings tab there that even tells us the number of disks, but I don’t see anything that points to the definition of the test.

It is only one disk:

Those tests are performed in a (KVM) virtual environment.

1 Like

It should be relevant. COW is about copying data on writes; it helps with power outages and snapshots, but the writes are still there (to be more specific, I think qvm-clone actually writes another copy of the data). Deduplication will do its best to prevent writes, but it depends on the deduplication mechanism in use as well.

For example, block-wise dedup is likely to fail if you’re writing compressed or encrypted files, because the individual blocks aren’t the same. It may or may not work with qube images, even if they aren’t compressed or encrypted. Say the first bits of information change, but not enough bits are added to make another block; the system then tries to write the same image with the bits in its blocks slightly shifted, and block-wise dedup doesn’t do anything. I don’t know if this can happen in Qubes, though. Is the metadata (it must change upon cloning) stored separately from the image? Also, what if dedup is smart enough to suss this out and actually work anyway?

1 Like

I believe making a copy of a template (i.e. “cloning”) in both ZFS and LVM is immediate because the data is not actually copied; it becomes “copy on write”. My evidence for this is: Disk usage numbers in Qube Manager are deceptive · Issue #7535 · QubesOS/qubes-issues · GitHub

Dedup could be useful particularly for templates. If templates are cloned, but then updates are run against each template, the same package gets installed in both, and that data is no longer deduped.

The problem will be that dedup takes up RAM, and RAM on a Qubes system is always at a premium (I have 128 GB of RAM and I still run out). Your deduplication link does a great job of explaining the memory issues. The new dedup sounds like it helps this situation. Another thing that could help would be to have deduplication turned on for a “templates” datastore, and a separate datastore for the AppVMs with dedup turned off.
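
If someone wants to try that split, a minimal sketch would be something like the following (dataset names are just examples, and wiring each dataset into its own Qubes pool is left out; I have not tested this under Qubes):

    # dedup only where it pays off: templates yes, AppVMs no
    zfs create -o dedup=on testpool/templates
    zfs create -o dedup=off testpool/appvms
    # confirm the property on both datasets
    zfs get dedup testpool/templates testpool/appvms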

Offline deduplication could fix the memory problem. There seems to be one guy who has attempted a “ZFS offline deduplication tool”; however:

  • It does say “Provide a warning that deduplicated data incurs large ZFS memory requirement, even after deduplication is done.” With the “old ZFS dedup”, if you ever used dedup it would consume memory even after you turned dedup off; this might be what he is referring to. However, the new dedup that you link to may not have the same issue, so perhaps combining his tool with the new “fast dedup” would allow a solution?
  • I have not looked at how his tool works, so it might not be applicable.
  • It does not seem to be used very much, and is therefore not well tested.

3 Likes

Great! Thank you!

Actually, this is specifically why I think that dedup can be used in Qubes. According to the ZFS manuals, one needs approximately 1.25 GiB of RAM per TiB of storage. Since Qubes OS needs so much RAM in general, Qubes users are more capable of sparing 1-5 GiB. If you have 8 or even 16 GiB overall, 1.25 GiB is a lot. But if you’re using Qubes and have 128 GiB of RAM anyway, 1.25 GiB is approximately 1%, which is much more viable considering the possible benefits.

In fact, I suspect that the caching side of ARC might eat much more memory on a Qubes system than the dedup table. Imagine caching a whole VM.
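
If someone wants hard numbers instead of my suspicion, both are easy to check on a live pool (assuming OpenZFS on Linux; testpool is a placeholder name):

    # dedup table (DDT) histogram, including its in-core size
    zpool status -D testpool
    # current ARC size and its configured ceiling
    grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats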

2 Likes

By the way, I should add: given that it seems to be doing copy-on-write, I have no idea why qvm-clone of big templates takes so long!

(The experience stated is for Qubes with LVM… I have not tried ZFS with Qubes yet.)

1 Like

Thanks. So if I understand correctly, its top-level definition is in templates.json:

{
  "description": "basic tests on ZFS filesystem (zfs pool)",
  "name": "system_tests_basic_vm_qrexec_gui_zfs",
  "settings": [
    {
      "key": "NUMDISKS",
      "value": "2"
    },
    {
      "key": "PARTITIONING",
      "value": "zfs"
    },
    {
      "key": "START_AFTER_TEST",
      "value": "system_tests_update"
    },
    {
      "key": "SYSTEM_TESTS",
      "value": "qubes.tests.integ.basic qubes.tests.integ.vm_qrexec_gui:14400"
    }
  ]
},

main.pm then directs it to “switch_pool.pm” because we set the PARTITIONING=zfs flag.

Then, in switch_pool.pm, we set up the partitioning, and as long as no error is returned, it counts as passing.

} elsif (get_var('PARTITIONING') eq 'zfs') {
    assert_script_run('qubes-dom0-update -y zfs', timeout => 900);
    assert_script_run('modprobe zfs zfs_arc_max=67108864');

    assert_script_run('printf "label: gpt\n,,L" | sfdisk /dev/sdb');
    assert_script_run('zpool create -f testpool /dev/sdb1');
    assert_script_run('qvm-pool add --option container=testpool pool-test zfs');

Then it moves all the templates to the new pool.

Then presumably it runs all the other tests after that, with any templates being pulled from the new pool.

Now, presumably, if we changed NUMDISKS to 5:

      "key": "NUMDISKS",
      "value": "5"

and changed:

    assert_script_run('printf "label: gpt\n,,L" | sfdisk /dev/sdb');
    assert_script_run('zpool create -f testpool /dev/sdb1');

to:

    assert_script_run('printf "label: gpt\n,,L" | sfdisk /dev/sdb');
    assert_script_run('printf "label: gpt\n,,L" | sfdisk /dev/sdc');
    assert_script_run('printf "label: gpt\n,,L" | sfdisk /dev/sdd');
    assert_script_run('printf "label: gpt\n,,L" | sfdisk /dev/sde');
    ### for write performance (striped across all four disks, no redundancy) do this:
    assert_script_run('zpool create -f testpool /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1');
    ### for read performance (a single four-way mirror, 25% usable capacity) do this:
    assert_script_run('zpool create -f testpool mirror /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1');
    ### for 50% redundancy (raidz2) do this:
    assert_script_run('zpool create -f testpool raidz2 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1');

we could test it with 4 drives.
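
One more layout could be added to that list (this is just my suggestion, not something the test currently does): two striped mirror vdevs, which should give both read and write gains plus redundancy at 50% usable capacity. The zpool line (to be wrapped in assert_script_run like the others) would be:

    ### for a balance of read/write speed and redundancy (two striped mirrors, 50% usable capacity):
    zpool create -f testpool mirror /dev/sdb1 /dev/sdc1 mirror /dev/sdd1 /dev/sde1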

Now, I don’t see times/durations getting returned anywhere, but I do see timestamps in some of the logs, so times/durations could be computed manually. For example, the duration of the migration of the templates could serve as a ZFS write speed test (with the confounding factor of LVM read speed being tested at the same time). Presumably, the time it takes a later test to launch and run a VM using a template from ZFS could serve as a ZFS read speed test.
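
For manually computing a duration, something like this works on any Linux box with GNU date (the timestamps below are made up; you would paste the real ones from the log):

    # hypothetical timestamps copied from an openQA log
    start='2025-06-05 10:12:03'
    end='2025-06-05 10:14:47'
    echo "$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) )) seconds"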

However, you mentioned using KVM. This implies that none of the performance tests would actually be valid, because all the “parallel disks” that would be giving the performance gain are actually partitions on the same physical device.

Actually testing for ZFS/btrfs performance enhancements would seem to need a specific hardware setup that I don’t know whether they have.

1 Like

Yes, correct.

hw13 is an MSI MS-7E06 (most probably with Dasharo firmware), which should have 4x Gen4 M.2 slots and 6x 6 Gbps SATA ports. I am not aware whether it has additional drives installed, and even if it does, I am not aware whether the core team would be interested in implementing such tests.

1 Like

See the system_tests_storage_perf@hw1 tests. One of the recent videos is here:

On my (Btrfs) system, qvm-clone spends most of its time cloning the app menus. Probably lots of room for code optimization there. Cloning the storage is pretty much instant already, unless there’s a grotesquely fragmented volume.

4 Likes

I see; the timing/performance results are in the file called system_tests-perf_test_results.txt in the “uploaded logs” section of Qubes OS openQA: qubesos-4.3-pull-requests-x86_64-Build2025060520-4.3-system_tests_storage_perf@hw1 test results

“startup”, “fetch_packages_list”, and “shutdown” are the things that were executed before and after “system_tests” was executed. (Those three were there to create a deterministic (and working) environment just to be able to run “system_tests”.)

I think I’m getting the hang of this.

It would be nice if the individual checks had better names than “AUDfe-0-” and “wait_serial”, though :slight_smile:

The One True Way to speed up backups is to switch to incremental backups. This requires one of the various third-party solutions. For LVM and Btrfs, the most popular/reputable is Wyng. For ZFS, I recommend backup.py (I am incredibly biased and probably the only one who has used it).

1 Like