When I start a btrfs scrub on multiple HDDs (SATA in dom0 + LUKS unlocked in an AppVM), Qubes freezes completely. Maybe the mouse still moves with extreme delay, but that's it.
I am on AMD and have 13 drives; they are all btrfs, but separate filesystems (no RAID or anything).
I manage the disks manually with bash scripts (btrbk etc., which works really well).
But when I try to scrub more than one drive at a time, Qubes freezes completely.
This really bothers me; other disk operations seem to work fine.
Does anybody have any tips on how to improve this? Qubes sometimes freezes in other situations as well, but unfortunately this is the only operation where I can reliably replicate it.
What kind of disks are these? Spinning disks? NVMe?
Some half-baked ideas about things you can look into:
If NVMe, check modinfo nvme and modinfo nvme_core and review the optional parameters controlling things like the number of queues as well as the operating mode (e.g. poll mode); see the example commands after this list
If NVMe, tweak the parameters mentioned in the previous item at runtime via sysfs, if you find some that help but you don't want them active all the time
Look into I/O priority for these scrub tasks. I'm guessing scrub I/O operations take place in kernel threads; I'm not knowledgeable about how exactly you can tune them, probably some LLM knows
Look into whether the most appropriate interrupt type is being used. I'm not an expert in low-level OS / interrupts, but as I understand it there are "old" style interrupts, MSI and MSI-X. Most NVMe devices support MSI; you could make sure that's being used. Try cat /proc/interrupts
Look through xl dmesg and dmesg (journalctl -b I think?) to see if there are any messages at boot about failing to allocate / set up anything related to these drives
Do the same during or after the scrub operation, looking for anything obviously wrong
Try using the perf read workqueue and/or perf write workqueue LUKS2 flags. Check here and here for some info (control-f + perf on the pages)
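To make those checks concrete, here's roughly what I'd run (dom0 for the SATA side, the AppVM for the LUKS-backed drives; the grep patterns are just a starting point):
$ modinfo -p nvme                                        # list tunable module parameters (queue counts, poll_queues, ...)
$ modinfo -p nvme_core
$ grep -iE 'nvme|ahci' /proc/interrupts                  # which interrupt type the storage controllers got
$ sudo lspci -vv | grep -iA2 'msi'
$ sudo xl dmesg | grep -iE 'fail|error'                  # boot-time complaints on the Xen side
$ sudo journalctl -b | grep -iE 'nvme|ata|btrfs|error'   # same for Linux; re-run during a scrub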
I know these aren't solutions, but at least you can investigate a little. Happy to help diagnose anything you paste here; I spent a good deal of time working on I/O performance on my own system, so maybe I learned something helpful.
I should acknowledge @solene's comment: these operations are inherently going to hammer the drives with I/O. Maybe reducing the priority of the I/O threads would help (again, I don't know how to do this for btrfs kernel threads). But rather than degrading the system less for longer, her idea (doing them off hours) or staggering them is a better solution.
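If you go the staggered/serial route, a minimal sketch would be a loop over the mount points, letting -B block until each scrub finishes (the paths are placeholders):
for mnt in /mnt/disk01 /mnt/disk02 /mnt/disk03; do
    btrfs scrub start -B -c 3 -n 7 "$mnt"   # one scrub at a time, idle I/O class
done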
Okay, these are not that difficult, but even with -c 3 -n 7 (lowest prio possible) the system still freezes with 10 simultaneous scrubs. (Maybe I have to use a different scheduler?)
Okay, yeah, setting the target VM to HVM, using that qube's own kernel, then setting the scheduler for all the devices to bfq + -c 3 -n 7 + limiting the scrub to 100M definitely helps.
I am not sure, though, if this is done correctly.
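For reference, per device it was roughly this (sdX and the mount point are placeholders), plus the 100M bandwidth cap:
$ echo bfq | sudo tee /sys/block/sdX/queue/scheduler   # switch the disk to the bfq scheduler
$ sudo btrfs scrub start -c 3 -n 7 /mnt/diskX          # scrub at the lowest idle I/O priority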
And the system still freezes.
Will have to try applying this now:
Try using the perf read workqueue and/or perf write workqueue LUKS2 flags. Check here and here for some info (control-f + perf on the pages)
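If I'm reading the cryptsetup docs right, that would be something like this in the AppVM that holds the LUKS devices (the mapping name is a placeholder; needs a LUKS2 header, cryptsetup >= 2.3.4 and kernel >= 5.9):
$ sudo cryptsetup refresh --persistent \
      --perf-no_read_workqueue --perf-no_write_workqueue \
      luks-data01   # re-activates the open mapping and stores the flags in the LUKS2 header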
Last note: I noticed this on the link @solene sent. I think you would want a value lower than 100MB, though.
Since Linux 5.14 it's possible to set the per-device bandwidth limits in a BTRFS-specific way using files /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max. This setting is not persistent, lasts until the filesystem is unmounted. Currently set limits can be displayed by command btrfs scrub limit.
$ echo 100m > /sys/fs/btrfs/9b5fd16e-1b64-4f9b-904a-74e74c0bbadc/devinfo/1/scrub_speed_max
$ btrfs scrub limit /
UUID: 9b5fd16e-1b64-4f9b-904a-74e74c0bbadc
Id  Limit      Path
--  ---------  --------
 1  100.00MiB  /dev/sdx
And I'm still pretty confident that just doing them serially will be better. The issue seems like it might be related to the HBAs more than anything else.
Pretty sure it's a full system lockup. It sometimes happens during regular use as well (most of the time when there is load and I launch something or use something heavy like a game, but sometimes also seemingly at random; I should have more than enough CPU time, though).
Just to be clear, it's only the parallel operation that brings the system to its knees?
Please tell us the exact CPU/motherboard combo and HBA type.
Without knowing anything, I'd guess that your system is suffering from I/O starvation. Did you check how your PCIe lanes are allocated?
For a start I'd check the drive cables and ditch that x1 controller in favour of a used LSI 2008 flashed to IT/strict passthrough mode (Intel M1015s go for around 40 bucks over here on eBay). I would do this anyway, as long as your motherboard and case support a PCIe 2 card of that length. I'd also double-check that every single drive is CMR and that SMART status is OK (quick check below).
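For the SMART part, something like this per drive (sdX is a placeholder, needs smartmontools):
$ sudo smartctl -H /dev/sdX   # overall health verdict
$ sudo smartctl -A /dev/sdX   # full attribute table (reallocated/pending sectors, etc.)
$ sudo smartctl -i /dev/sdX   # model/family, handy for looking up CMR vs. SMR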
Even with that: don't expect miracles. I don't know much about btrfs, but I know ZFS, and scrubbing large (i.e. wide) pools of spinners attached to consumer hardware can take not only hours but days.