While troubleshooting a problem with wakeup from sleep on my Lenovo Thinkpad T470s, I realized that SMT was enabled in the BIOS, and even though it’s disabled at runtime by Xen/Qubes, for some reason the BIOS setting broke sleep/wakeup (the computer would go to sleep but need a hard reboot to wake up). Disabling SMT in the BIOS/UEFI fixed the issue, as mentioned in the documentation.
Now, the interesting part is that disabling SMT in the BIOS seems to have caused general performance improvement and, possibly, faster boot times (call me crazy, this could just be my perception and some snake oil). Is this expected? If so, disabling SMT in the BIOS should be indicated more prominently in the documentation (perhaps at the beginning of the installation section?).
If I’m crazy and my perception is incorrect, please forgive my false report. Perhaps I’m so happy to see sleep work properly that I’m starting to see things. I don’t feel like going back to enabling SMT to perform some serious benchmarking, though
@ludovic Thanks for the quick response and for pushing me towards more scientific validation. It seems that I was right, see for yourself:
SMT disabled in the BIOS
Startup finished in 15.718s (firmware) + 3.991s (loader) + 10.076s (kernel) + 9.945s (initrd) + 50.683s (userspace) = 1min 30.415s
graphical.target reached after 50.636s in userspace
SMT enabled in the BIOS
Startup finished in 16.048s (firmware) + 4.161s (loader) + 10.179s (kernel) + 10.404s (initrd) + 1min 43.542s (userspace) = 2min 24.337s
graphical.target reached after 1min 43.488s in userspace
Times seem to be longer across the board. All services take longer to start, according to blame.
Should “disable SMT in your BIOS/UEFI” be more prominently displayed on the documentation then? Not only did it fix my wakeup from sleep problem, it made the system’s performance significantly better too.
Does anyone know if there is an explanation on why disabling SMT in the BIOS/UEFI would make such a dramatic difference in overall performance?
@tzwcfq That is interesting! I see a roughly 40% speedup if I turn SMT off in the BIOS/UEFI, as shown by the boot times (and VM start times too). Perhaps the difference between our experience has to do with the processor generation and/or the overall hardware. As I said above, this is on a Lenovo Thinkpad t470s, with an Intel Core i5 6300u with 12GB of DDR4 RAM in flex configuration (4GB soldered in and 8 GB on a second slot), so it’s partially dual channel. You, on the other hand, are on a high end latest generation CPU, so perhaps it’s an issue with how well your CPU is supported by Xen and Linux in general.
My kernel is older and the officially provided with Qubes 4.1. Now, interestingly enough, even my Windows 10 qube runs significantly faster with SMT turned off in the BIOS, so I don’t think the Linux kernel version really changes much. You may want to look into the Xen version instead.
Of course, it could be the Linux kernel in Dom0 that somehow affects the overall performance of I/O, and surely the linux kernel in your Linux VMs may have something to do with it too when it comes to those VMs, but you should rather look at Xen itself first.
This may be related to the lack of Intel Hardware P-states (HWP) support in Xen: CPU Frequency Scaling Broken · Issue #4604 · QubesOS/qubes-issues · GitHub
Since your CPU doesn’t have HWP support so maybe it’s not affecting you.
I’ve tried to use dom0 based cpufreq instead of xen based cpufreq: Xen power management - Xen
But I got this warning: intel_pstate: CPU model not supported
Then I’ve checked CPU flags and there were no HWP flags: lscpu | grep Flags | tr ' ' '\n' | grep hwp
It seems that Xen hides CPUID leaf 0x06 from dom0. Related patch: xenbits.xen.org Git - xen.git/commitdiff
And this leaf is providing information on HWP support: is intel_pstate working with or without HWP? - Intel Communities
And since kernel in dom0 can’t see that HWP is supported by CPU it’s not loading intel_pstate module.
I’ve tried to add intel_pstate=hwp_broken_firmware kernel command line option but it didn’t help.
Maybe someday I’ll try the patch from linked Qubes issue.
I’ve tried to test boot time in more detail and it doesn’t seems that reliable.
First thing is that boot time is unstable. I think it may be related to the CPU E-cores usage.
And second is that boot time somehow depends on how much CPU cores are assigned to dom0 with dom0_max_vcpus xen cmdline.
It seems that the best boot time is when dom0 has 2 vcpu and boot time is around the same for all 3 BIOS/Xen SMT settings combinations.
But if all vcpu are given to dom0 then with BIOS smt=on xen smt=off boot time is noticeably faster.
Here are my results for multiple runs:
BIOS smt=on xen smt=on
Summary
dom0 has all 24 vcpu (no dom0_max_vcpus specified)
Startup finished in 14.478s (firmware) + 1.148s (loader) + 1.976s (kernel) + 6.949s (initrd) + 1min 1.668s (userspace) = 1min 26.221s
graphical.target reached after 1min 1.655s in userspace
Startup finished in 14.471s (firmware) + 1.280s (loader) + 1.912s (kernel) + 8.297s (initrd) + 57.730s (userspace) = 1min 23.693s
graphical.target reached after 57.703s in userspace
Startup finished in 14.477s (firmware) + 1.561s (loader) + 2.015s (kernel) + 7.611s (initrd) + 53.100s (userspace) = 1min 18.765s
graphical.target reached after 53.076s in userspace
Startup finished in 14.473s (firmware) + 1.180s (loader) + 1.986s (kernel) + 6.757s (initrd) + 57.199s (userspace) = 1min 21.598s
graphical.target reached after 57.167s in userspace
dom0 has 16 vcpu (dom0_max_vcpus=16)
Startup finished in 14.479s (firmware) + 1.166s (loader) + 1.850s (kernel) + 6.442s (initrd) + 54.048s (userspace) = 1min 17.987s
graphical.target reached after 54.007s in userspace
Startup finished in 14.473s (firmware) + 1.114s (loader) + 1.843s (kernel) + 6.923s (initrd) + 59.794s (userspace) = 1min 24.148s
graphical.target reached after 59.755s in userspace
dom0 has 8 vcpu (dom0_max_vcpus=8)
Startup finished in 14.468s (firmware) + 1.179s (loader) + 1.936s (kernel) + 6.423s (initrd) + 45.021s (userspace) = 1min 9.029s
graphical.target reached after 45.005s in userspace
dom0 has 4 vcpu (dom0_max_vcpus=4)
Startup finished in 14.477s (firmware) + 2.358s (loader) + 1.863s (kernel) + 5.855s (initrd) + 45.670s (userspace) = 1min 10.225s
graphical.target reached after 45.649s in userspace
dom0 has 2 vcpu (dom0_max_vcpus=2)
Startup finished in 14.480s (firmware) + 1.066s (loader) + 1.927s (kernel) + 5.962s (initrd) + 35.097s (userspace) = 58.535s
graphical.target reached after 35.075s in userspace
Startup finished in 14.475s (firmware) + 1.133s (loader) + 1.820s (kernel) + 5.950s (initrd) + 34.694s (userspace) = 58.074s
graphical.target reached after 34.684s in userspace
dom0 has 1 vcpu (dom0_max_vcpus=1)
Startup finished in 14.467s (firmware) + 1.182s (loader) + 1.900s (kernel) + 7.829s (initrd) + 44.180s (userspace) = 1min 9.560s
graphical.target reached after 44.154s in userspace
Startup finished in 14.471s (firmware) + 1.251s (loader) + 1.952s (kernel) + 7.360s (initrd) + 44.826s (userspace) = 1min 9.862s
graphical.target reached after 44.799s in userspace
BIOS smt=on xen smt=off
Summary
dom0 has all 16 vcpu (no dom0_max_vcpus specified)
Startup finished in 14.465s (firmware) + 1.229s (loader) + 1.970s (kernel) + 5.721s (initrd) + 42.034s (userspace) = 1min 5.421s
graphical.target reached after 42.022s in userspace
Startup finished in 14.473s (firmware) + 1.182s (loader) + 1.836s (kernel) + 6.538s (initrd) + 46.821s (userspace) = 1min 10.853s
graphical.target reached after 46.810s in userspace
Startup finished in 14.476s (firmware) + 2.441s (loader) + 2.053s (kernel) + 9.937s (initrd) + 38.799s (userspace) = 1min 7.709s
graphical.target reached after 38.789s in userspace
Startup finished in 14.481s (firmware) + 1.196s (loader) + 1.953s (kernel) + 6.067s (initrd) + 43.598s (userspace) = 1min 7.297s
graphical.target reached after 43.590s in userspace
dom0 has 8 vcpu (dom0_max_vcpus=8)
Startup finished in 14.476s (firmware) + 1.197s (loader) + 1.820s (kernel) + 5.173s (initrd) + 35.903s (userspace) = 58.571s
graphical.target reached after 35.895s in userspace
dom0 has 4 vcpu (dom0_max_vcpus=4)
Startup finished in 14.469s (firmware) + 1.311s (loader) + 1.819s (kernel) + 5.198s (initrd) + 34.018s (userspace) = 56.818s
graphical.target reached after 34.010s in userspace
dom0 has 2 vcpu (dom0_max_vcpus=2)
Startup finished in 14.464s (firmware) + 1.247s (loader) + 1.925s (kernel) + 5.808s (initrd) + 32.177s (userspace) = 55.623s
graphical.target reached after 32.160s in userspace
Startup finished in 14.481s (firmware) + 1.180s (loader) + 1.910s (kernel) + 5.587s (initrd) + 34.100s (userspace) = 57.260s
graphical.target reached after 34.087s in userspace
dom0 has 1 vcpu (dom0_max_vcpus=1)
Startup finished in 14.476s (firmware) + 1.743s (loader) + 1.943s (kernel) + 7.411s (initrd) + 43.706s (userspace) = 1min 9.281s
graphical.target reached after 43.678s in userspace
BIOS smt=off xen smt=off
Summary
dom0 has all 16 vcpu (no dom0_max_vcpus specified)
Startup finished in 13.935s (firmware) + 1.834s (loader) + 1.854s (kernel) + 6.726s (initrd) + 1min 747ms (userspace) = 1min 25.096s
graphical.target reached after 1min 713ms in userspace
Startup finished in 14.459s (firmware) + 1.150s (loader) + 1.855s (kernel) + 6.844s (initrd) + 51.702s (userspace) = 1min 16.013s
graphical.target reached after 51.679s in userspace
Startup finished in 14.463s (firmware) + 1.315s (loader) + 1.852s (kernel) + 7.009s (initrd) + 50.996s (userspace) = 1min 15.637s
graphical.target reached after 50.981s in userspace
Startup finished in 14.446s (firmware) + 2.775s (loader) + 1.953s (kernel) + 11.078s (initrd) + 51.386s (userspace) = 1min 21.641s
graphical.target reached after 51.364s in userspace
Startup finished in 14.455s (firmware) + 1.613s (loader) + 1.853s (kernel) + 6.743s (initrd) + 52.357s (userspace) = 1min 17.024s
graphical.target reached after 52.326s in userspace
Startup finished in 14.459s (firmware) + 1.348s (loader) + 1.844s (kernel) + 6.985s (initrd) + 49.663s (userspace) = 1min 14.301s
graphical.target reached after 49.625s in userspace
Startup finished in 13.929s (firmware) + 1.253s (loader) + 1.921s (kernel) + 7.170s (initrd) + 58.594s (userspace) = 1min 22.869s
graphical.target reached after 58.560s in userspace
Startup finished in 14.465s (firmware) + 1.017s (loader) + 1.959s (kernel) + 7.183s (initrd) + 58.360s (userspace) = 1min 22.987s
graphical.target reached after 58.323s in userspace
dom0 has 8 vcpu (dom0_max_vcpus=8)
Startup finished in 14.446s (firmware) + 1.279s (loader) + 1.821s (kernel) + 6.579s (initrd) + 46.048s (userspace) = 1min 10.174s
graphical.target reached after 46.032s in userspace
dom0 has 2 vcpu (dom0_max_vcpus=2)
Startup finished in 14.427s (firmware) + 1.279s (loader) + 1.917s (kernel) + 6.493s (initrd) + 38.787s (userspace) = 1min 2.907s
graphical.target reached after 38.774s in userspace
Startup finished in 14.446s (firmware) + 1.152s (loader) + 2.057s (kernel) + 6.038s (initrd) + 34.954s (userspace) = 58.649s
graphical.target reached after 34.936s in userspace
dom0 has 1 vcpu (dom0_max_vcpus=1)
Startup finished in 14.467s (firmware) + 1.150s (loader) + 1.848s (kernel) + 7.492s (initrd) + 45.275s (userspace) = 1min 10.234s
graphical.target reached after 45.252s in userspace
Hyper-threading can deliver a performance improvement through keeping CPU units evenly busy, especially in long pipelines, but in my experience the gain was never higher than 15 or 20% anyway (despite CPU manufacturers’ claims). However, it does seem to bring some potentially serious drawbacks (for example, cache leaking, etc.). Now, looking at the big picture, you could say the same about superscalar CISC CPUs and long pipelines (for example, issues with leaking instructions and data due to speculative execution and branch prediction, etc.). In sum, there is no free lunch and maybe we should just return to simpler RISC CPU designs, but that doesn’t scale either because you can’t just increase frequency, reduce size and increase power forever, can you?
I wanted to check if this affected my boot time, but there seems no reliable way to test as the boot time differs all the time. Did 3 tests without changing any settings and the boot times vary greatly.
I repeated the boot benchmark a few times and I have that 40% speedup fairly consistently with SMT disabled in BIOS. Most of the time savings seem to be around starting VMs, as you can see from the numbers below. Again, this is on a Thinkpad t470s with a 6300u Skylake processor, with all of the latest patches and Qubes OS 4.1, in case that matters.