Thanks for sharing this! On some chips, SMT needs to be on for all cores to enter C-states deeper than C1, which prolongs battery life. With this CPU pinning technique, we can enable SMT while still using only one of the two threads on each core.
I have been looking at cpupools; maybe that's better for the use case where you just want to disable the sibling cores.
Maybe you can just change Pool-0 and it will automatically apply to all qubes.
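I haven't tried it, but for the "disable the sibling threads" idea something along these lines might work from dom0 - a rough, untested sketch (it assumes 2-way SMT, uses the sibling lists Linux exposes under /sys, and xl's cpupool-cpu-remove subcommand):

```python
#!/usr/bin/env python3
# Untested sketch: remove each core's SMT sibling thread from Pool-0 so that
# domains only ever get scheduled on one thread per physical core.
# Assumes 2-way SMT; cores without siblings (e.g. E-cores) are left alone.
import glob
import subprocess

seen = set()
for path in sorted(glob.glob('/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list')):
    with open(path) as f:
        # File contains e.g. "0-1" or "0,1" (or just "6" for a core without SMT)
        text = f.read().strip().replace('-', ',')
    ids = sorted(int(x) for x in text.split(','))
    key = tuple(ids)
    if key in seen:
        continue
    seen.add(key)
    for cpu in ids[1:]:  # keep the first thread, drop its sibling(s) from Pool-0
        subprocess.run(['xl', 'cpupool-cpu-remove', 'Pool-0', str(cpu)],
                       check=False)  # may fail if something is still running there
```

If that works, it would keep SMT enabled (so the deeper C-states stay reachable) while nothing in Pool-0 ever runs on a sibling thread.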
There is also an approach using qubesadmin.events.
The following is a simple example of moving a VM to the P-cores if it has a "performance" tag when it starts up, and to the E-cores otherwise.
```python
#!/usr/bin/env python3
import asyncio
import subprocess

import qubesadmin
import qubesadmin.events

# i5-13600k (smt=off)
P_CORES = '0-5'
E_CORES = '6-13'

tag = 'performance'


def _vcpu_pin(name, cores):
    cmd = ['xl', 'vcpu-pin', name, 'all', cores]
    subprocess.run(cmd).check_returncode()


def pin_by_tag(vm, event, **kwargs):
    vm = app.domains[str(vm)]
    if tag in list(vm.tags):
        _vcpu_pin(vm.name, P_CORES)
        print(f'Pinned {vm.name} to P-cores')
    else:
        _vcpu_pin(vm.name, E_CORES)
        print(f'Pinned {vm.name} to E-cores')


app = qubesadmin.Qubes()
dispatcher = qubesadmin.events.EventsDispatcher(app)
dispatcher.add_handler('domain-start', pin_by_tag)
asyncio.run(dispatcher.listen_for_events())
```
How do you get qubes to execute the script?
I looked at the admin events to move the -dm qubes started by HVM qubes, because they can't be configured using xen.xml, but I couldn't figure out how to use the Python scripts.
I forgot to mention that the script is meant to be run within dom0.
Are you using systemd to run the script automatically, or is there some other way to get Qubes OS to run the script?
@noskb: brilliant! I added a try/except KeyboardInterrupt handler (cleaner shutdown) and run it at boot time with a systemd service - it works great!
[edit - adding `_vcpu_pin("Domain-0", E_CORES)` pins all dom0 vcpus to E-cores at boot time, when using the `dom0_max_vcpus=X dom0_vcpus_pin` Xen options]
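For reference, a minimal sketch of where that extra line fits in the script above (placing it just before the event loop is simply my choice):

```python
# ... same imports, constants and handlers as in the script above ...
app = qubesadmin.Qubes()
dispatcher = qubesadmin.events.EventsDispatcher(app)
dispatcher.add_handler('domain-start', pin_by_tag)

# Pin dom0's vcpus to the E-cores once, at service startup
_vcpu_pin('Domain-0', E_CORES)

asyncio.run(dispatcher.listen_for_events())
```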
@renehoj - missed your message. You can run the script with systemd; alternatively you could put a .desktop file in ~/.config/autostart (or use XFCE's session/autostart config GUI) - but this won't work for qubes configured to start at boot time (before Xorg).
Here's what I'm using:
```
[Unit]
Description=Qubes CPU pinning
After=qubesd.service

[Service]
ExecStart=/usr/local/bin/cpu_pinning.py

[Install]
WantedBy=multi-user.target
```
[edit - fixed the After= dependency]
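In case it helps anyone: assuming the script is saved as /usr/local/bin/cpu_pinning.py (and made executable with chmod +x) and the unit above as /etc/systemd/system/cpu_pinning.service, running `sudo systemctl daemon-reload` followed by `sudo systemctl enable --now cpu_pinning.service` in dom0 should be enough to have it start at every boot - the unit name is just whatever you call the file.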
It seems to be enough to check whether the qube is HVM (HVM qubes get a separate `<name>-dm` stub domain that also needs pinning). Sorry for the messy code…
```python
def pin_by_tags(vm, event, **kwargs):
    vm = app.domains[str(vm)]
    if tag in list(vm.tags):
        if str(getattr(vm, 'virt_mode')) == 'hvm':
            _vcpu_pin(vm.name + '-dm', P_CORES)
        _vcpu_pin(vm.name, P_CORES)
        print(f'Pinned {vm.name} to P-cores')
    else:
        if str(getattr(vm, 'virt_mode')) == 'hvm':
            _vcpu_pin(vm.name + '-dm', E_CORES)
        _vcpu_pin(vm.name, E_CORES)
        print(f'Pinned {vm.name} to E-cores')
```
Thanks, I'll try it tonight.
Adding "cpufreq=xen:performance max_cstate=1" to GRUB_CMDLINE sets the default governor to performance, which seems to slightly improve boot speed on my system.
I use a boot script to change the governor to ondemand 60 seconds after boot.
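Roughly, that script is nothing more than this sketch (assuming `xenpm` is available in dom0; run it from a systemd unit or similar):

```python
#!/usr/bin/env python3
# Minimal sketch: keep the "performance" governor during boot, then
# switch Xen's scaling governor to "ondemand" after 60 seconds.
import subprocess
import time

time.sleep(60)
subprocess.run(['xenpm', 'set-scaling-governor', 'ondemand'], check=True)
```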
Thank you for posting your findings.
I've just tried with "cpufreq=xen:performance" (without the "max_cstate" param); boot time with ondemand, from LUKS password prompt to a working env (sys-net, sys-firewall, 2 VMs with firefox, 1 VM with evolution, 1 VM with xterm): 59 seconds. With performance: 45 seconds. Not bad.
With "performance" I can hear the CPU fan spinning a bit faster when all the VMs are starting; it stays at minimum RPM with "ondemand". I don't see any difference in CPU temp with either governor when the system is idle. When I have a bit more time I'll try to measure power consumption.
FWIW the Citrix doc recommends using "performance" for Intel CPUs; quoting: "By contrast, Intel processors typically save power by entering deep C-states so the 'performance' governor is preferred."
I've also been considering whether using the performance governor is just better.
I'll start using a watt meter on my PC, so I can compare the power consumption between performance and ondemand. With the current energy prices, a slight performance increase isn't worth a large increase in power consumption.
I'll also try to see if there is a way to make the Xen default affinity only the P-cores. Currently new qubes start with "all" affinity; it would probably be better to always use the P-cores.
So, I got curious and did a few time measurements with performance/ondemand as well as various dom0 cpu pinning configurations (I was for instance wondering what was the rationale/impact of pinning dom0 to E-cores).
tl;dr: "strict" 1:1 dom0 CPU pinning to E-cores with the performance governor on P-cores gives the most consistent, lowest starting times, representing a 20% performance increase over using the ondemand governor with no dom0 pinning.
Time in seconds for `qvm-start` to complete; 20 iterations, 6 vcpus assigned to dom0, on an i5-13600k (6 P-cores, 8 E-cores):
| | median | min | max | mean | pstdev |
|---|---|---|---|---|---|
| ondemand / no dom0 cpu pinning (0-13) | 9.5 | 7.9 | 10.0 | 9.2 | 0.7 |
| ondemand / dom0 pinned on all E cores (6-13) | 9.5 | 7.6 | 10.0 | 9.0 | 0.9 |
| ondemand / dom0 1:1 E core pinning (6->11) * | 7.9 | 7.6 | 9.9 | 8.1 | 0.7 |
| ondemand / dom0 1:1 P core pinning (0->5) | 11.6 | 11.2 | 11.7 | 11.6 | 0.2 |
| performance / no dom0 cpu pinning (0-13) | 9.2 | 7.6 | 9.9 | 8.9 | 0.9 |
| performance / dom0 1:1 E core pinning (6->11) * | 7.6 | 7.3 | 7.9 | 7.6 | 0.2 |
| performance / dom0 1:1 P core pinning (0->5) | 9.0 | 8.7 | 9.2 | 9.0 | 0.2 |
* "1:1 E core pinning" means pinning dom0 vcpu 0 to physical core 6 (E core #0), vcpu 1 to core 7 (E core #1), and so on, to avoid dynamic reshuffling
Findings:
- Better times with dom0 pinned to E-cores.
- As expected, "performance" fares better than "ondemand" (for that specific load), but the difference is minimal.
- "1:1" dom0 CPU pinning was always better than dynamic pinning, with lower and more consistent (stdev) load times, likely because of L1/L2 cache hits for all the VM management stuff (qubesd, libvirt, …).
- "performance" together with "1:1" dom0 E-core pinning exhibited the lowest and most consistent startup times.
Obviously the above might not hold when running heavy concurrent workloads (in that case it would be interesting to see how to tweak Xen not to reshuffle CPUs too aggressively). In my case the PC is idling most of the time and starting VMs as fast as possible is important.
[edit - added dom0 1:1 P core results]
Are you using qubesadmin.events to pin Domain-0 to the E-cores and the rest of the qubes to the P-cores?
I'm using @noskb's Python program to pin specific qubes to P-cores (the default is E-cores). For the "1:1" mapping of dom0 vcpus to E-cores I hacked a shell script (below); it was quicker than adding the functionality in Python. But if you just want to pin dom0 vcpus to any E-core, simply add `_vcpu_pin("Domain-0", E_CORES)` to @noskb's program.
```sh
#!/bin/sh
ECORE_START=6
NB_VCPUS=$(xl vcpu-list | grep -c '^Domain-0')
echo "Pinning dom0 $NB_VCPUS vcpus to E-cores ${ECORE_START}-$(($ECORE_START + $NB_VCPUS - 1))"
for dom0_vcpu in $(seq 0 $(($NB_VCPUS - 1))); do
    xl vcpu-pin Domain-0 $dom0_vcpu $(($ECORE_START + $dom0_vcpu))
done
```
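If you'd rather keep everything in Python, here's an untested sketch of the same 1:1 pinning (the `pin_dom0_one_to_one` name and the assumption that E-cores start at physical CPU 6 are mine):

```python
import subprocess

def pin_dom0_one_to_one(ecore_start=6):
    # Count dom0's vcpus from `xl vcpu-list` and pin vcpu N to pcpu ecore_start+N
    out = subprocess.run(['xl', 'vcpu-list', 'Domain-0'],
                         capture_output=True, text=True, check=True).stdout
    nb_vcpus = sum(1 for line in out.splitlines() if line.startswith('Domain-0'))
    for vcpu in range(nb_vcpus):
        subprocess.run(['xl', 'vcpu-pin', 'Domain-0', str(vcpu),
                        str(ecore_start + vcpu)], check=True)

pin_dom0_one_to_one()
```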
It is strange, but `cpufreq=xen:performance` does not work, and when I manually run `xenpm set-scaling-governor perfomance`, I get the error "failed to set governor name (invalid argument)".
Update
Everything works. Sorry for the confusion, it turns out I entered "perfomance" instead of "perfoRmance" :)
I have done some very casual tests of the power consumption of performance vs. ondemand, just running my default setup and keeping an eye on the watt meter.
I have a total of 15 qubes including dom0 and the sys qubes; 4 of them are running browsers, one of which is streaming 1080p video.
With ondemand the average is around 130-140 W, and with performance it's closer to 170-180 W. My PC has a GPU, 2 NVMe drives, 2 HDDs, and 2 PCI USB controllers, which all add to the total consumption.
I'll do some longer tests this week, but my initial impression is that performance has substantially higher power consumption.
In my case e-core assignment worked worse than dynamic affinity (without pinning).
dom0 script
```bash
#!/usr/bin/bash
qube="debian-11-minimal"

get_real_time() {
    # /usr/bin/time prints the elapsed time to stderr, hence the 2>&1
    realtime="$(/usr/bin/time -f "%e" qvm-start -q ${qube} 2>&1)"
    qvm-shutdown --wait -q "${qube}"
    echo $realtime
}

benchmark() {
    qvm-shutdown --all --wait -q
    for ((i = 0; i < 10 ; i++)); do
        sleep 15
        echo "$(get_real_time)"
    done
}

benchmark
```
Python script for calculations
```python
import statistics

data = [
    {'name': 'dynamic core ondemand', 'numbers': [4.40, 4.25, 4.41, 4.24, 4.39, 4.41, 4.39, 4.25, 4.47, 4.38]},
    {'name': 'dynamic core performance', 'numbers': [4.58, 4.25, 4.26, 4.37, 4.39, 4.56, 4.45, 4.65, 4.41, 4.36]},
    {'name': 'e-core ondemand', 'numbers': [6.03, 6.01, 5.79, 5.62, 5.76, 5.99, 5.98, 5.70, 5.66, 5.75]},
    {'name': 'e-core performance', 'numbers': [4.93, 4.86, 4.89, 4.90, 4.98, 5.02, 4.87, 4.84, 4.87, 5.01]},
]

with open("output.md", "w") as output_file:
    output_file.write("|| Min | Max | Mean | Median | Standard deviation |\n")
    output_file.write("|---|---|---|---|---|---|\n")
    for row in data:
        numbers = row['numbers']
        minimum = min(numbers)
        maximum = max(numbers)
        mean = statistics.mean(numbers)
        median = statistics.median(numbers)
        std_dev = statistics.pstdev(numbers)
        output_file.write(f"| {row['name']} | {minimum} | {maximum} | {mean} | {median} | {std_dev} |\n")
```
8 vcpus assigned to dom0, all AppVMs (qubes) pinned to P-cores, on an i7-13700k (8 P-cores, 8 E-cores):
| | Min | Max | Mean | Median | Standard deviation |
|---|---|---|---|---|---|
| dynamic ondemand | 4.24 | 4.47 | 4.359 | 4.39 | 0.077 |
| dynamic performance | 4.25 | 4.65 | 4.428 | 4.40 | 0.126 |
| e-core ondemand | 5.62 | 6.03 | 5.829 | 5.775 | 0.149 |
| e-core performance | 4.84 | 5.02 | 4.917 | 4.895 | 0.062 |
I think you get the lowest possible time by using 4 p-cores for dom0 and the other 4 p-cores for the test vm, all running in performance mode.
The E-cores are Intel Atom-derived CPUs; they are designed for very low power consumption, not for high performance. One E-core does not give you the same performance as one P-core, even if they are running at the same clock speed. I don't know what the ratio for the Alder Lake cores is, but I've read that standard Pentium cores were +50% faster than Atom cores at the same clock speed.