CPU Pinning Alder Lake

Are you using systemd to run the script automatically, or is there some other way to get Qubes OS to run the script?

@noskb: brilliant! I added a try/except KeyboardInterrupt handler (for a cleaner shutdown) and run it at boot time with a systemd service - it works great!

[edit - adding _vcpu_pin("Domain-0", E_CORES) pins all dom0 vcpus to E-cores at boot time, when using the dom0_max_vcpus=X dom0_vcpus_pin Xen options]

1 Like

@renehoj - missed your message. You can run the script with systemd; alternatively you could put a .desktop file in ~/.config/autostart (or use XFCE’s session/autostart config GUI) - but that won’t work for qubes configured to start at boot time (i.e., before Xorg).

Here’s what I’m using:

[Unit]
Description=Qubes CPU pinning
After=qubesd.service

[Service]
ExecStart=/usr/local/bin/cpu_pinning.py

[Install]
WantedBy=multi-user.target

[edit - fixed After= dependency]
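
To install it, drop the unit somewhere systemd will pick it up and enable it - the file name below is just an example:

sudo cp cpu_pinning.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now cpu_pinning.service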

2 Likes

It seems enough to check whether the qube is HVM. Sorry for the messy code…

def pin_by_tags(vm, event, **kwargs):
    # Relies on globals defined in the pinning script: app, tag, P_CORES, E_CORES, _vcpu_pin
    vm = app.domains[str(vm)]
    if tag in list(vm.tags):
        if str(getattr(vm, 'virt_mode')) == 'hvm':
            # HVM qubes have a stubdomain ('<name>-dm') whose vcpus should be pinned too
            _vcpu_pin(vm.name + '-dm', P_CORES)
        _vcpu_pin(vm.name, P_CORES)
        print(f'Pinned {vm.name} to P-cores')
    else:
        if str(getattr(vm, 'virt_mode')) == 'hvm':
            _vcpu_pin(vm.name + '-dm', E_CORES)
        _vcpu_pin(vm.name, E_CORES)
        print(f'Pinned {vm.name} to E-cores')
1 Like

Thanks, I’ll try it tonight.

Adding “cpufreq=xen:performance max_cstate=1” to GRUB_CMDLINE sets the default governor to performance, which seems to slightly improve boot speed on my system.

I use a boot script to change the governor back to ondemand 60 seconds after boot.
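
Roughly something like this - a minimal sketch, assuming the Xen options live in GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub; the script path and the 60-second delay are just placeholders:

#!/bin/sh
# hypothetical /usr/local/bin/ondemand-after-boot.sh, run from a systemd unit or cron @reboot
# (the boot entry itself carries: cpufreq=xen:performance max_cstate=1)
sleep 60
xenpm set-scaling-governor ondemand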

Thank you for posting your findings.
I’ve just tried with “cpufreq=xen:performance” (without the “max_cstate” param); boot time with ondemand, from LUKS password prompt to a working environment (sys-net, sys-firewall, 2 VMs with firefox, 1 VM with evolution, 1 VM with xterm): 59 seconds. With performance: 45 seconds. Not bad.
With ‘performance’ I can hear the CPU fan spinning a bit faster when all the VMs are starting; it stays at minimum rpm with ‘ondemand’. I don’t see any difference in CPU temperature with either governor when the system is idle. When I have a bit more time I’ll try to measure power consumption.
FWIW the Citrix doc recommends using ‘performance’ for Intel CPUs; quoting: “By contrast, Intel processors typically save power by entering deep C-states so the ‘performance’ governor is preferred.”

I’ve also been considering whether using the performance governor is simply better overall.

I’ll start using a watt meter on my PC, so I can compare the power consumption between performance and ondemand. With the current energy prices, a slight performance increase isn’t worth a large increase in power consumption.

I’ll also try to see if there is a way to make the Xen default affinity only the P-cores. Currently, new qubes start with “All” affinity; it would probably be better to always use the P-cores.

1 Like

So, I got curious and did a few time measurements with performance/ondemand as well as various dom0 CPU pinning configurations (I was, for instance, wondering about the rationale/impact of pinning dom0 to E-cores).

tl;dr: ‘strict’ 1:1 dom0 CPU pinning to E-cores, with the performance governor on the P-cores, gives the most consistent and lowest starting times, representing a 20% performance increase over using the ondemand governor with no dom0 pinning.

Time in seconds for qvm-start to complete; 20 iterations, 6 vcpus assigned to dom0, on an i5-13600K (6 P-cores, 8 E-cores):

| | median | min | max | mean | pstdev |
|---|---|---|---|---|---|
| ondemand / no dom0 cpu pinning (0-13) | 9.5 | 7.9 | 10.0 | 9.2 | 0.7 |
| ondemand / dom0 pinned on all E cores (6-13) | 9.5 | 7.6 | 10.0 | 9.0 | 0.9 |
| ondemand / dom0 1:1 E core pinning (6->11) * | 7.9 | 7.6 | 9.9 | 8.1 | 0.7 |
| ondemand / dom0 1:1 P core pinning (0->5) | 11.6 | 11.2 | 11.7 | 11.6 | 0.2 |
| performance / no dom0 cpu pinning (0-13) | 9.2 | 7.6 | 9.9 | 8.9 | 0.9 |
| performance / dom0 1:1 E core pinning (6->11) * | 7.6 | 7.3 | 7.9 | 7.6 | 0.2 |
| performance / dom0 1:1 P core pinning (0->5) | 9.0 | 8.7 | 9.2 | 9.0 | 0.2 |

* ‘1:1 E core pinning’ means pinning dom0 vcpu 0 to physical core 6 (E core #0), vcpu 1 to core 7 (E core #1), and so on to avoid dynamic reshuffling

Findings:

  • Better times with dom0 pinned to E-cores.
  • As expected, ‘performance’ fares better than ‘ondemand’ (for that specific load), but the difference is minimal.
  • ‘1:1’ dom0 CPU pinning was always better than dynamic pinning, with lower and more consistent (stdev) load times, likely because the VM management stack (qubesd, libvirt, …) stays hot in the L1/L2 caches.
  • ‘performance’ together with ‘1:1’ dom0 E-core pinning exhibited the lowest and most consistent startup times.

Obviously the above might not hold when running heavy concurrent workloads (in that case it would be interesting to see how to tweak Xen not to reshuffle CPUs too aggressively). In my case the PC is idle most of the time, and starting VMs as fast as possible is what matters.

[edit - added dom0 1:1 P core results]

4 Likes

Are you using qubesadmin.events to pin E-cores for Domain-0 and P-cores for the rest of the qubes?

I’m using @noskb’s python program to pin specific qubes to P-cores (the default is E-cores). For the ‘1:1’ mapping of dom0 vcpus to E-cores I hacked up a shell script (below); it was quicker than adding the functionality in python. But if you just want to pin dom0 vcpus to any E-core, simply add _vcpu_pin("Domain-0", E_CORES) to @noskb’s program.

#!/bin/sh

# First physical core of the E-cores (cores 0-5 are the P-cores on this CPU)
ECORE_START=6

# Number of vcpus currently assigned to dom0
NB_VCPUS=$(xl vcpu-list | grep -c '^Domain-0')

echo "Pinning dom0 $NB_VCPUS vcpus to E-cores ${ECORE_START}-$((ECORE_START + NB_VCPUS - 1))"

# 1:1 mapping: dom0 vcpu N -> physical core ECORE_START+N
for dom0_vcpu in $(seq 0 $((NB_VCPUS - 1))); do
        xl vcpu-pin Domain-0 "$dom0_vcpu" $((ECORE_START + dom0_vcpu))
done
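
To check the result, each dom0 vcpu should then show a single E-core in the affinity column of:

xl vcpu-list Domain-0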
1 Like

It is strange, but cpufreq=xen:performance does not work, and when I manually run xenpm set-scaling-governor perfomance, I get the error: failed to set governor name (invalid argument)

Update
Everything works. Sorry for the confusion - it turns out I had typed perfomance instead of perfoRmance :)

1 Like

I have made some very casual tests of the power consumption of performance vs. ondemand, just running my default setup and keeping an eye on the watt meter.

I have a total of 15 qubes including dom0 and the sys qubes; 4 of them are running browsers, one of which is streaming 1080p video.

With ondemand the average is around 130-140 W, and with performance it’s closer to 170-180 W. My PC has a GPU, 2 NVMe drives, 2 HDDs, and 2 PCI USB controllers, which all add to the total consumption.

I’ll do some longer tests this week, but my initial impression is that performance has substantially higher power consumption.

2 Likes

In my case e-core assignment worked worse than dynamic affinity (without pinning).

dom0 script
#!/usr/bin/bash

qube="debian-11-minimal"

get_real_time() {
  # GNU time writes the elapsed time to stderr, hence the 2>&1
  realtime="$(/usr/bin/time -f "%e" qvm-start -q "${qube}" 2>&1)"
  qvm-shutdown --wait -q "${qube}"
  echo "$realtime"
}

benchmark() {
  qvm-shutdown --all --wait -q
  for ((i = 0; i < 10 ; i++)); do
    sleep 15
    get_real_time
  done
}

benchmark
Python script for calculations
import statistics

data = [
    {'name': 'dynamic core ondemand', 'numbers': [4.40, 4.25, 4.41, 4.24, 4.39, 4.41, 4.39, 4.25, 4.47, 4.38]},
    {'name': 'dynamic core performance', 'numbers': [4.58, 4.25, 4.26, 4.37, 4.39, 4.56, 4.45, 4.65, 4.41, 4.36]},
    {'name': 'e-core ondemand', 'numbers': [6.03, 6.01, 5.79, 5.62, 5.76, 5.99, 5.98, 5.70, 5.66, 5.75]},
    {'name': 'e-core performance', 'numbers': [4.93, 4.86, 4.89, 4.90, 4.98, 5.02, 4.87, 4.84, 4.87, 5.01]},
]

with open("output.md", "w") as output_file:
    output_file.write("|| Min | Max | Mean | Median | Standard deviation |\n")
    output_file.write("|---|---|---|---|---|---|\n")
    for row in data:
        numbers = row['numbers']
        minimum = min(numbers)
        maximum = max(numbers)
        mean = statistics.mean(numbers)
        median = statistics.median(numbers)
        std_dev = statistics.pstdev(numbers)
        output_file.write(f"| {row['name']} | {minimum} | {maximum} | {mean} | {median} | {std_dev} |\n")

8 vcpus assigned to dom0, all AppVMs (qubes) set to P-cores, on an i7-13700K (8 P-cores, 8 E-cores):

| | Min | Max | Mean | Median | Standard deviation |
|---|---|---|---|---|---|
| dynamic ondemand | 4.24 | 4.47 | 4.359 | 4.39 | 0.077 |
| dynamic performance | 4.25 | 4.65 | 4.428 | 4.4 | 0.126 |
| e-core ondemand | 5.62 | 6.03 | 5.829 | 5.775 | 0.149 |
| e-core performance | 4.84 | 5.02 | 4.917 | 4.895 | 0.062 |
1 Like

I think you get the lowest possible time by using 4 P-cores for dom0 and the other 4 P-cores for the test VM, all running in performance mode.

The E-cores are Intel Atom-derived cores; they are designed for very low power consumption, not for high performance. One E-core does not give you the same performance as one P-core, even if they are running at the same clock speed. I don’t know what the ratio is for the Alder Lake cores, but I’ve read that standard Pentium cores were about 50% faster than Atom cores at the same clock speed.

In fact, my goal is to find a better role for the E-cores among the qubes, so that they do useful work without sacrificing overall performance.

You just have to pick qubes that can’t take advantage of the P cores and pin them to the E cores, freeing up more resources on the P cores for qubes that can use them.

I put whonix, Dom0, the templates, sys-net, sys-firewall, sys-vpn and disp-mgmt on the E-cores; I don’t think there is any meaningful way they can take advantage of running on faster cores.

It frees up the P cores for running my “user” qubes.
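
With the tag-based script from earlier in the thread, that selection is just a matter of tagging the qubes you want on the P-cores - for example (the tag name and qube name are only illustrations; the tag has to match the one the script checks for):

qvm-tags work add perf-pcores
qvm-tags work list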

1 Like

Yes, you are right - using P-cores for dom0 and the test VM, I got the best time:

| Min | Max | Mean | Median | pstdev |
|---|---|---|---|---|
| 2.97 | 3.47 | 3.20 | 3.19 | 0.10 |

I have a few questions:

  1. How many vcpus (dom0_max_vcpus) are you using for dom0?
  2. Are you pinning to specific E-cores, or dynamically to 16-23?
  3. Do you turn on the performance governor for the E-cores?
2 Likes

I’m using 4 cores for Dom0, and 4 cores for the other system qubes.

Dom0 is on 20-23 and the rest on 16-19; within those ranges I just let Xen do the placement - it’s too much work micromanaging the pinning at the core level.
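
In practice that split is just two affinity ranges, something like this (sys-net stands in for any of the system qubes):

# pin all dom0 vcpus to E-cores 20-23, and a system qube to E-cores 16-19
xl vcpu-pin Domain-0 all 20-23
xl vcpu-pin sys-net all 16-19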

I’m doing some tests to see what the power consumption difference is between ondemand and performance; until I have an idea of the difference, I don’t want to just run everything in performance.

I think running dom0 in performance mode makes sense, but then you might also want to reduce its vcpus to just 2; I don’t think it really benefits from 4 dedicated cores.
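
For reference, the dom0 vcpu count is the dom0_max_vcpus Xen option mentioned earlier - roughly this kind of line, assuming GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub is where your Xen options go:

GRUB_CMDLINE_XEN_DEFAULT="... dom0_max_vcpus=2"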

2 Likes

I have been looking at cpumask and cpupools to control which cores a new VM is created on, the idea being that you could have a small group of cores that are always running in performance mode to allow fast startup of new VMs; once they are running, the VMs are moved to ondemand cores.

You can’t use cpumask; it doesn’t allow you to pin/use cores outside the mask.

It seems to work with cpupools: all new VMs are created in pool0, and the only limitation I have found is that dom0 also has to be in pool0.

I was thinking about doing something like this:

  • 0-3: pool0, always in performance mode, for dom0 and starting VMs
  • 4-15: P-cores
  • 16-23: E-cores

It requires a bit of configuration in the boot scripts: you need to set up the pools every time the system boots, and you need to migrate each VM to the pool you want it to run in.
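
A rough, untested sketch of what such a boot script could look like - pool names, core ranges, the governor handling and the qube name are all just examples:

#!/bin/sh
# Cores must be freed from Pool-0 before they can be assigned to another pool;
# 0-3 stay in Pool-0 for dom0 and freshly started VMs.
for cpu in $(seq 4 23); do
        xl cpupool-cpu-remove Pool-0 "$cpu"
done

# Create the two pools (inline key=value pairs use the xlcpupool.cfg syntax).
xl cpupool-create 'name="pcores"' 'cpus="4-15"'
xl cpupool-create 'name="ecores"' 'cpus="16-23"'

# Keep the Pool-0 cores in performance mode (assumes xenpm accepts a per-CPU id here).
for cpu in 0 1 2 3; do
        xenpm set-scaling-governor "$cpu" performance
done

# Later, move a running qube out of the "fast start" pool, e.g.:
xl cpupool-migrate sys-net pcores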

2 Likes