CPU Pinning on Alder Lake

Thanks for sharing this! On some chips, SMT needs to be enabled so that all cores can enter C-states deeper than C1, which prolongs battery life. With this CPU pinning technique, we can enable SMT while still using only one of the two threads on each core.

I have been looking at cpupools; maybe it's a better fit for the use case where you just want to disable the sibling cores.

Maybe you can just change Pool-0 and it will automatically apply to all qubes.
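
For example, a rough (untested) sketch of that idea: removing the SMT sibling threads from Pool-0 should keep every qube off them without per-VM pinning. The CPU numbers below are an assumption (second thread of each P-core being the odd-numbered CPUs); check your actual topology, e.g. with xl info -n, before using real numbers.

#!/usr/bin/env python3
# Untested sketch: remove the SMT sibling threads from Pool-0 so no qube
# is ever scheduled on them.
import subprocess

# Assumption: odd-numbered CPUs are the second thread of each P-core.
SIBLING_CPUS = [1, 3, 5, 7, 9, 11]

for cpu in SIBLING_CPUS:
    subprocess.run(['xl', 'cpupool-cpu-remove', 'Pool-0', str(cpu)], check=True)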


There is also an approach using qubesadmin.events.

The following is a simple example that moves a VM to the P-cores if it has the "performance" tag when it starts, and to the E-cores otherwise.

#!/usr/bin/env python3

import asyncio
import subprocess

import qubesadmin
import qubesadmin.events

# i5-13600k (smt=off): cores 0-5 are P-cores, 6-13 are E-cores
P_CORES = '0-5'
E_CORES = '6-13'

tag = 'performance'

def _vcpu_pin(name, cores):
    # Pin all vcpus of the given domain to the given physical cores.
    cmd = ['xl', 'vcpu-pin', name, 'all', cores]
    subprocess.run(cmd).check_returncode()

def pin_by_tag(vm, event, **kwargs):
    # Handler for the domain-start event: tagged qubes go to the P-cores,
    # everything else to the E-cores.
    vm = app.domains[str(vm)]
    if tag in list(vm.tags):
        _vcpu_pin(vm.name, P_CORES)
        print(f'Pinned {vm.name} to P-cores')
    else:
        _vcpu_pin(vm.name, E_CORES)
        print(f'Pinned {vm.name} to E-cores')

app = qubesadmin.Qubes()
dispatcher = qubesadmin.events.EventsDispatcher(app)
dispatcher.add_handler('domain-start', pin_by_tag)
asyncio.run(dispatcher.listen_for_events())

How do you get Qubes OS to execute the script?

I looked at the admin events to move the -dm qubes started by HVM qubes, because they can't be configured using xen.xml, but I couldn't figure out how to use the Python scripts.

I forgot to mention that the script is meant to be run in dom0.

Are you using systemd to run the script automatically, or is there some other way to get Qubes OS to run the script?

@noskb: brilliant! I added a try/except KeyboardInterrupt handler (for a cleaner shutdown) and run it at boot time with a systemd service - it works great!

[edit - adding _vcpu_pin("Domain-0", E_CORES) pins all dom0 vcpus to the E-cores at boot time - when using the dom0_max_vcpus=X dom0_vcpus_pin Xen options]
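
For reference, a minimal sketch of what that ends up looking like (adapted from @noskb's script above; the core ranges still assume an i5-13600K with smt=off, adjust to your CPU):

#!/usr/bin/env python3
# Sketch of @noskb's script with the two changes described above:
# dom0 pinned to the E-cores once at startup, and a KeyboardInterrupt
# handler so stopping the systemd service (or Ctrl-C) exits cleanly.

import asyncio
import subprocess

import qubesadmin
import qubesadmin.events

P_CORES = '0-5'    # i5-13600k, smt=off
E_CORES = '6-13'

def _vcpu_pin(name, cores):
    subprocess.run(['xl', 'vcpu-pin', name, 'all', cores], check=True)

def pin_by_tag(vm, event, **kwargs):
    vm = app.domains[str(vm)]
    cores = P_CORES if 'performance' in list(vm.tags) else E_CORES
    _vcpu_pin(vm.name, cores)
    print(f'Pinned {vm.name} to {cores}')

app = qubesadmin.Qubes()
dispatcher = qubesadmin.events.EventsDispatcher(app)
dispatcher.add_handler('domain-start', pin_by_tag)

# Pin all dom0 vcpus to the E-cores at boot time.
_vcpu_pin('Domain-0', E_CORES)

try:
    asyncio.run(dispatcher.listen_for_events())
except KeyboardInterrupt:
    pass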


@renehoj - missed your message. You can run the script with systemd; alternatively you could put a .desktop file in ~/.config/autostart (or use XFCE's session/autostart GUI) - but this won't work for qubes configured to start at boot time (before Xorg starts).
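
For the autostart route, a minimal .desktop entry could look like this (the script path is an assumption, matching the systemd unit below):

[Desktop Entry]
Type=Application
Name=Qubes CPU pinning
Exec=/usr/local/bin/cpu_pinning.py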

Here's the systemd unit I'm using:

[Unit]
Description=Qubes CPU pinning
After=qubesd.service

[Service]
ExecStart=/usr/local/bin/cpu_pinning.py

[Install]
WantedBy=multi-user.target

[edit - fixed After= dependency]
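
(Assuming the unit is saved as /etc/systemd/system/cpu_pinning.service in dom0 and the script is executable, sudo systemctl enable --now cpu_pinning.service should take care of starting it at every boot.)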


It seems to be enough to check whether the qube is an HVM and, if so, pin its stubdomain (<name>-dm) as well. Sorry for the messy code…

def pin_by_tags(vm, event, **kwargs):
    vm = app.domains[str(vm)]
    if tag in list(vm.tags):
        # HVM qubes also get a stubdomain (<name>-dm) for the device model;
        # pin it to the same cores as the qube itself.
        if str(getattr(vm, 'virt_mode')) == 'hvm':
            _vcpu_pin(vm.name + '-dm', P_CORES)
        _vcpu_pin(vm.name, P_CORES)
        print(f'Pinned {vm.name} to P-cores')
    else:
        if str(getattr(vm, 'virt_mode')) == 'hvm':
            _vcpu_pin(vm.name + '-dm', E_CORES)
        _vcpu_pin(vm.name, E_CORES)
        print(f'Pinned {vm.name} to E-cores')

Thanks, I'll try it tonight.

Adding "cpufreq=xen:performance max_cstate=1" to GRUB_CMDLINE sets the default governor to performance, which seems to slightly improve boot speed on my system.

I use a boot script to change the governor back to ondemand 60 seconds after boot.
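
Something along these lines (a sketch, not the exact script; xenpm set-scaling-governor is the same command used further down in this thread):

#!/usr/bin/env python3
# Sketch: run from dom0 at boot (e.g. via a systemd service), wait a minute,
# then switch the Xen cpufreq governor back to ondemand.
import subprocess
import time

time.sleep(60)
subprocess.run(['xenpm', 'set-scaling-governor', 'ondemand'], check=True)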

Thank you for posting your findings.
I've just tried with "cpufreq=xen:performance" (without the "max_cstate" param); boot time with ondemand, from the LUKS password prompt to a working environment (sys-net, sys-firewall, 2 VMs with Firefox, 1 VM with Evolution, 1 VM with xterm): 59 seconds. With performance: 45 seconds. Not bad.
With 'performance' I can hear the CPU fan spinning a bit faster when all the VMs are starting; it stays at minimum rpm with 'ondemand'. I don't see any difference in CPU temperature with either governor when the system is idle. When I have a bit more time I'll try to measure power consumption.
FWIW the Citrix doc recommends using 'performance' for Intel CPUs; quoting: "By contrast, Intel processors typically save power by entering deep C-states so the "performance" governor is preferred."

I've also been wondering whether simply using the performance governor is better.

I'll start using a watt meter on my PC so I can compare the power consumption between performance and ondemand. With the current energy prices, a slight performance increase isn't worth a large increase in power consumption.

I'll also try to see if there is a way to make the Xen default affinity only the P-cores. Currently, new qubes start with "All" affinity; it would probably be better to always use the P-cores.


So, I got curious and did a few time measurements with performance/ondemand as well as various dom0 CPU pinning configurations (I was, for instance, wondering about the rationale/impact of pinning dom0 to E-cores).

tl;dr: 'strict' 1:1 dom0 CPU pinning to E-cores, with the performance governor on the P-cores, gives the most consistent and lowest start times, representing a 20% performance increase over using the ondemand governor with no dom0 pinning.

Time in seconds for qvm-start to complete; 20 iterations, 6 vcpus assigned to dom0, on an i5-13600K (6 P-cores, 8 E-cores):

                                                  median   min    max   mean  pstdev
ondemand / no dom0 cpu pinning (0-13)                9.5    7.9   10.0    9.2     0.7
ondemand / dom0 pinned on all E-cores (6-13)         9.5    7.6   10.0    9.0     0.9
ondemand / dom0 1:1 E-core pinning (6->11) *         7.9    7.6    9.9    8.1     0.7
ondemand / dom0 1:1 P-core pinning (0->5)           11.6   11.2   11.7   11.6     0.2
performance / no dom0 cpu pinning (0-13)             9.2    7.6    9.9    8.9     0.9
performance / dom0 1:1 E-core pinning (6->11) *      7.6    7.3    7.9    7.6     0.2
performance / dom0 1:1 P-core pinning (0->5)         9.0    8.7    9.2    9.0     0.2

* '1:1 E-core pinning' means pinning dom0 vcpu 0 to physical core 6 (E-core #0), vcpu 1 to core 7 (E-core #1), and so on, to avoid dynamic reshuffling.

Findings:

  • Better times with dom0 pinned to E-cores.
  • As expected, 'performance' fares better than 'ondemand' (for this specific load), but the difference is minimal.
  • '1:1' dom0 CPU pinning was always better than dynamic pinning, with lower and more consistent (stdev) load times, likely because of L1/L2 cache hits for all the VM management machinery (qubesd, libvirt, …).
  • 'performance' together with '1:1' dom0 E-core pinning exhibited the lowest and most consistent startup times.

Obviously the above might not hold when running heavy concurrent workloads (in that case it would be interesting to see how to tweak Xen so it doesn't reshuffle CPUs too aggressively). In my case the PC is idle most of the time, and starting VMs as fast as possible is what matters.

[edit - added dom0 1:1 P core results]


Are you using qubesadmin.events to pin Domain-0 to the E-cores and the rest of the qubes to the P-cores?


I'm using @noskb's Python program to pin specific qubes to the P-cores (the default is the E-cores). For the '1:1' mapping of dom0 vcpus to E-cores I hacked up a shell script (below); it was quicker than adding the functionality in Python. But if you just want to pin dom0 vcpus to any E-core, simply add _vcpu_pin("Domain-0", E_CORES) to @noskb's program.

#!/bin/sh
# Pin each dom0 vcpu 1:1 to an E-core: vcpu 0 -> core 6, vcpu 1 -> core 7, etc.

ECORE_START=6

NB_VCPUS=$(xl vcpu-list | grep -c '^Domain-0')

echo "Pinning dom0 $NB_VCPUS vcpus to E-cores ${ECORE_START}-$(($ECORE_START + $NB_VCPUS - 1))"

for dom0_vcpu in $(seq 0 $(($NB_VCPUS - 1))); do
        xl vcpu-pin Domain-0 $dom0_vcpu $(($ECORE_START + $dom0_vcpu))
done

It is strange, but cpufreq=xen:performance does not work, and when I manually run xenpm set-scaling-governor perfomance, I get the error failed to set governor name (invalid argument).

Update
Everything works. Sorry for the confusion; it turns out I had typed perfomance instead of perfoRmance :)


I did a very casual test of the power consumption of performance vs. ondemand, just running my default setup and keeping an eye on the watt meter.

I have a total of 15 qubes including dom0 and the sys qubes, 4 of which are running browsers, one of which is streaming 1080p video.

With ondemand the average is around 130-140 W, and with performance it's closer to 170-180 W. My PC has a GPU, 2 NVMe drives, 2 HDDs, and 2 PCI USB controllers, which all add to the total consumption.

I'll do some longer tests this week, but my initial impression is that performance has substantially higher power consumption.


In my case, E-core assignment for dom0 worked worse than dynamic affinity (without pinning).

dom0 script
#!/usr/bin/bash

qube="debian-11-minimal"

# /usr/bin/time prints the elapsed time on stderr, so redirect it into the
# command substitution to actually capture it.
get_real_time() {
  realtime="$(/usr/bin/time -f "%e" qvm-start -q ${qube} 2>&1)"
  qvm-shutdown --wait -q "${qube}"
  echo "$realtime"
}

benchmark() {
  qvm-shutdown --all --wait -q
  for ((i = 0; i < 10 ; i++)); do
    sleep 15
    echo "$(get_real_time)"
  done
}

benchmark
Python script for calculations
import statistics

data = [
    {'name': 'dynamic core ondemand', 'numbers': [4.40, 4.25, 4.41, 4.24, 4.39, 4.41, 4.39, 4.25, 4.47, 4.38]},
    {'name': 'dynamic core performance', 'numbers': [4.58, 4.25, 4.26, 4.37, 4.39, 4.56, 4.45, 4.65, 4.41, 4.36]},
    {'name': 'e-core ondemand', 'numbers': [6.03, 6.01, 5.79, 5.62, 5.76, 5.99, 5.98, 5.70, 5.66, 5.75]},
    {'name': 'e-core performance', 'numbers': [4.93, 4.86, 4.89, 4.90, 4.98, 5.02, 4.87, 4.84, 4.87, 5.01]},
]

with open("output.md", "w") as output_file:
    output_file.write("|| Min | Max | Mean | Median | Standard deviation |\n")
    output_file.write("|---|---|---|---|---|---|\n")
    for row in data:
        numbers = row['numbers']
        minimum = min(numbers)
        maximum = max(numbers)
        mean = statistics.mean(numbers)
        median = statistics.median(numbers)
        std_dev = statistics.pstdev(numbers)
        output_file.write(f"| {row['name']} | {minimum} | {maximum} | {mean} | {median} | {std_dev} |\n")

8 vcpus assigned to dom0, all AppVMs (qubes) pinned to P-cores, on an i7-13700K (8 P-cores, 8 E-cores):

                       Min    Max    Mean   Median  Std. deviation
dynamic ondemand       4.24   4.47   4.359  4.39    0.077
dynamic performance    4.25   4.65   4.428  4.40    0.126
e-core ondemand        5.62   6.03   5.829  5.775   0.149
e-core performance     4.84   5.02   4.917  4.895   0.062

I think you would get the lowest possible time by using 4 P-cores for dom0 and the other 4 P-cores for the test VM, all running in performance mode.

The E-cores are based on Intel's Atom line; they are designed for very low power consumption, not for high performance. One E-core does not give you the same performance as one P-core, even if they are running at the same clock speed. I don't know what the ratio is for the Alder Lake cores, but I've read that standard Pentium cores were about 50% faster than Atom cores at the same clock speed.