CPU Pinning Alder Lake

I’m a little confused as to what you mean? Are you explaining the @noscb script?


Or are you already using a modified one?


The updated script handles both ‘domain-pre-start’ and ‘domain-start’ events, allowing it to:

  • pin qubes to P-cores almost immediately after qvm-start, in order to speed up startup
  • pin qubes that don’t have the ‘performance’ tag set to E-cores after qrexec becomes functional (or after an additional delay).

You can obviously adapt the script’s logic to fit your needs.

[edit: as a side note, the script could also be adapted to set the governor to ‘performance’ only while qubes are starting]

There is one thing I don’t understand. The script handles the ‘domain-start’ event, which fires when a qube starts. Does it also intercept the assignment of CPUs? Does the qube manage to start on the CPUs assigned by Xen, or is it already initializing on the CPUs assigned by the script via xl vcpu-pin? As far as I know, xl vcpu-pin only works on a running domain, so this isn’t clear to me. Can you test qube startup speed with and without your modified script? Pre-assign the qubes to E-cores and remove the P-cores from the pool so a qube can’t use them to run. That way you can check which CPUs a VM is actually running on when comparing startup speeds, and see whether the script has any effect on startup time.
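For reference, a minimal way to time qube startups from dom0 could look like the sketch below. It is only an illustration: the qube name, iteration count, and helper names are placeholders, not part of anyone’s actual test setup.

```python
import statistics
import subprocess
import time

def time_call(fn) -> float:
    """Return the wall-clock duration of fn() in seconds."""
    t0 = time.monotonic()
    fn()
    return time.monotonic() - t0

def measure_startup(qube: str, runs: int = 4) -> list[float]:
    """Start and shut down a qube several times, recording each qvm-start duration."""
    times = []
    for _ in range(runs):
        times.append(time_call(
            lambda: subprocess.run(["qvm-start", "--quiet", qube], check=True)))
        # shut the qube down again so the next iteration measures a cold start
        subprocess.run(["qvm-shutdown", "--wait", qube], check=True)
    return times

# usage (in dom0): samples = measure_startup("debian-11-minimal")
#                  print(statistics.mean(samples), statistics.median(samples))
```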

I’ve done some testing on the power consumption:
6 hours ondemand: 0.735 kWh (~123W average)
6 hours performance: 0.985 kWh (~164W average)

FWIW I’ve made some measurements with a multimeter, so I can mostly comment on short-term power consumption:

  • expectedly, the consumption of performance vs ondemand with identical cpu-intensive loads was the same (obviously higher than when idling).
  • however, unexpectedly, the consumption when the PC was idling or under low usage (typing emails, …) was exactly the same for performance/ondemand/powersave, at 69.5W - not even slightly changing over a few minutes. Note: I have only one NVMe drive and an additional PCI-E USB controller; no external GPU; 650W 80+ Platinum PSU.

So basically:

  • switching from performance to powersave according to IdleHint doesn’t make sense for me, except if I have background jobs running; and even then it could be more economical to “race to completion” rather than run a job at the lowest frequency for a longer time.
  • my crude startup measurements of performance vs ondemand show a 10-20% improvement, which according to your long-term measurements costs about 34% more energy (0.985 vs 0.735 kWh). But at the same time you should be able to “pack” 10-20% more work into the same period of time because programs/GUIs load faster, so it’s not clear-cut.
  • maybe Raptor Lake has better power optimizations than Alder Lake.

My gut feeling is that trying to put the P-cores into powersave when you detect that a VM is idle is overengineering the solution. You don’t know what else is running on the same cores; they could be shared with other VMs.

If you detect that a given application isn’t running in a VM, you could move the VM to the E-cores. E.g. if you close Thunderbird but your email qube is still on, it’s running and doing nothing until you start Thunderbird again, so you could move it to the E-cores.

IdleHint in dom0 means that the user is away. It’s not about detecting that a given app in a given vm is idle. My rationale was that there would be no need for ‘performance’ when I’m not in front of my pc, in which case reverting to ‘ondemand’ or ‘powersave’ would have been more economical.

But measurements showed this wasn’t needed because the machine consumed exactly the same amount of power as ‘ondemand’ (or ‘powersave’) when left idling.

I’ve also done a few qube startup time measurements - I’ll post them shortly - which show that ‘user’ pinning of qubes’ vcpus has little benefit, if any.

In the end, the tweaks that made a noticeable difference were switching to ‘performance’ and unlike others, pinning dom0 to E-cores (thanks for those hints btw).

The script handles the domain-start event, which happens when qube starts

A ‘domain-start’ event is received when the qube has started. Or more precisely, when qrexec has been set up, indicating a successful boot.

Does it also intercept the assignment of cpus?

It doesn’t ‘intercept’ anything. When you run qvm-start:

  1. xen starts the domain on whatever cores are available. Unlike xen-user.xml / cpupools, the script can’t do anything at that point because no qube events have been received yet; and anyway the domain doesn’t even show up in xl list while it’s being set up by xen.
  2. ~0.5s to 1s later, the script receives a ‘domain-pre-start’ event and pins the vcpus according to whatever you’ve configured it to do. In my case, it pins the qube’s vcpus to all P-cores.
  3. some time later, eg. ~4s for a debian-x11-minimal template, the script receives a ‘domain-start’ event; it does nothing if the qube has a ‘performance’ tag (because the qube is already pinned to P-cores), otherwise it pins the qube’s vcpus to E-cores, after an optional delay to let heavy apps like firefox finish starting up.

AFAIU the only difference between what the script does and xen-user.xml / cpupools is the very short amount of time (<1s) between xen domain setup and receiving a ‘domain-pre-start’ event, during which vcpus might not be pinned exactly where you’d want them to.
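To make the mechanism concrete: the pinning step performed at those events boils down to one xl invocation per qube. A minimal sketch (the core ranges and qube name in the usage comment are examples, not my actual topology):

```python
import subprocess

def pin_cmd(qube: str, cores: str) -> list[str]:
    # 'all' applies the affinity to every vcpu of the domain
    return ["xl", "vcpu-pin", qube, "all", cores]

def pin(qube: str, cores: str) -> None:
    # runs in dom0; raises if the domain doesn't exist yet
    subprocess.run(pin_cmd(qube, cores), check=True)

# usage (in dom0): pin("work", "0-7")    # e.g. P-core threads
#                  pin("work", "16-23")  # e.g. E-core threads
```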

Can you test qube startup speed

Sure, I had planned to do some measurements later today, but here goes; 4 iterations, starting debian-11-minimal, dom0 pinned ‘1:1’ to E-cores:

  • On an idle machine: with the script: ~4.8s ; without: ~4.8s ; with vcpus pinned to E-cores: ~5.8s

  • On a heavily loaded machine (all cores at 100%): with the script: ~16.5s ; without: ~16s ; with vcpus pinned to E-cores: ~20.2s

  • With E-cores at 100%, P-cores idling: with the script: ~16.3s ; without: ~16.5s ; with vcpus pinned to E-cores: ~19s

    Interestingly, I would have expected startup times to be much lower when P-cores were idling and E-cores were at 100%; that could be because the CPU throttles to stay within a given thermal envelope.

In conclusion: Xen is doing a good job at dynamically balancing/pinning vcpus to physical cores, to the extent that for general use, user pinning (eg. with the script) doesn’t help - or could even make things worse. YMMV.


You did a great job, thank you👍🏻


The three tests use P-cores for dom0 and initially assign P-cores/E-cores to the qube with xen-user.xml:

                          Min    Max    Mean   Median  pstdev
p-core                    3.03   3.29   3.16   3.17    0.07
e-core, with script       3.66   3.99   3.85   3.86    0.09
e-core, without script    3.72   4.19   3.95   3.97    0.14
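For reference, the columns in a table like this correspond to Python’s statistics module; a small example with illustrative samples (not the raw data behind the table, which isn’t posted here):

```python
import statistics

samples = [3.03, 3.10, 3.17, 3.29]  # illustrative startup times in seconds

row = (min(samples), max(samples),
       round(statistics.mean(samples), 2),
       round(statistics.median(samples), 2),
       round(statistics.pstdev(samples), 2))  # population standard deviation
print(row)
```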

What happened is what I suspected above: I don’t really see how pinning with xl vcpu-pin can work after the VM has started. It simply doesn’t assign the P-cores to the qube in time for the qube to start working on them. From my tests you can see that when using the script, the qube keeps using the vcpus assigned to it in xen-user.xml, or those assigned to it dynamically.

Perhaps the domain shows up in the xl list too late.

I have changed my system to using cpupools.

Below is my login script that sets up the pools. My qubes autostart script starts 10 qubes, which is why I don’t remove all the P-cores at the start, and also why I have the 20-second delay before taking the CPUs out of performance mode.


/usr/sbin/xl cpupool-cpu-remove Pool-0 8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
/usr/sbin/xl cpupool-create name="pcores" sched="credit2"
/usr/sbin/xl cpupool-cpu-add pcores 8,9,10,11,12,13,14,15
/usr/sbin/xl cpupool-create name="ecores" sched="credit2"
/usr/sbin/xl cpupool-cpu-add ecores 16,17,18,19,20,21,22,23

/usr/sbin/xl cpupool-migrate sys-net ecores
/usr/sbin/xl cpupool-migrate sys-net-dm ecores
/usr/sbin/xl cpupool-migrate sys-usb ecores
/usr/sbin/xl cpupool-migrate sys-usb-dm ecores
/usr/sbin/xl cpupool-migrate sys-firewall ecores
/usr/sbin/xl cpupool-migrate sys-vpn ecores


sleep 20

/usr/sbin/xl cpupool-cpu-remove Pool-0 4,5,6,7
/usr/sbin/xl cpupool-cpu-add pcores 4,5,6,7

/usr/sbin/xenpm set-scaling-governor ondemand

/usr/sbin/xenpm set-scaling-governor 0 performance
/usr/sbin/xenpm set-scaling-governor 1 performance
/usr/sbin/xenpm set-scaling-governor 2 performance
/usr/sbin/xenpm set-scaling-governor 3 performance

exit 0

The Python script, changed from pinning to cpupool migration:


import asyncio
import subprocess

import qubesadmin
import qubesadmin.events

pools = []
pools.append(dict(name="sys-", cpupool="ecores"))
pools.append(dict(name="debian-", cpupool="ecores"))
pools.append(dict(name="kicksecure-", cpupool="ecores"))
pools.append(dict(name="disp-mgmt", cpupool="ecores"))
pools.append(dict(name="user", cpupool="pcores"))
pools.append(dict(name="default", cpupool="pcores"))

def _vcpu_pin(name, pool):
    # despite the name, this now migrates the domain to a cpupool
    cmd = ['xl', 'cpupool-migrate', name, pool]
    subprocess.run(cmd)

def pin_by_tag(vm, event, **kwargs):
    vm = app.domains[str(vm)]
    for pool in pools:
        # first matching name prefix wins; the trailing "default" entry
        # catches every qube that matched nothing else
        if vm.name.startswith(pool['name']) or pool['name'] == "default":
            break

    _vcpu_pin(vm.name, pool['cpupool'])
    print(f"Migrated {vm.name} to cpupool {pool['cpupool']}")
    if str(getattr(vm, 'virt_mode')) == 'hvm':
        _vcpu_pin(vm.name + '-dm', pool['cpupool'])
        print(f"Migrated {vm.name}-dm to cpupool {pool['cpupool']}")

app = qubesadmin.Qubes()
dispatcher = qubesadmin.events.EventsDispatcher(app)
dispatcher.add_handler('domain-start', pin_by_tag)
asyncio.run(dispatcher.listen_for_events())

I like your method; the only thing I noticed is that you are doing a double migration for sys-*.
I personally don’t use cpupools yet. I want to work on this solution this weekend, but I’m leaning towards leaving in Pool-0 the processors that will pre-start the VMs and dom0, and then pinning the desired VMs.

So far I’ve been using this script for pinning:

#!/usr/bin/env python3

import asyncio
import subprocess

import qubesadmin
import qubesadmin.events

qube_name = {
    'vault': '16-23',
    'sys-net': '16-23',
    'sys-firewall': '16-23',
    'sys-usb': '16-23',
}

def _vcpu_pin(name, cores):
    cmd = ['xl', 'vcpu-pin', name, 'all', cores]
    subprocess.run(cmd)

def pin_by_name(vm, event, **kwargs):
    vm = app.domains[str(vm)]
    if vm.name in qube_name:
        cores = qube_name[vm.name]
        if str(getattr(vm, 'virt_mode')) == 'hvm':
            _vcpu_pin(vm.name + '-dm', cores)
        _vcpu_pin(vm.name, cores)
        print(f'Pinned {vm.name} to cores {cores}')
    else:
        print(f'No pinning for {vm.name}')

app = qubesadmin.Qubes()
dispatcher = qubesadmin.events.EventsDispatcher(app)
dispatcher.add_handler('domain-start', pin_by_name)
asyncio.run(dispatcher.listen_for_events())

Have you made measurements with all those tweaks, and without (leaving the governor aside)?

As mentioned in a previous post, pinning starting qubes to P-cores didn’t improve anything on my hardware/setup, it was best to let xen do its stuff. The only thing that improved startup times was setting the governor to ‘performance’. “1:1” pinning of dom0 vcpus to E-cores made startup times much more consistent - but didn’t lower them as a whole.

Since you guys go to great lengths micro-managing vcpu pinning, surely there must be a real improvement, so I’m wondering what I’m missing.

I’ve only run the test script once, but it seems to be working; the start times are as expected for the VM + dom0 on P-cores in performance mode.

The pinning is there to control when the E-cores are used; if the cores were homogeneous, it would be fine to let Xen do its stuff.


I noticed that startup speed is better when dom0 and the VM are pinned to the P-cores, and worse when dom0 and the VM are on the E-cores (ondemand). Each combination gives a different startup speed:

  1. dom0 p-core/ vm e-core
  2. dom0 e-core/ vm p-core
  3. dom0 e-core/vm e-core
  4. dom0 p-core/vm p-core
  5. dynamic affinity

Of all the options, the best results are achieved with 1:1 P-core pinning. In my measurements the speed of E-cores (performance) and P-cores (ondemand) was almost the same (if that helps anyone), but 1:1 P-cores (performance) leads the measurements.

As for not noticing any difference: try running the tests without the script trying to put the VM on the P-cores, and attach the processors using xen-user.xml instead. The script gives the false impression that you are running the VM on P-cores; it cannot set the processors in time. This means your VM keeps running on dynamically assigned CPUs, or on those from xen-user.xml if it is present.

Wouldn’t it be reasonable to have a default of “prefer P-cores for domUs until we run out of them, add E-cores when needed, pin dom0 to E-cores”? How could it be implemented? Is a userspace daemon fast enough to react?
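A crude sketch of what the placement decision in such a policy could look like. This is purely hypothetical: the helper name and the assumption of 8 P-core threads are mine, and it presumes a daemon that tracks how many vcpus it has already placed on the P-cores.

```python
P_CORE_THREADS = 8  # assumption: 8 P-core hardware threads available to domUs

def choose_pool(vcpus_on_p: int, new_vcpus: int) -> str:
    """Prefer the P-core pool until it would be oversubscribed, then spill to E-cores."""
    if vcpus_on_p + new_vcpus <= P_CORE_THREADS:
        return "pcores"
    return "ecores"

# e.g. the first couple of qubes land on P-cores, later ones spill to E-cores
```

Whether this reacts fast enough would come down to the same event-latency question discussed above.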

I don’t think SMT works with the credit2 scheduler when the cores are moved out of Pool-0, but it seems to work with credit.

Credit2 should work with hyper-threading, but there might be an issue with P and E cores: all the cores get added to the pool but only the even-numbered ones are used; with credit, both the odd and the even ones are used.

It’s possible to use qrexec to allow qubes to request to be moved to P cores, which is useful if you want to keep qubes on E cores until you actually use them.


The policy (e.g. in /etc/qubes-rpc/policy/qubes.PCores, assuming the legacy policy format):

$anyvm dom0 allow


The service itself (e.g. /etc/qubes-rpc/qubes.PCores in dom0):

/usr/sbin/xl cpupool-migrate $QREXEC_REMOTE_DOMAIN pcores

Then you can use qrexec-client-vm dom0 qubes.PCores in a qube, and it will be moved to pcores.

It’s possible to change the Exec entry of a program’s .desktop file to automatically move the VM when the program is started.

Exec=bash -c 'qrexec-client-vm dom0 qubes.PCores && blender'

This would allow you to move the vm when you start blender.

offtopic (@renehoj - would have PM’ed you to avoid polluting this thread but your profile is locked): no idea if you have the option to enable the C1E state in Dasharo, but on the MSI factory BIOS this option is disabled by default, and enabling it shaved 10W without adverse effects on startup times or overall performance.

I don’t think you can change the cstate with Dasharo.

If you are using cpupool it also doesn’t seem like you can use cstates beyond C0 and C1, which makes it kinda pointless.

All in all, cpupools just seem to have a lot of issues. The documentation doesn’t mention any of the negative side effects; I found a note saying that reconfiguring the CPUs will disable some features, but I don’t know if that is what comes into play when you move cores between pools.


I was wrong about cstates not working with cpupools.

It seems to be a result of pinning dom0: you can only use xenpm to read the idle state of the cores dom0 is using. I don’t know if this means the other cores can’t use C2, but I think it just means you can’t see whether those cores are in the C1 or C2 state, and both states are reported as C1.

For me xenpm get-cpuidle-states 1 shows the same info (C0/C1/C2) whatever P or E cores dom0 is pinned to; eg:

All C-states allowed

cpu id               : 1
total C-states       : 3
idle time(ms)        : 13978409
C0                   : transition [             4714155]
                       residency  [              266071 ms]
C1                   : transition [             2211945]
                       residency  [              836535 ms]
C2                   : transition [             2502210]
                       residency  [            13081558 ms]

I remember also trying to add xen’s cpuidle boot option but it didn’t change anything.

Yes, it doesn’t matter if it’s P or E cores, but it’s only the cores dom0 is using that have C2; all other cores only have C0 and C1.

If you remove the pinning and run dom0 on all cores, they all get C2.