I currently run an i9-12900K (with Qubes 4.1) and I would like to implement this, but I'm not sure what's the best thing to use currently. I tried the python script with vcpu-pin, but I've seen somewhere in the thread that it's not applying the change in time. Is it better to use pools in that case?
What do you mean by “applying the change in time”?
I've been using pools with the admin API method, and I've not had any issues with it.
I’m talking about this part of the thread:
So I don't know if the python script using vcpu-pin works or not.
I’ll try that then.
If you want to make sure the VMs are created using P-cores, then I think pools are best. You know they will start in pool0, and you can control which cores you have in pool0.
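For reference, shaping pool0 from dom0 can be as simple as the following sketch (the CPU range is only an example for a 12900K with SMT off; check your actual layout first, and note that older xl versions may want single CPU IDs instead of ranges):
xl cpupool-list -c                  # show which CPUs are currently in each pool
xl cpupool-cpu-remove Pool-0 8-15   # assumed E-core range: take them out of Pool-0
# new VMs now start on the remaining (P-)cores left in Pool-0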
Do you know the answers to the questions @procShield posted on Jun 25? There was no reply. Pinning is still possible in 4.2, correct?
Pinning is also possible in 4.2, although the easiest method, xen-user.xml, doesn't seem to work right now.
I personally have switched to admin events now, as described in the initial post, which is just as reliable.
I did not check pools, as they are not part of my use case.
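For anyone who wants to try the vcpu-pin route by hand before wiring up the event-based script from the initial post, a one-shot over all running qubes looks roughly like this (the P-core range is just a placeholder; adjust it to your topology):
#!/bin/bash
# rough one-shot sketch: pin every vCPU of every running domU to the assumed P-core range
P_CORES="0-7"   # placeholder, adjust to your CPU
for dom in $(xl list | awk 'NR>2 {print $1}'); do   # NR>2 skips the header and Domain-0
    xl vcpu-pin "$dom" all "$P_CORES"
done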
Very interesting thread!
If you don’t mind, I’ll add some observations from my machine (laptop, i7-1260P, which has 4 P-cores and 8 E-cores):
- The reported paucity of C-states for some cores is also the case for my CPU, even though I'm still using the standard dynamic scheduling, i.e. no CPU pinning of any kind, and it is always the last 4 cores (so one E-core module) which have only C0 and C1 listed, while the others list C0-C3.
Interestingly this changes when turning SMT on, where all cores then have C0-C3, but sched-gran=core doesn’t work for me either (Xen 4.14 seems to not support Alder Lake yet with that feature), so I will leave SMT off.
Could it be that when SMT is off, Xen compensates by barring the 4 E-cores that it may have internally pinned dom0 to, from entering C2-C3?
- For some reason xenpm shows P0 for all my cores as 2101 MHz and P1 as 2100 MHz… this is far below the specced max. freq. for both the E-cores (3.4 GHz) and P-cores (4.7 GHz) (both on Turbo). This didn't change after switching to hwp. I realize that 2.1 GHz is the base frequency, but Turbo is shown as enabled. Is this normal? I suppose I'd have to "catch" the core in Turbo mode so it would show me the P-states on Turbo…
- After I made the switch to hwp as the scaling driver, I noticed that it now leaves the cores in P0 (highest perf) all the time, not scaling down at all, but it does use the C-states to save power… with that I'm not even sure that it's advisable to use hwp over the default, as power usage may be worse now, since the default driver also used the higher C-states, but put the cores into P7 on idle as well, so theoretically that should mean more energy savings, right? I guess what I wrote under point 2 could at least mean that Turbo is off, but so far, subjectively, it seems like it drains the battery a bit faster; hwp also doesn't seem to use C3 much on those cores that have it listed… the cores are mostly in C2 so far with very little going on in the system.
If Xen starts supporting Alder Lake with core scheduling then I’ll definitely give CPU-pinning a shot, but right now I’m not sure that I’ll really benefit from it…I just set most service VMs to 6 and user qubes to 8 cores and when I need a lot of performance in a VM I will manually change it to 12 and so far that works quite well.
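In case it's useful, that per-qube core count is just the vcpus preference; it takes effect the next time the qube starts:
qvm-prefs <qube-name> vcpus 12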
- Yes, that is normal. I don't know why xenpm reports the speeds it does, but I think it has to do with the E and P cores not having the same base and max clock speed. You can check the average speed; it should show the correct clock speed for both P and E cores.
- hwp is the default in 4.2. When I researched the C and P states for Xen, that was the recommended model: you spend a short amount of time at the highest performance level, and then enter the deepest sleep state possible relative to the cost of waking up. Reducing the clock speed doesn't work the same way on hypervisors as it does on bare metal systems.
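If you want to look at this on your own machine, the read-only xenpm commands used in this thread are enough to get a picture:
xenpm get-cpufreq-para      # per-core P-state / HWP parameters and the active driver
xenpm get-cpuidle-states    # per-core C-states and residency counters
xenpm get-cpufreq-average   # average effective clock speed per core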
Thanks for the info! Using xenpm get-cpufreq-average I've been able to determine that Turbo doesn't even work with the standard acpi driver: I let a VM crunch pi in 16 simultaneous threads, and with the standard driver it didn't go beyond the maximum base frequency on any core…
It works with hwp, though, so even though I still think the standard one is a bit more power-saving friendly, I’m not willing to throw Turbo Boost out the window, especially since I’m already living without SMT…
I also crunched some numbers to compare the standard driver and hwp and how much relative time they spend in the C-states, and hwp did allocate a somewhat larger portion of time to C-states, but my test wasn't very controlled: I just used the laptop with relatively little stuff going on, so maybe in one case I did actually have more performance demand than in the other. Another advantage of the standard driver seems to be that I can actually set different CPU governors, whereas hwp seems to only have one ("hwp-internal", as it's called).
In any case there’s no obvious big downside to using hwp and the huge upside that it actually is able to use the CPU properly regarding Turbo Boost.
Just saw this from (apparently) the developer of hwp… it's possible to modify the performance tendency of hwp with xenpm set-cpufreq-hwp energy-perf:N (N being an integer 0-255).
I wrote a little script that allows toggling this easily:
#!/bin/bash
# valid values: 0-255, lower means more performance-leaning
PERF=64
BAL=128 # don't modify this value without uncommenting the second xenpm command in the first if-scope.
PS=223
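# read the currently configured energy_perf value (first matching core only)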
CUR_VAL=$(xenpm get-cpufreq-para | grep -E -m 1 -o 'energy_perf \[[0-9]+\]' | tr -dc '0-9')
if [ "$CUR_VAL" -lt "$BAL" ]; then
xenpm set-cpufreq-hwp balance
#xenpm set-cpufreq-hwp energy-perf:$BAL
notify-send "CPU: Balanced"
elif [ "$CUR_VAL" -eq "$BAL" ]; then
xenpm set-cpufreq-hwp energy-perf:$PS
notify-send "CPU: Power saving"
else
xenpm set-cpufreq-hwp energy-perf:$PERF
notify-send "CPU: Performance"
fi
Critiques welcome, as I’m not really well versed in bash. (Edit: I modified this now, look at my next post below for why)
Testing this shows the following results for starting a new dispVM with Firefox, closing Firefox as soon as it’s fully loaded and waiting for the VM to complete shutdown again:
Performance: ~15 s
Balanced: ~16.5 s
Power saving: ~22 s
I also tested with the value 0, but it didn't really seem to show a difference vs. 64, and in the linked post the dev says to use 64 for performance. Note that this is without any CPU pinning, though, just dynamic allocation of 8/12 cores to the VM by Xen, but I've run it multiple times and the results seem to be pretty consistent.
Now that I know this, however, I will start thinking about experimenting with pinning after all, as one could script things like disp-Firefox to set the CPU to performance first, pin the VM to all cores, and then bring it down to power saving after launch and pin the VM to E-cores.
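Something along these lines is what I have in mind; an untested sketch, where the disposable template, the E-core range, the sleep, and the (naive) way the new disposable is found are all placeholders:
#!/bin/bash
# rough sketch: start a disposable on full performance, then hand it to the E-cores
E_CORES="4-11"                                   # placeholder, depends on the actual topology
xenpm set-cpufreq-hwp energy-perf:64             # lean towards performance for the launch
qvm-run --dispvm=default-dvm firefox &           # start Firefox in a fresh dispVM
sleep 20                                         # crude: give it time to come up
DISP=$(qvm-ls --running --raw-list | grep '^disp' | head -n 1)   # pick a running disposable (naive)
[ -n "$DISP" ] && xl vcpu-pin "$DISP" all "$E_CORES"             # park it on the E-cores
xenpm set-cpufreq-hwp energy-perf:223            # back to power saving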
But there’s more to set up before that…
Is that better than using the native function?
You can use xenpm set-cpufreq-hwp balance/performance/powersave to change the governor, and you can apply it to all cores or to individual cores.
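e.g. (from memory, so double-check the exact per-core syntax with xenpm help):
xenpm set-cpufreq-hwp powersave       # applies to all cores
xenpm set-cpufreq-hwp 3 performance   # applies only to logical CPU 3, if a cpuid is given first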
I guess it depends on what one’s preferences are…I wrote the script to be triggered by a hotkey, but in case by “native function” you mean using the less granular profiles (balance/performance/powersave) instead of setting the energy_perf parameter, the difference seems to be two-fold:
- The profiles change not just that parameter, but also the "min" and "max" values of the "configured limits". This is about "Hardware P-State", which is likely going to be related to, e.g., Intel Speedstep, but as the max is e.g. 60 for my P-cores and 34 for my E-cores, I'm not sure that this is actually describing different frequencies per se; it may be an internal parameter used in an algorithm to derive the actual P-state that will be used.
Setting "energy-perf" directly does not change "min" or "max" at all (which could be a problem with the above script… if the user changes the profile to anything but "balanced" and then uses the above script, the "min" and "max" will remain at their extreme ends); see Edit… the issue is fixed in the new version of my script.
- The profiles are quite extreme: both the "performance" and "powersave" profiles will force min=max at (actual) max and (actual) min, respectively, while "balance" seems to keep both "min" and "max" at their actual values; "energy_perf" is also extreme with the profiles, being 0 for "performance", 128 for "balance" and 255 for "powersave".
So if you're really trying to tease out maximum performance it may indeed make sense to use the "performance" profile, as that will make the cores remain near the highest base frequency no matter whether anything is currently happening or not (at pretty severe energy cost, though, especially for laptop users), while the "powersave" setting seems not very useful, because it will just overly hobble the CPU, making it severely underperform constantly. E.g., here is the updated test data, now including the "performance" and "powersave" profiles (just one run each, and I have more stuff open than last time, so performance is worse for all settings):
Performance (profile): ~14.6 s
Performance (script): ~15.6 s
Balanced (script): ~17.3 s
Power saving (script): ~23.2 s
Powersave (profile): ~98 s (yes, really)
On powersave (profile), even typing in this browser is laggy.
So it’s just way too extreme IMO on both ends, especially for laptops.
Also interesting is that upon reboot the values seem to be reset to a kind of fourth profile, which is almost “balanced” according to “min” and “max” (“min” values are a bit above 0, though, namely at 6 for P-cores and 4 for E-cores) and has an energy-perf value of 102 instead of 128.
So for now I’ll stick with my script, but when I later look into boosting startup time of VMs I may make use of the performance profile.
Edit: since the “fourth profile” that is loaded by default at boot doesn’t allow stepping down the cores all the way to minimum, and also to address the above mentioned potential issue with my script from the last post, I have modified the script in the last post to actually apply the “balance” profile when cycling through the options. This will make sure that “min” and “max” are left at the actual min and max of P-states once “CPU: Balanced” has been toggled, while the other settings still only change the energy-perf parameter.
I have been experiencing different issues with cpupools after upgrading my CPU to 13th gen and Qubes OS to 4.2.0.
Using credit2 scheduler for all pools results in this issue: [PATCH] xen/sched: fix sched_move_domain()
I didn’t have any issues with credit2 before the upgrade, but after I upgraded my system it would sometimes crash during shutdown.
Running pool0 as credit2 and all other pools as credit(1) will almost always crash my system when I move domains between pools.
Using credit(1) for all pools seems to be the only configuration that is currently stable on my system; pool0 can be changed to credit with the sched=credit option.
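For reference, this is roughly the setup (the pool name and CPU list are just examples, and the inline cpus= syntax may differ between xl versions; a small cpupool config file works as well):
# pool0 scheduler is chosen on the Xen command line, e.g. in /etc/default/grub:
#   GRUB_CMDLINE_XEN_DEFAULT="... sched=credit ..."
# the remaining pools are created with the credit scheduler explicitly:
xl cpupool-cpu-remove Pool-0 8-15
xl cpupool-create name=\"ecores\" sched=\"credit\" cpus=\"8-15\"
xl cpupool-migrate <vm-name> ecores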
I'm experimenting with an i7-1260P and I may have found a bug in Xen (on Qubes OS 4.2-RC4).
I didn't enable hyperthreading, so instead of 16 logical cores only 12 are reported:
[root@dom0 ~]# xl cpupool-list
Name CPUs Sched Active Domain count
Pool-0 12 credit2 y 10
but if I assign cores 13 to 16 to a VM, it works:
[root@dom0 ~]# for i in 0 1 2 3 ; do xl vcpu-pin 9 $i 12-15 ; done
[root@dom0 ~]# echo $?
0
I investigated further after doing some benchmarks…
Cores 0 to 3, which should be P-cores, had results identical to cores 4 to 7, which should be E-cores, when hyperthreading is disabled. The performance only drops when using cores 8 to 11; even 12 to 15 work, and both of these CPU sets give the same performance, which is lower than that of cores 0 to 7.
So it seems Xen has an issue with this hybrid CPU and doesn't really disable hyperthreading with smt=off (the default on Qubes OS)?
Maybe interesting to @marmarek, as it could be a security issue?
The i7-1260P has 4 P-cores and 8 E-cores, and E-cores can't hyperthread.
With SMT enabled you have 16 threads/logical cores (8 P and 8 E); without it you have 12 (4 P and 8 E).
You always see the first 8 logical cores; with SMT disabled, 1, 3, 5, 7 are just not used.
Ahh, this makes sense then! What happens if you assign these cores to a VM?
Not sure, but my guess would be that it runs on the sibling thread, e.g. 0-1 are on the same physical core, and the same is true for 2-3, 4-5, 6-7.
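You could check that from dom0 with the topology listing; the cpu_topology section maps each logical CPU to its physical core and socket:
xl info -n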
This is interesting
[root@dom0 ~]# for i in 0 1 2 3 ; do xl vcpu-pin 9 $i 0,2,4,6 ; done
[root@dom0 ~]# for i in 0 1 2 3 ; do xl vcpu-pin 9 $i 1,3,5,7 ; done
libxl: error: libxl_sched.c:62:libxl__set_vcpuaffinity: Domain 9:Setting vcpu affinity: Invalid argument
Could not set affinity for vcpu `0'.
libxl: error: libxl_sched.c:62:libxl__set_vcpuaffinity: Domain 9:Setting vcpu affinity: Invalid argument
Could not set affinity for vcpu `1'.
libxl: error: libxl_sched.c:62:libxl__set_vcpuaffinity: Domain 9:Setting vcpu affinity: Invalid argument
Could not set affinity for vcpu `2'.
libxl: error: libxl_sched.c:62:libxl__set_vcpuaffinity: Domain 9:Setting vcpu affinity: Invalid argument
Could not set affinity for vcpu `3'.
Does xenpm show you all 16 logic cores, with 4 P cores disabled?
xenpm get-cpuidle-states doesn't report anything for CPUs 1, 3, 5, 7.
When set to automatic mode (i.e. no CPU pinning done), does Xen pin random cores to the VM for its lifetime, or does it balance the workload across all available cores? (So it may run on E or P cores randomly if you run tests three times in a row.)