NUMA turned off; only one node available and half my cores are ignored

I have an HP Z840 workstation with dual Xeon E5-2699A v4 CPUs. This configuration has 22 cores and 44 threads per socket, for a total of 44 cores / 88 threads.

However, under Qubes, dmesg prints:

[user@dom0 ~]$ sudo dmesg | grep -E -i '(numa|smp)'
[    0.000000] Linux version 6.6.48-1.qubes.fc37.x86_64 (mockbuild@065a31b3c1ba4c34ab3938416488814f) (gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1), GNU ld version 2.38-27.fc37) #1 SMP PREEMPT_DYNAMIC Wed Sep  4 01:09:59 GMT 2024
[    0.211041] NUMA turned off
[    0.957372] ACPI: Using ACPI (MADT) for SMP configuration information
[    0.963574] smpboot: Allowing 44 CPUs, 0 hotplug CPUs
[    1.834257] Freeing SMP alternatives memory: 48K
[    1.837944] smp: Bringing up secondary CPUs ...
[    1.897703] smp: Brought up 1 node, 44 CPUs
[    1.897708] smpboot: Max logical packages: 1

Later on it prints:

[    2.210349] APIC: NR_CPUS/possible_cpus limit of 44 reached. Processor 44/0x1 ignored.
[    2.210352] ACPI: Unable to map lapic to logical cpu number
[    2.210583] APIC: NR_CPUS/possible_cpus limit of 44 reached. Processor 45/0x3 ignored.
[    2.210585] ACPI: Unable to map lapic to logical cpu number

with the series proceeding through processor 87/0x79.
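
For what it's worth, a quicker way to double-check how many CPUs and NUMA nodes dom0 itself ends up with (rather than scrolling through dmesg) would be something like the following; nproc reports the logical CPUs available to dom0 and lscpu summarizes the NUMA nodes it sees:

[user@dom0 ~]$ nproc
[user@dom0 ~]$ lscpu | grep -E -i '(^CPU\(s\)|numa)'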

NUMA is definitely enabled in the BIOS, as is hyperthreading.

Is this normal with Qubes? I only really look at dmesg when I have a problem, and I have been having big problems with this workstation recently.

I've been running this rig for almost a year with no issues, but in the last couple of weeks it has started to freeze up completely from time to time, with each freeze lasting anywhere from a few seconds to what feels like an eternity but is probably still less than a minute. I suspected my GPU at first, because during the freezes I saw a lot of messages like Fence fallback timer expired on ring gfx and Fence fallback timer expired on ring sdma0, but replacing the video card made no difference. What did eventually solve the issue was backing up my Qubes, secure-erasing the NVMe drive, installing Qubes from scratch, and restoring my Qubes. I finished that process this morning and so far so good (knock wood).

The reason I'm bringing those freezes up is that they're why I was looking at dmesg in the first place. I couldn't testify in court that I was seeing all of my CPUs even before the problems started.

I opened a text console while the Qubes installer was running and looked at dmesg there. Interestingly, it also printed NUMA turned off, although at that time it still listed all 88 “cores” (44 cores, 88 threads). After booting into Qubes, it was back to limiting me to 44 “cores” and one NUMA node.

Is there something intrinsic to Qubes / Xen that is doing this? I don't think so, because the Z820 workstation I had before this one also supported NUMA. I ran that machine for years, and if I hadn't been getting all of my cores I'm pretty sure I would have noticed.

Qubes OS disables hyperthreading (SMT) by default.
Did you specifically enable SMT in the GRUB config?
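
If not, the usual way (as I understand it; paths may differ between BIOS and EFI installs) is to add smt=on to the Xen options in /etc/default/grub and then regenerate the GRUB config, roughly like this:

GRUB_CMDLINE_XEN_DEFAULT="... smt=on"
[user@dom0 ~]$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg

Keep in mind that Qubes disables SMT on purpose because of CPU side-channel issues, so re-enabling it is a security trade-off.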

Sorry, I realize what I said was ambiguous. When I said I enabled it, I meant in the BIOS. I didn't do anything to GRUB.

Thanks for that information; I didn't know Qubes turned off SMT.
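
For my own notes, if I understand it correctly the Xen command line the hypervisor was actually booted with (including any smt= setting) can be checked from dom0 with:

[user@dom0 ~]$ sudo xl info | grep xen_commandline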

I found this page: [Xen-users] NUMA turned off after building & installing Xen 4.5.1 on Debian 8 amd64

Evidently the NUMA turned off line is misleading in a Xen context: I ran xl info -n as outlined on that page, and it printed the following, which is accurate in terms of the number of nodes and the memory available.

numa_info              :
node:    memsize    memfree    distances
   0:    131840      91236      10,21
   1:    131072      88964      21,10
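
The plain xl info output (without -n) also has a few fields that seem relevant here; assuming I'm reading them right, nr_cpus, nr_nodes, cores_per_socket and threads_per_core describe the topology as Xen itself sees it:

[user@dom0 ~]$ sudo xl info | grep -E '(nr_cpus|nr_nodes|cores_per_socket|threads_per_core)'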

Thanks.

I read this page: Xen on NUMA Machines - Xen

I also ran xl vcpu-list, which printed output like this:

Name                                ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
...
sys-firewall                         5     0   62   -b-      12.9  all / 0-43
sys-firewall                         5     1    0   -b-      11.9  all / 0-43
sys-whonix                           6     0   40   -b-      23.6  all / 44-87
sys-whonix                           6     1   76   -b-      26.4  all / 44-87
...

That document indicates that Xen uses CPU affinity to try to keep a domain's memory local to the physical CPUs it runs on … given that, the soft affinities listed above certainly make sense, and I guess it really is working just fine.
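
One last note to self, as an untested sketch based on the CPU numbering in the listing above: if I ever wanted to force a qube's vCPUs onto a single node's physical CPUs by hand, xl vcpu-pin sets the hard affinity at runtime, e.g.:

[user@dom0 ~]$ sudo xl vcpu-pin sys-whonix all 44-87

As far as I can tell that only affects the running domain and doesn't persist across a qube restart.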