Is it safe to use hyper-threading (SMT) with Qubes OS if done the 'correct way'?

Raist · January 17, 2024, 4:20pm

I feel this may need to be a separate post so people can discuss its merits. It seems like a pretty big deal if you can use hyper-threading in Qubes without any downsides, and that is not the conventional wisdom being passed around.

Raist · January 17, 2024, 9:38pm

I did some reading on this through the day. What I can surmise is that the link has been brought up a couple of times in the forums but without much discussion. It looks like it was originally posted in 2019, had some issues working correctly, and marmarek stated it would be added once it wasn’t experimental. This was on Qubes 4.0 / Xen 4.13.

After that it seems the discussion dropped off for a couple years.

Sometime last year that thread was revived and a couple of users did get it working on Qubes 4.1, per the GitHub thread.

I can’t find any info on whether anyone has tried this on Qubes 4.2 or Xen 4.17. However, core scheduling is still listed as experimental for all Xen versions, so it wouldn’t be added officially anytime soon.

The remaining questions are: DOES it work correctly on 4.2, and what are the tradeoffs if you do it yourself? If it provides any additional protection at all, I assume it would still be worthwhile to those who have decided to turn on SMT anyway, as I’ve seen some do.

That’s the most I could find.

renehoj · January 17, 2024, 10:01pm

sched-gran doesn’t work on CPUs that are asymmetric, like the Intel E and P core CPUs.

It also doesn’t solve all problems; transient HT exploits are still possible even if you change the scheduling granularity to core.

If you want to use smt then sched-gran=core is better than cpu, if you can use it, but it doesn’t make smt safe.

Whether smt is worth it depends on your threat model and your workload, but if you need the highest level of security you shouldn’t enable smt; it’s never going to be safe.
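For anyone who decides to try this anyway, here is a minimal sketch of the change being discussed: taking a Xen command line and swapping whatever `smt=` / `sched-gran=` settings it has for `smt=on sched-gran=core`. The function name and the example command line are made up for illustration. Where your system actually stores the Xen command line (often `GRUB_CMDLINE_XEN_DEFAULT` in `/etc/default/grub`) is an assumption you should check yourself, and as noted above this does not work on asymmetric CPUs and does not make SMT safe.

```python
def enable_core_scheduling(xen_cmdline: str) -> str:
    """Return a Xen command line with SMT on and core-granularity scheduling.

    Only string handling is done here; nothing is written to disk.
    """
    # Drop any existing smt= or sched-gran= options so they are not duplicated.
    kept = [
        opt for opt in xen_cmdline.split()
        if not opt.startswith(("smt=", "sched-gran="))
    ]
    return " ".join(kept + ["smt=on", "sched-gran=core"])


if __name__ == "__main__":
    # Hypothetical example value, not copied from any real system.
    example = "console=none dom0_mem=min:1024M smt=off ucode=scan"
    print(enable_core_scheduling(example))
```

The function only builds the new string, so you can compare it with your current setting before touching any boot configuration.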

deceiver · January 17, 2024, 10:21pm

I have it enabled on an Intel machine under that version,
doesn’t give me any problems. Performance benefit is great,
but more wary people ought to wait until it is “security-supported”
in the feature matrix.

Even more paranoid folks, read this.

Raist · January 17, 2024, 10:34pm

This is what I figured the case would be: that it does NOT fix the underlying issue of why Qubes doesn’t want to use it. The git thread should be “safer use of hyperthread…”, not “safe”, which implies too much.

Thanks for confirming it works on 4.2 and that your experience has been good.

Unless others chime in, it sounds like, as with most custom things in Qubes, it’s still a threat model tradeoff/balancing act. Is it “safer” than just enabling SMT? Sounds like it, but it is not equal to the current default mitigation choice of being off.

fsflover · January 17, 2024, 10:48pm

Could you elaborate? Which threat is there when it’s enabled?

Bearillo · January 17, 2024, 11:06pm

Spectre-like attacks could still be used to compromise security barriers inside the same VM, since its processes run on virtual cores that share a physical core. This matters, for example, if the user has removed passwordless sudo and applied other VM hardening measures: a kernel-level process running at the same time as a user-space process on the same physical core via SMT could be a problem. And since most Xen exploits need root in the “origin VM”, this could even be used to attack dom0, though the standard Qubes security model assumes that this is exceedingly unlikely.

Raist · January 18, 2024, 1:05am

Does anyone know why cat /sys/devices/system/cpu/smt/active returns a value of 1, indicating that SMT is on, when I’ve never enabled it and I have HT off in BIOS? I saw some info in lscpu that led me to believe it was on, and after diving further it looks like it is, which makes no sense as it’s supposed to be globally disabled by default.

While writing this I ran that command in an appvm qube and a templatevm qube, and both of those return the expected lscpu values and also the expected SMT value of 0 for off. WTH is it about Dom0 that makes it return incorrect or conflicting info? Dom0 also lists vulnerabilities that the app/template qubes do not. Should Dom0 NOT be used when pulling info? I would’ve thought that Dom0 provides the most accurate information regarding hardware etc.

augsch · January 18, 2024, 5:32am

Xenpm can reliably tell you if smt is on. Run xenpm start 1 in dom0, and the tool will collect info about your CPU for 1 second. If half of the cores are missing information such as P-states statistics, then smt is correctly disabled.
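As a rough illustration of the check described above, here is a small sketch that takes text in the layout `xenpm start 1` prints (a `CPUn:` header followed by C-state and P-state rows) and splits the CPUs into those that have statistics and those that have none. The function names and the sample text are made up for the example, and the "half missing means SMT is off" rule is just the rule of thumb from this post, not anything official.

```python
import re


def cpus_with_and_without_stats(xenpm_output: str) -> tuple[list[int], list[int]]:
    """Split CPU ids by whether any C-state or P-state rows follow their header."""
    rows_per_cpu: dict[int, int] = {}
    current = None
    for line in xenpm_output.splitlines():
        header = re.match(r"^CPU(\d+):", line)
        if header:
            current = int(header.group(1))
            rows_per_cpu[current] = 0
            continue
        if line.startswith("Socket"):
            # The per-socket summary follows; it is not per-CPU statistics.
            current = None
            continue
        # Rows look like "  C0 <tab> 159 ..." or "  P0 <tab> 148 ...".
        if current is not None and re.match(r"^\s+[CP]\d+\s", line):
            rows_per_cpu[current] += 1
    with_stats = sorted(c for c, n in rows_per_cpu.items() if n > 0)
    without_stats = sorted(c for c, n in rows_per_cpu.items() if n == 0)
    return with_stats, without_stats


def half_missing(with_stats: list[int], without_stats: list[int]) -> bool:
    """Rule of thumb from the post: exactly half of the CPUs lack statistics."""
    return len(with_stats) > 0 and len(with_stats) == len(without_stats)


if __name__ == "__main__":
    # Made-up two-CPU sample in the same layout, for illustration only.
    sample = (
        "CPU0:\tResidency(ms)\t\tAvg Res(ms)\n"
        "  C0\t159\t(15.93%)\t0.26\n"
        "  P0\t148\t(100.00%)\n"
        "  Avg freq\t1045000\tKHz\n"
        "\n"
        "CPU1:\tResidency(ms)\t\tAvg Res(ms)\n"
        "  Avg freq\t950000\tKHz\n"
    )
    have, missing = cpus_with_and_without_stats(sample)
    print(have, missing, half_missing(have, missing))
```

To use it on real data you would paste or pipe the dom0 output into `cpus_with_and_without_stats`; the result is only as reliable as the rule of thumb itself.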

augsch · January 18, 2024, 5:40am

If I enable smt, but use cpupool to ensure all app qubes only use one specific thread per core (like using cpu0,2,4,6,8) at the Xen level, would it be more dangerous than simply disabling smt? In theory, nothing is processed on the other thread, so nothing can be leaked by covert channel.

I’d like to do so because some CPUs do not correctly enter lower C-states if smt is disabled, which leads to higher power consumption and shortened battery life.
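To make the "one thread per core" idea concrete, here is a minimal sketch that picks one logical CPU per physical core from a CPU-to-core mapping. The mapping in the example is hypothetical. On a real machine you would have to take it from the hypervisor's own topology listing (I believe `xl info -n` shows it, but treat that as an assumption), because sibling threads are not numbered the same way on every system. This only computes the list; it says nothing about whether the resulting setup is actually safe.

```python
def one_thread_per_core(cpu_to_core: dict[int, int]) -> list[int]:
    """Return the lowest-numbered logical CPU of every physical core."""
    first_thread: dict[int, int] = {}
    for cpu in sorted(cpu_to_core):
        # setdefault keeps the first (lowest) CPU id seen for each core.
        first_thread.setdefault(cpu_to_core[cpu], cpu)
    return sorted(first_thread.values())


if __name__ == "__main__":
    # Hypothetical topology: 4 cores, siblings numbered next to each other.
    adjacent = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
    print(one_thread_per_core(adjacent))

    # Hypothetical topology: siblings are cpu N and cpu N+4 instead.
    split = {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}
    print(one_thread_per_core(split))
```

The two example mappings give different CPU lists, which is why a fixed list like 0,2,4,6,8 should be checked against the actual topology first.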

renehoj · January 18, 2024, 7:13am

@augsch You could just enable hyperthreading in the firmware and use smt=off in Xen. It does the same thing: it prevents Xen from using the sibling cores. HT is still enabled; the extra cores just don’t execute any code.

@fsflover Many of the vulnerabilities found in CPUs rely on SMT being enabled, so disabling SMT greatly reduces the attack surface of the CPU.

Zeno · January 18, 2024, 8:34am

Aren’t 0,2,4,6,8 physical cores, while threads are logical ones?

To my understanding, you would need to attach each and every physical core per qube, because SMT will be enabled globally, but even then each qube will have two threads that will share the control mechanism of how SMT handles cache etc.
Meaning: it can be exploited within the same qube.

Zeno · January 18, 2024, 9:04am

“Correctly” here doesn’t mean performance, but security benefits.
That’s the whole idea of this topic: to check whether this works correctly, i.e. is safe to use.

Raist · January 18, 2024, 2:42pm

Below is what is returned. The P values do not look different between cores, and I’m not sure how to properly read this and come to a conclusion. If you wouldn’t mind, can you confirm whether this looks like SMT is enabled or disabled? I’ve been getting a lot of weird info out of Dom0 lately that I can’t explain.

CPU0:	Residency(ms)		Avg Res(ms)
  C0	159	(15.93%)	0.26
  C1	7	( 0.71%)	0.06
  C2	129	(12.92%)	0.63
  C3	38	( 3.81%)	0.89
  C4	119	(11.89%)	1.39
  C5	103	(10.28%)	2.29
  C6	419	(41.80%)	3.64
  C7	26	( 2.66%)	6.67
  C8	0	( 0.00%)	0.00

  P0	148	(100.00%)
  P1	0	( 0.00%)
  P2	0	( 0.00%)
  P3	0	( 0.00%)
  P4	0	( 0.00%)
  P5	0	( 0.00%)
  P6	0	( 0.00%)
  P7	0	( 0.00%)
  P8	0	( 0.00%)
  P9	0	( 0.00%)
  P10	0	( 0.00%)
  P11	0	( 0.00%)
  P12	0	( 0.00%)
  P13	0	( 0.00%)
  P14	0	( 0.00%)
  P15	0	( 0.00%)
  Avg freq	1045000	KHz

CPU1:	Residency(ms)		Avg Res(ms)
  C0	183	(18.34%)	0.26
  C1	7	( 0.75%)	0.08
  C2	157	(15.70%)	0.52
  C3	25	( 2.51%)	0.72
  C4	142	(14.26%)	1.46
  C5	178	(17.84%)	2.32
  C6	296	(29.55%)	2.79
  C7	10	( 1.05%)	3.50
  C8	0	( 0.00%)	0.00

  P0	173	(100.00%)
  P1	0	( 0.00%)
  P2	0	( 0.00%)
  P3	0	( 0.00%)
  P4	0	( 0.00%)
  P5	0	( 0.00%)
  P6	0	( 0.00%)
  P7	0	( 0.00%)
  P8	0	( 0.00%)
  P9	0	( 0.00%)
  P10	0	( 0.00%)
  P11	0	( 0.00%)
  P12	0	( 0.00%)
  P13	0	( 0.00%)
  P14	0	( 0.00%)
  P15	0	( 0.00%)
  Avg freq	1121000	KHz

CPU2:	Residency(ms)		Avg Res(ms)
  C0	164	(16.42%)	0.26
  C1	0	( 0.06%)	0.06
  C2	165	(16.54%)	0.54
  C3	54	( 5.47%)	0.84
  C4	116	(11.62%)	1.15
  C5	159	(15.92%)	3.13
  C6	332	(33.16%)	3.02
  C7	7	( 0.80%)	3.99
  C8	0	( 0.00%)	0.00

  P0	153	(100.00%)
  P1	0	( 0.00%)
  P2	0	( 0.00%)
  P3	0	( 0.00%)
  P4	0	( 0.00%)
  P5	0	( 0.00%)
  P6	0	( 0.00%)
  P7	0	( 0.00%)
  P8	0	( 0.00%)
  P9	0	( 0.00%)
  P10	0	( 0.00%)
  P11	0	( 0.00%)
  P12	0	( 0.00%)
  P13	0	( 0.00%)
  P14	0	( 0.00%)
  P15	0	( 0.00%)
  Avg freq	1102000	KHz

CPU3:	Residency(ms)		Avg Res(ms)
  C0	165	(16.52%)	0.22
  C1	5	( 0.52%)	0.10
  C2	161	(16.14%)	0.45
  C3	64	( 6.43%)	1.07
  C4	133	(13.32%)	1.35
  C5	173	(17.31%)	2.41
  C6	298	(29.75%)	2.81
  C7	0	( 0.00%)	0.00
  C8	0	( 0.00%)	0.00

  P0	154	(100.00%)
  P1	0	( 0.00%)
  P2	0	( 0.00%)
  P3	0	( 0.00%)
  P4	0	( 0.00%)
  P5	0	( 0.00%)
  P6	0	( 0.00%)
  P7	0	( 0.00%)
  P8	0	( 0.00%)
  P9	0	( 0.00%)
  P10	0	( 0.00%)
  P11	0	( 0.00%)
  P12	0	( 0.00%)
  P13	0	( 0.00%)
  P14	0	( 0.00%)
  P15	0	( 0.00%)
  Avg freq	950000	KHz

CPU4:	Residency(ms)		Avg Res(ms)
  Avg freq	950000	KHz

CPU5:	Residency(ms)		Avg Res(ms)
  Avg freq	950000	KHz

CPU6:	Residency(ms)		Avg Res(ms)
  Avg freq	950000	KHz

CPU7:	Residency(ms)		Avg Res(ms)
  Avg freq	950000	KHz

CPU8:	Residency(ms)		Avg Res(ms)
  Avg freq	950000	KHz

CPU9:	Residency(ms)		Avg Res(ms)
  Avg freq	950000	KHz

CPU10:	Residency(ms)		Avg Res(ms)
  Avg freq	950000	KHz

CPU11:	Residency(ms)		Avg Res(ms)
  Avg freq	950000	KHz

CPU12:	Residency(ms)		Avg Res(ms)
  Avg freq	950000	KHz

CPU13:	Residency(ms)		Avg Res(ms)
  Avg freq	950000	KHz

CPU14:	Residency(ms)		Avg Res(ms)
  Avg freq	950000	KHz

CPU15:	Residency(ms)		Avg Res(ms)
  Avg freq	950000	KHz

Socket 0
	PC1	0 ms	0.00%
	PC2	65 ms	6.53%
	PC3	379 ms	37.90%
	 Core 0 CPU 0
		CC1	0 ms	0.00%
		CC2	0 ms	0.00%
		CC3	36 ms	3.65%
		CC4	0 ms	0.00%
		CC5	0 ms	0.00%
		CC6	114 ms	11.38%
		CC7	539 ms	53.78%
	 Core 1 CPU 1
		CC1	0 ms	0.00%
		CC2	0 ms	0.00%
		CC3	23 ms	2.36%
		CC4	0 ms	0.00%
		CC5	0 ms	0.00%
		CC6	137 ms	13.67%
		CC7	475 ms	47.40%
	 Core 2 CPU 2
		CC1	0 ms	0.00%
		CC2	0 ms	0.00%
		CC3	52 ms	5.19%
		CC4	0 ms	0.00%
		CC5	0 ms	0.00%
		CC6	110 ms	11.02%
		CC7	491 ms	49.01%
	 Core 3 CPU 3
		CC1	0 ms	0.00%
		CC2	0 ms	0.00%
		CC3	61 ms	6.17%
		CC4	0 ms	0.00%
		CC5	0 ms	0.00%
		CC6	127 ms	12.73%
		CC7	461 ms	46.07%

Raist · January 18, 2024, 3:46pm

Agreed. We all know it will give better performance, but does doing it the way described in the main topic give real or imagined safety benefits? From the responses above I gather it absolutely does not equal the safety of the global default SMT=off, but does it provide a suitable middle ground? If it does provide a legitimate middle ground, it should probably be the recommended method of enabling SMT when someone asks, excluding unsupported processors.

Per Renehoj

It sounds like it’s better than nothing, but it would be awesome if someone could give a deeper take on how much better. Renehoj also mentioned transient HT exploits are still possible. As an analogy: is it a door made of paper, which is still a door but won’t matter if someone wants to walk in, so the effort isn’t worth it? Or is it a nice steel-core storm door that, while not a bank vault, is still worthwhile having if you’re looking to keep something out?

When it comes to security, I, and I’m sure others, want a more detailed take on the risk we take on by deciding to do such things. You can’t manage risks if you don’t understand the risks you are taking in the first place. So far Renehoj gives the most insight, but as a newbie I get confused by things like “HT exploits”. I did Google it and understand what it is now, but as someone less knowledgeable on these matters, like I’m sure many Qubes users are, I have no clue where that exploit lands on the scale of importance. I’m sure some exploits are easier to do than others and can have different impacts. What I’m trying to say in super simplified terms is: does SMT + core scheduling take care of more vulnerabilities than it leaves, and are the vulnerabilities that are left serious enough to still not consider doing it at all?

Renehoj, I appreciate you engaging in this topic. Can you confirm whether the above example scenario applies only when SMT is on with no further mitigation, or also to SMT + core scheduling? fsflover’s question leaves open whether he was curious about the threat left when SMT is turned on in general or when SMT is turned on in the context of this topic, so I just want to keep separate any exploits being discussed outside of the main topic’s procedure.

Raist · January 18, 2024, 3:54pm

Of course, as soon as I posted, I was able to find some additional info that answers a lot of these questions.

Core scheduling benefits/limitations explained in detail here for anyone curious.

This here is also a very good read on some of the challenges of preventing these attacks and theorized solutions

I think those two links should address most current (early 2024) questions about what it does or does not do unless anyone else has more to add.

Edit - additional good read from Intel of mitigation guidance while also explaining some of these vulns here

One notable part of this one is risk assessment which is provided below. Good info needed to make informed risk management decisions.

Risk Assessment as of 05/26/2020

Intel has not received any reports of real-world examples of these transient execution attacks being used to compromise system security. However, proofs of concepts that could gather data from unprivileged levels on unmitigated systems have been tested in controlled research environments. This includes data from the OS or from other applications that share certain hardware resources with the attacker.

A successful malicious actor has read only access to the data.

There is no privilege escalation by just using these techniques, meaning this actor cannot become a privileged user of the system.

Remote attacks using the CVEs listed at the beginning of this document are not possible.

When a malicious actor successfully locates an address with secret data, it has a short time window to gather it, as these issues arise as a result of speculative execution. This limits the rate at which data can be extracted. In the cases of MDS and TAA, only a few bytes can be extracted within a successful leak. This actor might need an extended period of time running on the same resources than the victim to collect meaningful information.

You need to carefully consider the type of data that it is processed in your systems and how this data is handled by the software. In systems where no secret data is present, it might still be theoretically possible to infer secret data from the OS.

augsch · January 18, 2024, 3:58pm

From my perspective, I think smt is enabled.

Thanks, but as far as I know, CPU power management needs Xen to explicitly instruct cores to enter different P-states and C-states. If SMT is disabled at the Xen level, Xen will fail to do this to a certain extent. Based on my testing, with SMT kept enabled in the BIOS: if SMT is also enabled at the Xen level, all my cores can correctly enter C0~C3 states; if SMT is disabled at the Xen level, half of my cores will only reside in C0~C1 states (the other half can still enter C0~C3), resulting in power consumption 1 watt higher at idle.

Raist · January 18, 2024, 4:02pm

That’s what I was gathering as well. I think something in Dom0, or in general, is jacked on my device, given other issues I have that I can’t explain. Guess I’ll need to reload.

deceiver · January 18, 2024, 5:03pm

You are right, I didn’t take into account stuff like support
status, threat model, etc when I said that.

PeakUnshift · February 18, 2024, 9:50pm

Thanks everyone for gathering all this info! There is still something I’m not sure I understand.

SMT is enabled by default on other Linux OSes; let’s take Fedora as an example. So it means Fedora is vulnerable to all the attacks mentioned. Does enabling SMT on Qubes OS make it “at worst” as secure as Fedora, given that Qubes OS provides more security in many other ways?

Thanks!