That topic you linked to doesn’t mention a firmware bug. And Unman is affected by the sluggishness as well, and he has the experience to notice the firmware bug error message if there were one on his system.
So it could be that the additional delays that others and I are experiencing aren’t caused by the firmware bug. Meaning it could be two separate problems.
So if the additional delays aren’t caused by the firmware bug, then what does the APIC mismatch actually cause?
That’s why we need an experienced Qubes OS engineer who understands the nuances and can write documentation about the error for us.
Is it an error that can be ignored? Or are there hidden problems/risks by not solving it? And how would it be solved?
Those are questions that the documentation for the firmware bug APIC mismatch should answer, and it should be written by an experienced official Qubes OS developer.
Add it to the docs, and let people following this topic know when it’s done.
I’ll be here and try to help if there’s anything more I can be useful for.
Indeed it could, or maybe they are, or there may be different causes for each. There may be multiple causes, or the delays may only be due to changes in Fedora, and outside the boundary of Qubes-OS.
I doubt that the alerts you see are causing your problems, but only you can do the test to try to eliminate them, and so to find out.
I started seeing this bug when I upgraded to version 4.3 (no problem with 4.2).
I mentioned it before, but since it didn’t seem to bother me and I had received no reply, I dropped it—lol.
From what I found in my research, it appears this bug has existed for a long time. However, the bug occurs on two of my laptops and I notice a heavier CPU usage than before even when idle and with very few VMs running. I also see a large discrepancy in CPU usage inside Xen: for example, right now my CPU in Dom0 is at 28% but in Xen it shows 34% at max frequency (I mean Xen, not Fedora — I know Fedora doesn’t display frequencies correctly). I only use Mullvad.
On my other laptop (admittedly older), without doing anything I’m still between 30% and 40% (even after a fresh install).
Anyway, I’m taking advantage of this thread to mention it, and since so far people have only talked about the bug without providing other info, here are the results of these two commands: sudo dmesg | grep -i apic:
[11326.107410] [Firmware Bug]: CPU 1: APIC ID mismatch. Firmware: 0x0002 APIC: 0x0001
[11326.109213] [Firmware Bug]: CPU 2: APIC ID mismatch. Firmware: 0x0004 APIC: 0x0002
[11326.110235] [Firmware Bug]: CPU 3: APIC ID mismatch. Firmware: 0x0006 APIC: 0x0003
[21155.374834] [Firmware Bug]: CPU 1: APIC ID mismatch. Firmware: 0x0002 APIC: 0x0001
[21155.375964] [Firmware Bug]: CPU 2: APIC ID mismatch. Firmware: 0x0004 APIC: 0x0002
[21155.376983] [Firmware Bug]: CPU 3: APIC ID mismatch. Firmware: 0x0006 APIC: 0x0003
[40582.480984] [Firmware Bug]: CPU 1: APIC ID mismatch. Firmware: 0x0002 APIC: 0x0001
[40582.482036] [Firmware Bug]: CPU 2: APIC ID mismatch. Firmware: 0x0004 APIC: 0x0002
[40582.483058] [Firmware Bug]: CPU 3: APIC ID mismatch. Firmware: 0x0006 APIC: 0x0003
[58448.546903] [Firmware Bug]: CPU 1: APIC ID mismatch. Firmware: 0x0002 APIC: 0x0001
[58448.547970] [Firmware Bug]: CPU 2: APIC ID mismatch. Firmware: 0x0004 APIC: 0x0002
[58448.549033] [Firmware Bug]: CPU 3: APIC ID mismatch. Firmware: 0x0006 APIC: 0x0003
and this one:
sudo dmesg | grep -i irq
[ 5.126711] xen: --> pirq=16 -> irq=16 (gsi=16)
[ 5.147898] xen: --> pirq=17 -> irq=17 (gsi=17)
[ 5.209254] Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled
[ 5.212516] hpet_acpi_add: no address or irqs in _CRS
[ 5.467043] ata4: SATA max UDMA/133 abar m2048@0xd4436000 port 0xd4436280 irq 152 lpm-pol 3
[ 5.467415] i8042: PNP: PS/2 Controller [PNP0303:PS2K,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
[ 5.469647] serio: i8042 KBD port at 0x60,0x64 irq 1
[ 5.469700] serio: i8042 AUX0 port at 0x60,0x64 irq 12
[ 5.469704] serio: i8042 AUX1 port at 0x60,0x64 irq 12
[ 5.469707] serio: i8042 AUX2 port at 0x60,0x64 irq 12
[ 5.469709] serio: i8042 AUX3 port at 0x60,0x64 irq 12
[ 43.384523] xen: --> pirq=18 -> irq=18 (gsi=18)
[11326.108555] cpu 1 spinlock event irq 126
[11326.109507] cpu 2 spinlock event irq 132
[11326.110545] cpu 3 spinlock event irq 138
[21155.375194] cpu 1 spinlock event irq 131
[21155.376250] cpu 2 spinlock event irq 137
[21155.377278] cpu 3 spinlock event irq 155
[40582.481356] cpu 1 spinlock event irq 131
[40582.482328] cpu 2 spinlock event irq 137
[40582.483358] cpu 3 spinlock event irq 155
[58448.547276] cpu 1 spinlock event irq 131
[58448.548256] cpu 2 spinlock event irq 137
[58448.549334] cpu 3 spinlock event irq 155
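For anyone who wants to check their own output the same way, here is a minimal Python sketch that parses the mismatch lines. The helper names (`parse_mismatches`, `firmware_is_double_apic`, `count_bursts`) are my own and not part of Qubes or Xen. In the output above the firmware ID is exactly twice the APIC ID for each CPU, and the messages come in groups of three at the same timestamps as the spinlock lines. What that pattern means is an assumption on my part; the code only checks whether the pattern is there.

```python
import re

# Matches kernel lines like the ones quoted above, e.g.
# "[11326.107410] [Firmware Bug]: CPU 1: APIC ID mismatch. Firmware: 0x0002 APIC: 0x0001"
MISMATCH = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[Firmware Bug\]: CPU (?P<cpu>\d+): "
    r"APIC ID mismatch\. Firmware: (?P<fw>0x[0-9a-fA-F]+) APIC: (?P<apic>0x[0-9a-fA-F]+)"
)


def parse_mismatches(dmesg_text):
    """Return a list of (timestamp, cpu, firmware_id, apic_id) tuples."""
    rows = []
    for line in dmesg_text.splitlines():
        m = MISMATCH.search(line)
        if m:
            rows.append(
                (float(m["ts"]), int(m["cpu"]), int(m["fw"], 16), int(m["apic"], 16))
            )
    return rows


def firmware_is_double_apic(rows):
    """True if every mismatch has firmware ID == 2 * APIC ID (the pattern seen above)."""
    return bool(rows) and all(fw == 2 * apic for _, _, fw, apic in rows)


def count_bursts(rows, gap=1.0):
    """Count groups of messages separated by more than `gap` seconds.

    The 1-second default is an arbitrary choice for illustration.
    """
    stamps = sorted(ts for ts, _, _, _ in rows)
    if not stamps:
        return 0
    return 1 + sum(1 for a, b in zip(stamps, stamps[1:]) if b - a > gap)


# First two groups of lines copied from the dmesg output above.
SAMPLE = """\
[11326.107410] [Firmware Bug]: CPU 1: APIC ID mismatch. Firmware: 0x0002 APIC: 0x0001
[11326.109213] [Firmware Bug]: CPU 2: APIC ID mismatch. Firmware: 0x0004 APIC: 0x0002
[11326.110235] [Firmware Bug]: CPU 3: APIC ID mismatch. Firmware: 0x0006 APIC: 0x0003
[21155.374834] [Firmware Bug]: CPU 1: APIC ID mismatch. Firmware: 0x0002 APIC: 0x0001
[21155.375964] [Firmware Bug]: CPU 2: APIC ID mismatch. Firmware: 0x0004 APIC: 0x0002
[21155.376983] [Firmware Bug]: CPU 3: APIC ID mismatch. Firmware: 0x0006 APIC: 0x0003
"""
```

To use it on a real system, save the output of sudo dmesg to a file, read it in, and pass the text to `parse_mismatches`.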
Hope this helps (me and/or others).
Edit: I haven’t tried downgrading to Xen 4.17, so as not to break my system! lol
Yes @OvalZero, I saw this page, but I still don’t understand it now that I look at it again.
Never mind, my CPU consumption is perhaps normal, or it’s something else… forget it! And thanks for answering.
I may be totally lost here, but is this the solution to the OP? I checked sudo dmesg | grep -i apic on three different Qubes devices (r4.2 and r4.3) and didn’t find any mismatches. So without a link to the discussion (or lack thereof) among the devs, I’m not clear what conclusion I should be drawing from this thread.
As far as I understand, it depends on the BIOS because there are seemingly different ways to describe and read x86 topology. It seems that Xen expects things to be handled differently from the way that some firmware developers handle them. However (and again: educated guessing), it’s just a cosmetic “error”. As long as all cores come up (and SMT gets disabled) and APIC routing gets initialized, everything is fine AFAIU.
I didn’t find that page when I searched for apic mismatch before. So that’s a great find!
I also think it’s frustrating that I can’t even read the text on the webpage without enabling JS. We shouldn’t need JS to read text. Maybe someone can copy and paste the summary or abstract here, where everyone can read it without enabling JS for another domain.
But I am also not experienced enough to draw any conclusion on how relevant that page is.
What I’m thinking is that this APIC mismatch is only relevant when using virtual machines.
But it makes a big difference whether you are using Xen or KVM or something else.
So if it’s the case that this is caused by a Linux kernel bug (I don’t know if that’s correct),
then it is still worth asking how Qubes OS handles that bug.
And also, why did this bug start to appear in 4.3 but not 4.2? What could have triggered that?
And that’s why I have been making the point that it’s important that an official Qubes OS developer writes the documentation for the error, because they are the most qualified.
I did this on one of mine (AMD, 4.3) and found the mismatch messages.
Looked in the BIOS and found SMT was set to “Auto”.
Changed it to “Off”.
No more mismatch messages, because the kernel finds it has the same number of cores after Xen/kernel has configured “smt=off” as the number announced by the BIOS.
I see no other change in performance or behaviour, except that my BIOS claims that S3 sleep is not available without SMT.
It is exactly as @renehoj suggested. There is not even a real bug - BIOS does the right thing, Xen does the right thing. They tell the kernel two different stories about the number of cores, but the kernel does the right thing anyway.
Unless there is some other reason to suspect a bug, then we can rest assured that our wonderful developers are doing just the right thing by ignoring these nuisance messages.
That’s a reasonable conclusion.
But I don’t agree with:
Because maybe the devs think it’s obvious, with the specialized experience developing qubes os.
But for everyone else, it’s not obvious, and there is no documentation.
If there is an error message, there should be documentation for it.
Otherwise, it’s only natural that the users will end up here asking about it.
And since it’s an OS made for security, users are unlikely to be comfortable having undocumented errors in their system.
I’m honestly shocked that I’m not seeing more support that error messages should have documentation, in a community of tech savvy users. It has felt like I’m fighting against the stream, that I’m the only one who wants documentation, which is the standard practice in IT. I hope I’m just misunderstanding the tone in some of the posts here because it has felt like I’ve faced resistance against wanting documentation.
Yes, Qubes OS makes it very secure to browse untrusted websites.
But disabling JS is also about avoiding tracking, not only improving security by reducing attack surface.
Or perhaps it’s resistance to the tone of your own inferences, which seem to me both mistaken and belligerent. With nearly 2,000 open Github issues vying for dev attention, the likelihood of such documentation bubbling up to become any kind of dev priority seems remote.
Folk here can help confirm the diagnosis - that is the only thing stopping me from looking for the best place to submit a pull request to document these “APIC firmware bug” messages which look worrying but are apparently harmless.
@Tezeria , @plankretriever : up to now you are the only other people who mention seeing this. If you would like to help:
Can you find the SMT or hyperthread item in your BIOS? How is it named? Is it enabled or “Auto”?
Does the message go away if you change it to Off or Disabled?
Does it change anything?
Is your system AMD or Intel? **
The function is not used by Qubes, so there should be exactly zero impact - if there IS a change, or if the item is already off, then that would be a real firmware bug.
@ephile and anyone else, who does not see these messages:
Do you already have SMT or hyperthread item set to Off in your BIOS? How is it named? Is it AMD or Intel? **
** Minor Notes
* The documentation about contributing documentation seems like it might also need a PR. I cannot see the “edit on Github” button, but maybe this is only temporary. There was some discussion about it, I remember.
** AMD/INTEL question is optional, only useful to help people find the item.
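To make the answers easy to compare, here is a minimal Python sketch that turns the questions above into a one-line report. The function names are hypothetical and only for illustration. It assumes you paste in the text of /proc/cpuinfo and of sudo dmesg from dom0, and that you type in the BIOS SMT setting yourself, since the script does not try to read the BIOS.

```python
def cpu_vendor(cpuinfo_text):
    """Return "AMD", "Intel", or the raw vendor_id string from /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "vendor_id":
            raw = value.strip()
            return {"AuthenticAMD": "AMD", "GenuineIntel": "Intel"}.get(raw, raw)
    return "unknown"


def has_apic_mismatch(dmesg_text):
    """True if the kernel log text contains the message discussed in this thread."""
    return "APIC ID mismatch" in dmesg_text


def report(cpuinfo_text, dmesg_text, smt_setting):
    """Build a one-line summary; smt_setting is whatever the BIOS menu shows."""
    seen = "yes" if has_apic_mismatch(dmesg_text) else "no"
    return (
        f"CPU vendor: {cpu_vendor(cpuinfo_text)}; "
        f"BIOS SMT setting: {smt_setting}; "
        f"mismatch messages: {seen}"
    )
```

Run it once with the current BIOS setting and once after changing it to Off or Disabled, then post both lines here.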
It’s certainly different. But the ‘underlying problem’ (i.e. handling a discrepancy between the actual and ‘announced topology’) seems to be the same. So again: It seems to be BIOS-related.
It’s an Intel CPU which has hyperthreading, and the BIOS doesn’t allow changing its setting.
Regarding why few users have reported it, we can only guess on the reason.
But first of all, is 2 users reporting it on this forum really that few?
It’s generally known that most users don’t report or speak up. So if 2 have reported the issue on this forum, then how many more are there that have chosen to not speak?
It’s also possible that the reason I noticed these error messages is because I also am affected by the sluggish 4.3 system which many other users have reported.
So because of that problem, the disk decryption screen is degraded, with no GUI. So I press the Escape key and see all the firmware bug errors.
The point is, there could be many more users affected without knowing it.
What is the hypothesis?
Yeah it seems like it because some people who have hyperthreading enabled/auto have no firmware bug, while others do have the firmware bug.
But the question still remains why did this firmware bug appear in 4.3?
And what could it be that’s different about the BIOS which causes this bug for some?
From security vulnerability perspective, it seems harmless but the most qualified people have made no comment on that.
My concerns are no less reasonable than the one you just made.
Do you know about the concept of red team and blue team?
And that in software development there are developers who build, and there are QA testers who critique the work and point out potential risks?
In a security focused community, I would think my reasonable concerns would be more welcomed.
You have to admit that what I’m saying is reasonable. And I also admit that what you’re saying is reasonable, but your arguments aren’t bullet proof.
If the answers to the questions in this topic are so obvious to the devs, why don’t they answer them if it’s so simple?
ADW was in the topic twice but didn’t answer.
And if they are so overloaded with important issues that they have to prioritize, why is ADW choosing to answer beginner questions like this one: I am a new to QubesOS and need some guidence - #2 by adw
So you have to admit, although you have a point, there are also weaknesses to it.
And there are also weaknesses to my points.
That’s a lot of what security is about. We have to look at the risks. It’s not always 100% or 0%. Sometimes it’s somewhere in between.
Again, it seems: Some BIOSes announce the topology in a way that some Xen versions can’t interpret as expected. Update either side and things possibly change. As long as all CPUs/cores are brought up and SMT is disabled and the APIC is initialised, everything is fine. (Hybrid CPUs not being handled properly by Xen aside. But that’s not a fundamental issue, since it only affects performance in certain situations.)
In a resource constrained environment, all QA concerns will necessarily be triaged. As someone who has successfully raised several issues, which were solved in a timely manner, despite my never joining Github, my 2 cents is to treat others with respect and carefully follow the guidelines if you want others to allocate their time and attention to a problem. Some of us are willing to be patient and address a reasonable problem in spite of unreasonable aggressiveness in tone, but I don’t think our (very productive) devs should ever have to exercise such patience with members of the community forum, hence the guidelines.