With this patch enabled, it crashes with the error I posted previously: “Domain 4:Offset 0x000e:0x49090000 expands past register size (1)”, “xen_pt_config_reg_init: Offset 0x000e mismatch! Emulated=0x0080, host=0x49090000, syncing to 0x49090000”.
I still have no idea how to fix this, but what the issue actually is seems a bit clearer to me.
From the logs, there seems to be a difference between standard Qubes and the new Xen.
The flag “PCI_BASE_ADDRESS_MEM_TYPE_64” (0x04) seems to be used: I see type 0x04 in my custom build, while standard Qubes OS seems to use “PCI_BASE_ADDRESS_MEM_TYPE_32”. To be confirmed. Still no idea what it means for the fix I need to make.
Update: This specific issue is fixed; I had made some mistakes when upgrading the RPM spec for qubes-vmm-xen. PCI passthrough still doesn’t work, but it now crashes a bit later in the initialization steps. Regarding the “rdm check flag”, I will try to learn what it is.
Major update:
The libvirt error message was a bit misleading.
However, the Xen error message was quite explicit and directly suggested that I try setting the “permissive” attribute.
I am posting this message from my custom Qubes build, with Xen 4.16.2, libvirt 8.9.0, QEMU 7.1.
(A lot more work is still required: testing, a lot of testing; cleaning the code; trying to reduce the size of the diff between my fork and the official Qubes OS; rewriting the git commit history (don’t look at it, it was my try-and-die workflow); and many other things. But now I am certain that I will make it work the way I want.)
Marmarek recently submitted a PR to QubesOS/qubes-vmm-xen on GitHub. The PR upgrades Xen to version 4.17-rc3, which I think is what the next release of Qubes OS will rely on.
On my ASUS X670 Strix F + 7950X, I first need to add “x2apic=false” to the kernel options to boot Qubes. As for the TSC issue, the frequency found by the system is wrong.
In dom0, the TSC is calibrated to 4491.520 MHz, which is about right (approximately the frequency of the CPU; I need to read a bit more about TSC and why it uses a static frequency on a CPU with dynamic frequency).
In domU, the TSC is calibrated to 196 MHz, and according to “/proc/cpuinfo” the domU system believes the 7950X is running at 196 MHz. That is wrong, and it makes the domU unusably slow.
A workaround I found is to manually override the configuration file used by libvirt/Xen to start a domU.
Copy the libvirt configuration file to the Qubes directory to override the configuration used: cp /etc/virsh/libxl/DOMU_NAME.xml /etc/qubes/templates/libvirt/xen/by-name/
(create the directories if they don’t exist yet)
Then, in the XML, search for the “clock” tag and force the TSC mode to “emulate” instead of “native”.
As for a real fix for this issue, I have no idea yet.
I am not sure where the issue is; my first guess would be a bug in Xen or libvirt.
It could also be a bug in the BIOS, I think; a lot of things are broken in the BIOS.
Hello, by default every VM uses the tsc clocksource (clocksource=tsc was recently added to the default kernel options).
After spending a bit more time on the issue, the root cause seems to be that the CPU information provided to the domU (the CPU frequency) is wrong.
From a chat on the Xen IRC with a maintainer:
so this is a massive rats nest with virt. By default, VMs are created to be migrateable, and that means no Invariant TSC feature. Guests work fine, but report wonky values
if you don’t plan to migrate the VM, you can set itsc=1 in your vm config file, and then the TSC clocksource ought to be happier
From my understanding, Qubes OS is already using invariant TSC, via this option in libvirt: <feature policy='require' name='invtsc'/>.
For the moment, no real progress on finding what is actually broken.
Given what “invariant TSC” is supposed to be and what my issue looks like, I am wondering whether it is the invariant TSC itself that is broken.
I am now doing a bit of reading:
Processor Programming Reference for AMD CPU, family 25 (0x19): https://www.amd.com/en/support/tech-docs?keyword=PPR
Invariant TSC is a feature of the CPU itself.
Qubes has never worked with an AMD CPU of family 25 before; could there be a bug specific to Xen + family 25 + invariant TSC?
I am reading a bit of the Xen source code, such as xen/arch/x86/cpu/amd.c in xen-project/xen on GitHub, where c->x86 is the CPU family. Ryzen 1 is family 0x17 (one of my computers is a Ryzen 1 and it works perfectly with Qubes). So I am searching for suspicious things related to the CPU family for AMD CPUs.
Still no new answer from ASUS support about the BIOS, except that the problem is a bit more complex than expected and that it will take more time to understand.
A lot of new things to learn
Some more tests:
On my Ryzen 1 computer, the policy <feature policy='require' name='invtsc'/> seems to have no influence: TSC is happy, /proc/cpuinfo is always correct (tried policy='require' and policy='disable').
On my Zen 4 computer (the 7950X), it also seems to have no influence: TSC is not happy, /proc/cpuinfo is always wrong.
I modified the BIOS parameters a bit to see what they do. After the modification, the frequency reported in /proc/cpuinfo changed from 196 MHz to 205.166 MHz. I don’t know which specific parameter is responsible for that.
Nice. I’ll probably switch to a zen 3 cpu in the near future (once zen4 starts pushing down zen3 (second hand) prices), but it’s nice to know zen4 support will be there, so thanks. Sadly even though AMD iirc is a partner of Xen, there seem to be a few issues with the speed at which they add actual support to the HV, plus xen isn’t exactly good about communicating about this sort of stuff.
For the moment no progress on my side.
From my IRC comment:
My issue seems to be an integer overflow. Somewhere there is an unsigned 32-bit integer storing the CPU frequency in Hz, and this variable is responsible for passing the CPU frequency information to domU. When I downclock my CPU to below 4,294,967,295 Hz, the correct CPU frequency is passed to domU. Above that, it wraps around and starts again from 0 Hz. That explains why my domU shows ~205 MHz when my real CPU is running at ~4500 MHz. I am hunting for this integer so it can be switched to a 64-bit integer. I am starting with the Xen codebase; if someone has a hint on where to look specifically, it would help. If not, I will probably be able to find it, but it is going to take me a few days, I think.
I am trying to understand “what does what” in the Xen source, but it is going to take a while. I am trying to find the part of the code that gives the vCPU information to a domU.
Another issue for later: the tool “xenoprof” doesn’t support AMD family 25 (explicit statement in the logs).
I don’t remember whether it is because of the things I tried to patch or because I never tested it, but “PV” works as expected, with the correct CPU frequency.
Only PVH and HVM are problematic.
I am no longer so sure that the issue is in the Xen hypervisor codebase. Maybe it is directly in the Linux kernel, in the Xen-specific part (arch/x86/xen in torvalds/linux on GitHub).
Going to take some more time TT
Update: Another funny thing to note, and to understand or fix later: when starting a PVH Linux domU, the Linux kernel identifies it as HVM and not PVH. This is already the case on standard Qubes on supported hardware.
This line prints “Hypervisor detected: Xen HVM” in the case of Xen PVH.
Related code:
We see this global variable being reassigned:
just before calling “xen_pvh_domain()”, which is defined as reading the global variable “xen_pvh”:
“CONFIG_XEN_PVH” is defined in the Qubes Linux kernel configuration, from what I see.
I don’t know whether it is an issue or not, but it feels weird that, when using PVH, the Linux kernel explicitly states that it thinks it is an HVM.
Update 2: The kernel later figures out that it is a PVH. So nothing to see here.
Anyway, that was not what I was trying to debug. The rabbit hole is deep.
Did some more testing.
Tracked the CPU frequency back to here:
In case of PV mode (dom0 or guest):
tsc_shift = -2; tsc_to_system_mul = 3_824_888_891
In case of PVH or HVM mode:
tsc_shift = 3; tsc_to_system_mul = 2_730_337_484
The calculation done by pvclock_tsc_khz to determine the CPU frequency seems to be correct and free of overflow. The input data (tsc_to_system_mul and tsc_shift) seems to be the source of the issue.
More debugging is needed to reach the root of the issue.
Difference between PVH and HVM mode:
In the case of HVM, the CPU is correctly calibrated using the PIT method (this method finds the correct frequency):
So the calculated CPU frequency and the TSC frequency are different.
Later in the code, the Linux kernel prefers to use the TSC frequency instead of the CPU frequency.
That may explain why a Windows HVM guest works correctly and a Linux HVM guest does not.
UPDATE, more debug information:
Getting closer.
By applying those 3 lines (to reproduce the same behavior as PV in this function), PVH and HVM now start with the correct frequency. So I am getting much closer to the root of the issue.
For the PVH and HVM modes, the function void set_time_scale(struct time_scale *ts, u64 ticks_per_sec) receives an incorrect value for “ticks_per_sec”.
UPDATE 2
I think I found it:
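“d->arch.tsc_khz” is an unsigned integer. The value expected by set_time_scale is a u64.
Since there is no cast from u32 to u64 before the multiplication by 1000 (from kHz to Hz), the multiplication is done in 32 bits and overflows; the result is only widened to u64 afterwards.
With an explicit cast to u64 before the multiplication, it should work.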
Testing it. Going to take some hours.
UPDATE 3
I confirm that this is the root of the issue. I fixed it on my side, and everything seems to work as expected.
Now I need to make a clean patch and talk with the Xen developers to get it integrated.
UPDATE 4
The patch should now have been sent to the xen-devel mailing list.
Copy here:
From c1535eba0bba6fc1b91f975f434af0929d9d7c96 Mon Sep 17 00:00:00 2001
Message-Id: <c1535eba0bba6fc1b91f975f434af0929d9d7c96.1671298409.git.xen@neowutran.ovh>
From: Neowutran <xen@neowutran.ovh>
Date: Sat, 17 Dec 2022 17:17:03 +0100
Subject: [Patch v1] Bug fix - Integer overflow when cpu frequency > u32 max value.
xen/arch/x86/time.c: Bug fix - Integer overflow when cpu frequency > u32 max value.
What is was trying to do: I was trying to install QubesOS on my new computer
(AMD zen4 processor). Guest VM were unusably slow / unusable.
What is the issue: The cpu frequency reported is wrong for linux guest in HVM
and PVH mode, and it cause issue with the TSC clocksource (for example).
Why this patch solved my issue:
The root cause it that "d->arch.tsc_khz" is a unsigned integer storing
the cpu frequency in khz. It get multiplied by 1000, so if the cpu frequency
is over ~4,294 Mhz (u32 max value), then it overflow.
I am solving the issue by adding an explicit cast to u64 to avoid the overflow.
---
xen/arch/x86/time.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index b01acd390d..7c77ec8902 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -2585,7 +2585,7 @@ int tsc_set_info(struct domain *d,
case TSC_MODE_ALWAYS_EMULATE:
d->arch.vtsc_offset = get_s_time() - elapsed_nsec;
d->arch.tsc_khz = gtsc_khz ?: cpu_khz;
- set_time_scale(&d->arch.vtsc_to_ns, d->arch.tsc_khz * 1000);
+ set_time_scale(&d->arch.vtsc_to_ns, (u64)d->arch.tsc_khz * 1000);
/*
* In default mode use native TSC if the host has safe TSC and
--
2.38.1
Thanks for your continued work. Judging by the number of “hearts” on this thread, there are several other people interested in this as well. It would not be an exaggeration to say I look at this a couple of times a day to gauge the progress you and others have been making! Thanks again.
Long story short: which Ryzen version is the highest that works perfectly (including its iGPU) with the current, up-to-date Qubes OS 4.1.1 (let’s imagine the user can install and update Qubes OS on a different PC)? 5***, 4*** or what, and how should one select a Ryzen for this?
Does it make any sense to buy a 6*** or 7*** series at this point if the user wants it to work almost out of the box on Qubes OS?
@balko this thread is about what needs to be done to be able to use Qubes OS with the Ryzen 7000 series.
I do not know the potential issues of previous generations. However, since at the moment the Xen hypervisor version used in the stable release does not support CPU family 25, Ryzen 7***, 6**** and 5**** should not work.
For the GPU passthrough:
On my old computer I have an RX580 that I can pass through to a Linux HVM for gaming.
I noticed that there seems to be a bug in the Linux kernel’s PCI handling: the passthrough works with LTS kernel 5.4, but fails if I upgrade the kernel to 5.6.?+ (I can start the HVM, but when I try to activate the GPU it fails with an unhelpful error message).
On my new computer, I restored the Linux HVM. However, when I start it, it crashes with a kernel-related error / memory violation.
It is directly related to the GPU passthrough (if I do not do the PCI passthrough, the HVM starts correctly).
If I upgrade the kernel to a newer version, I can start the HVM but end up with the same kernel bug as on my old computer.
So there are at least 2 different issues.
One of the issues is a regression in the Linux kernel related to PCI handling; the regression was introduced around 5.6.X. This should be the easiest bug to find, since I can narrow the scope by upgrading to newer kernels until I find which specific version introduced the bug, and then look for it in the commits / source code. But I expect it to be very time consuming, again (at the beginning of the process I could use the distribution archives to speed things up, since I would not need to compile everything).
For the second issue, I have no idea at the moment. Something related to the QEMU version? Related to the Linux kernel used to launch QEMU? A Xen dependency in the VM that is not the correct version? A lot of testing is required to narrow down the possibilities. (Try with GPU passthrough, without, and with it but without strict reset. Try all of the above with a non-GPU PCI device. Try different kernel versions, since it is directly related to the Linux kernel version used.)
Update
For the second issue, it feels like it is related to the xen_blkfront and xen_blkback drivers in the Linux kernel. Maybe a given Xen hypervisor version requires the guest to have some specific version of the Linux kernel. Anyway, I won’t focus on this issue.
For the first issue, the kernel log indicates (on my Zen 4 computer, HVM kernel is 6.0.12):
Thanks a lot for the information. I’m just a bit overwhelmed by the amount of information about Ryzen on the forum (I have used Intel for Qubes OS for ages). But Ryzen, thanks to its performance, looks promising and tempting.
I will use Intel for some time more.
Thanks again, your work with Ryzen is very much appreciated.