Did some more testing.
Tracked back the cpu frequency to here:
In case of PV mode (dom0 or guest):
tsc_shift = -2 ; tsc_to_system_mul: 3_824_888_891
In case of PVH or HVM mode:
tsc_shift = 3; tsc_to_system_mul: 2_730_337_484
The calculation done by pvclock_tsc_khz to determine the CPU frequency seems to be correct and free of overflow. The input data (tsc_to_system_mul and tsc_shift) seems to be the source of the issue.
More debugging is needed to reach the root cause.
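To sanity-check the two pairs of values above, the conversion can be reproduced outside the kernel. This is only a minimal sketch modeled on the arithmetic of pvclock_tsc_khz, not a copy of the kernel source; the helper name tsc_khz_from_scale and the TSC_DEMO_MAIN guard are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Sketch: recover a TSC frequency in kHz from a pvclock-style
 * (tsc_to_system_mul, tsc_shift) pair.
 * Assumptions: mul is non-zero and |shift| is small (well under 32).
 * Hypothetical helper, modeled on the arithmetic of pvclock_tsc_khz.
 */
static uint64_t tsc_khz_from_scale(uint32_t tsc_to_system_mul, int8_t tsc_shift)
{
    /* 10^6 in 32.32 fixed point, divided by the multiplier. */
    uint64_t khz = (1000000ULL << 32) / tsc_to_system_mul;

    /* Undo the shift that was applied when the scale was built. */
    if (tsc_shift < 0)
        khz <<= -tsc_shift;
    else
        khz >>= tsc_shift;
    return khz;
}

#ifdef TSC_DEMO_MAIN
int main(void)
{
    /* Values copied from the PV and PVH/HVM observations above. */
    printf("PV:      %llu kHz\n",
           (unsigned long long)tsc_khz_from_scale(3824888891u, -2));
    printf("PVH/HVM: %llu kHz\n",
           (unsigned long long)tsc_khz_from_scale(2730337484u, 3));
    return 0;
}
#endif
```

Compiled with -DTSC_DEMO_MAIN, the PV pair works out to roughly 4.49 GHz and the PVH/HVM pair to roughly 197 MHz, which is consistent with the input data being wrong rather than the calculation.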
Difference between PVH and HVM mode:
In case of HVM, the CPU is correctly calibrated using the PIT method (correct frequency found using this method):
So the calculated CPU frequency and the TSC frequency are different.
Later in the code, the Linux kernel prefers to use the TSC frequency instead of the CPU frequency.
That may explain why a Windows HVM guest works correctly and a Linux HVM guest does not.
UPDATE, more debug information:
Getting closer.
By applying these 3 lines (to reproduce the same behavior as PV in this function), PVH and HVM now start with the correct frequency. So getting much closer to the root cause.
For the PVH and HVM modes, the function void set_time_scale(struct time_scale *ts, u64 ticks_per_sec) receives an incorrect value for “ticks_per_sec”.
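To see how a wrong “ticks_per_sec” turns into a wrong shift and multiplier, here is a simplified sketch of this kind of normalization. It is my own reconstruction of the general technique, not the Xen source; the names scale_sketch, scale_from_hz, NSEC_PER_SEC_SKETCH and the input frequencies are hypothetical and chosen for illustration only.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the hypervisor's time-scale structure. */
struct scale_sketch {
    int shift;
    uint32_t mul_frac;
};

#define NSEC_PER_SEC_SKETCH 1000000000ULL

/*
 * Sketch: bring ticks_per_sec into the range (1e9, 2e9] by halving or
 * doubling, remember how many steps that took in 'shift', then store
 * 1e9 / normalized_rate as a 0.32 fixed-point fraction.
 * Assumption: ticks_per_sec is non-zero.
 */
static struct scale_sketch scale_from_hz(uint64_t ticks_per_sec)
{
    struct scale_sketch ts = { 0, 0 };
    uint64_t tps = ticks_per_sec;

    while (tps > 2 * NSEC_PER_SEC_SKETCH) {
        tps >>= 1;
        ts.shift--;
    }
    while (tps <= NSEC_PER_SEC_SKETCH) {
        tps <<= 1;
        ts.shift++;
    }
    /* tps > 1e9 here, so the quotient always fits in 32 bits. */
    ts.mul_frac = (uint32_t)((NSEC_PER_SEC_SKETCH << 32) / tps);
    return ts;
}

#ifdef TSC_DEMO_MAIN
int main(void)
{
    /* Illustrative inputs only: about 4.5 GHz and about 200 MHz. */
    struct scale_sketch fast = scale_from_hz(4500000000ULL);
    struct scale_sketch slow = scale_from_hz(200000000ULL);

    printf("4.5 GHz: shift=%d mul=%u\n", fast.shift, (unsigned)fast.mul_frac);
    printf("200 MHz: shift=%d mul=%u\n", slow.shift, (unsigned)slow.mul_frac);
    return 0;
}
#endif
```

In this sketch a rate around 4.5 GHz gives a shift of -2 and a rate around 200 MHz gives a shift of 3, the same shifts observed earlier in PV and in PVH/HVM mode respectively.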
UPDATE 2
I think I found it:
“d->arch.tsc_khz” is an unsigned 32-bit integer. The value expected by set_time_scale is a u64.
Since there is no cast from u32 to u64, the multiplication by 1000 (from kHz to Hz) is done in 32 bits and overflows.
With an explicit cast to u64 it should work.
Testing it. Going to take some hours.
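The overflow is easy to reproduce in isolation. This is a minimal sketch, not the Xen code: the function names are hypothetical, and the 4,491,596 kHz test frequency is an assumed illustrative value for a CPU running a bit above 4.29 GHz.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Buggy pattern: both operands are 32-bit, so the product wraps
 * modulo 2^32 before it is widened to 64 bits.
 */
static uint64_t khz_to_hz_wrapping(uint32_t khz)
{
    uint32_t hz32 = khz * 1000u;
    return hz32;
}

/* Fixed pattern: widen first, then multiply in 64 bits. */
static uint64_t khz_to_hz_widened(uint32_t khz)
{
    return (uint64_t)khz * 1000u;
}

#ifdef TSC_DEMO_MAIN
int main(void)
{
    uint32_t khz = 4491596u; /* assumed example frequency in kHz */

    printf("wrapping: %llu Hz\n", (unsigned long long)khz_to_hz_wrapping(khz));
    printf("widened:  %llu Hz\n", (unsigned long long)khz_to_hz_widened(khz));
    return 0;
}
#endif
```

For any frequency below 4,294,967 kHz both functions agree, which would explain why slower CPUs never hit the problem.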
UPDATE 3
I confirm that this is the root cause. I fixed it on my side, and everything seems to work as expected.
Now I need to make a clean patch and talk with the Xen developers to get it integrated.
UPDATE 4
The patch should now have been sent to the xen-devel mailing list.
Copy here:
From c1535eba0bba6fc1b91f975f434af0929d9d7c96 Mon Sep 17 00:00:00 2001
Message-Id: <c1535eba0bba6fc1b91f975f434af0929d9d7c96.1671298409.git.xen@neowutran.ovh>
From: Neowutran <xen@neowutran.ovh>
Date: Sat, 17 Dec 2022 17:17:03 +0100
Subject: [Patch v1] Bug fix - Integer overflow when cpu frequency > u32 max value.
xen/arch/x86/time.c: Bug fix - Integer overflow when cpu frequency > u32 max value.
What is was trying to do: I was trying to install QubesOS on my new computer
(AMD zen4 processor). Guest VM were unusably slow / unusable.
What is the issue: The cpu frequency reported is wrong for linux guest in HVM
and PVH mode, and it cause issue with the TSC clocksource (for example).
Why this patch solved my issue:
The root cause it that "d->arch.tsc_khz" is a unsigned integer storing
the cpu frequency in khz. It get multiplied by 1000, so if the cpu frequency
is over ~4,294 Mhz (u32 max value), then it overflow.
I am solving the issue by adding an explicit cast to u64 to avoid the overflow.
---
xen/arch/x86/time.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index b01acd390d..7c77ec8902 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -2585,7 +2585,7 @@ int tsc_set_info(struct domain *d,
     case TSC_MODE_ALWAYS_EMULATE:
         d->arch.vtsc_offset = get_s_time() - elapsed_nsec;
         d->arch.tsc_khz = gtsc_khz ?: cpu_khz;
-        set_time_scale(&d->arch.vtsc_to_ns, d->arch.tsc_khz * 1000);
+        set_time_scale(&d->arch.vtsc_to_ns, (u64)d->arch.tsc_khz * 1000);
         /*
          * In default mode use native TSC if the host has safe TSC and
--
2.38.1
Thanks for your continued work. Judging by the number of “hearts” on this thread, there are several other people interested in this as well. It would not be an exaggeration to say I look at this a couple of times a day to gauge the progress you and others have been making! Thanks again.
Long story short: which Ryzen generation is the highest that works perfectly (including its iGPU) with an up-to-date Qubes OS 4.1.1 (let’s imagine the user can install and update Qubes OS on a different PC)? 5***, 4*** or what, and how to select a Ryzen for this?
Is there any sense in buying a 6*** or 7*** series at this point if the user wants it to work almost out of the box on Qubes OS?
@balko this thread is about what needs to be done to be able to use Qubes OS with a Ryzen 7000 series.
I do not know the potential issues of previous generations. However, since at the moment the Xen hypervisor version used in the stable release does not support CPU family 25, Ryzen 7***, 6**** and 5**** should not work.
For the GPU passthrough:
On my old computer I have an RX580 that I can pass through to a Linux HVM for gaming.
I noticed what seems to be a bug in the Linux kernel PCI handling: the passthrough works with LTS kernel 5.4, but fails if I upgrade the kernel to 5.6.?+ (I can start the HVM, but when I try to activate the GPU it fails with an unhelpful error message).
On my new computer, I restored the Linux HVM. However, if I start it, it crashes with a kernel-related error / memory violation.
It is directly related to the GPU passthrough (if I do not do the PCI passthrough, the HVM starts correctly).
If I upgrade the kernel to a newer version, I can start the HVM but end up with the same kernel bug as on my old computer.
So there are at least 2 different issues.
One of the issues is a regression in the Linux kernel related to PCI handling; the regression was introduced around 5.6.X. This should be the easiest bug to find, since I can reduce the scope by upgrading to newer kernels until I find which specific version introduced the bug, and then try to find it in the commits / source code. But I expect it to be very time consuming, again (at the beginning of the process I could use the distribution archives to speed things up by not needing to compile everything).
For the second issue, I have no idea at the moment. Something related to the qemu version? Related to the Linux kernel used to launch qemu? A Xen dependency in the VM that is not of the correct version? A lot of testing is required to reduce the possibilities. (Try with GPU passthrough, without, with but without strict reset. Try all of the above but with a non-GPU PCI device. Try different kernel versions, since it is directly related to the Linux kernel version used.)
Update
For the second issue, it feels like it is related to the xen_blkfront and xen_blkback drivers in the Linux kernel. Maybe a given Xen hypervisor version requires the guest to have some specific version of the Linux kernel. Anyway, I won’t focus on this issue.
For the first issue, the kernel log indicates (on my zen4 computer, HVM kernel is 6.0.12):
Thanks a lot for the information, I’m just a bit overwhelmed with information about Ryzen on the forum (used Intel for Qubes OS for ages). But Ryzen, due to its performance, looks promising and tempting.
Will use Intel for some time more.
Thanks again, your work with Ryzen is very much appreciated.
To test a lot of kernels quickly (my gaming HVM is an Arch Linux system): Index of /packages/l/linux/
The mkinitcpio config file needs to be modified to compress using the gzip algorithm instead of zstd (old kernels don’t support it) mkinitcpio - ArchWiki
For bug n°2 some kernel information:
5.9.14 → Bug
5.10.1 → No bug
Something changed between those 2 Linux kernel versions, and Xen hypervisor 4.17+ doesn’t like it when the guest uses a kernel without this unknown modification.
For bug n°1, some kernel information:
5.4.X → Work
5.6.1 → Work
5.6.15 → Work
5.7.1 → Don’t Work
5.10.1 → Don’t work
Kernel log that works:
[ 0.000000] Linux version 5.4.215-1-lts54 (linux-lts54@archlinux) (gcc version 12.2.0 (GCC)) #1 SMP Sun, 02 Oct 2022 14:41:08 +0000
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-linux-lts54 root=/dev/xvda3 rw console=tty0 console=hvc0 swiotlb=8192 noresume clocksource=tsc xen_scrub_pages=0
[ 0.000000] KERNEL supported cpus:
[ 0.000000] Intel GenuineIntel
[ 0.000000] AMD AuthenticAMD
[ 0.000000] Hygon HygonGenuine
[ 0.000000] Centaur CentaurHauls
[ 0.000000] zhaoxin Shanghai
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
--
[ 0.015188] 4 disabled
[ 0.015189] 5 disabled
[ 0.015189] 6 disabled
[ 0.015189] 7 disabled
[ 0.015190] TOM2: 0000000840000000 aka 33792M
[ 0.016127] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
[ 0.016346] last_pfn = 0xdffff max_arch_pfn = 0x400000000
[ 0.018747] found SMP MP-table at [mem 0x000f5a40-0x000f5a4f]
[ 0.018814] check: Scanning 1 areas for low memory corruption
[ 0.018947] Using GB pages for direct mapping
[ 0.019069] RAMDISK: [mem 0x36eb5000-0x37751fff]
[ 0.019073] ACPI: Early table checksum verification disabled
[ 0.019077] ACPI: RSDP 0x00000000000F5990 000024 (v02 Xen )
[ 0.019081] ACPI: XSDT 0x00000000FC00A660 000054 (v01 Xen HVM 00000000 HVML 00000000)
[ 0.019087] ACPI: FACP 0x00000000FC00A370 0000F4 (v04 Xen HVM 00000000 HVML 00000000)
[ 0.019092] ACPI: DSDT 0x00000000FC001040 0092A3 (v02 Xen HVM 00000000 INTL 20190509)
[ 0.019095] ACPI: FACS 0x00000000FC001000 000040
[ 0.019097] ACPI: FACS 0x00000000FC001000 000040
[ 0.019099] ACPI: APIC 0x00000000FC00A470 000080 (v02 Xen HVM 00000000 HVML 00000000)
[ 0.019101] ACPI: HPET 0x00000000FC00A570 000038 (v01 Xen HVM 00000000 HVML 00000000)
[ 0.019103] ACPI: WAET 0x00000000FC00A5B0 000028 (v01 Xen HVM 00000000 HVML 00000000)
--
[ 0.245774] Last level dTLB entries: 4KB 1536, 2MB 1536, 4MB 768, 1GB 0
[ 0.245777] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[ 0.245779] Spectre V2 : Mitigation: Retpolines
[ 0.245779] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[ 0.245781] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[ 0.245782] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[ 0.246069] Freeing SMP alternatives memory: 32K
[ 0.247561] clocksource: xen: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.247564] Xen: using vcpuop timer interface
[ 0.247571] installing Xen timer for CPU 0
[ 0.247622] smpboot: CPU0: AMD Ryzen 7 1700 Eight-Core Processor (family: 0x17, model: 0x1, stepping: 0x1)
[ 0.247648] cpu 0 spinlock event irq 53
[ 0.247777] Performance Events: PMU not available due to virtualization, using software events only.
[ 0.247816] rcu: Hierarchical SRCU implementation.
[ 0.248226] NMI watchdog: Perf NMI watchdog permanently disabled
[ 0.248311] smp: Bringing up secondary CPUs ...
[ 0.248420] installing Xen timer for CPU 1
[ 0.248475] x86: Booting SMP configuration:
[ 0.248476] .... node #0, CPUs: #1
[ 0.251352] cpu 1 spinlock event irq 59
[ 0.251352] installing Xen timer for CPU 2
--
[ 0.556711] fbcon: Deferring console take-over
[ 0.556712] fb0: EFI VGA frame buffer device
[ 0.556794] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
[ 0.556839] ACPI: Power Button [PWRF]
[ 0.556884] input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
[ 0.556900] ACPI: Sleep Button [SLPF]
[ 0.569601] xen: --> pirq=22 -> irq=24 (gsi=24)
[ 0.569976] xen:grant_table: Grant tables using version 1 layout
[ 0.570025] Grant table initialized
[ 0.571348] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[ 0.572361] AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel <jroedel@suse.de>
[ 0.572362] AMD-Vi: AMD IOMMUv2 functionality not available on this system
[ 0.573609] usbcore: registered new interface driver usbserial_generic
[ 0.573615] usbserial: USB Serial support registered for generic
[ 0.574217] rtc_cmos 00:02: registered as rtc0
[ 0.574237] rtc_cmos 00:02: alarms up to one day, 114 bytes nvram, hpet irqs
[ 0.575898] ledtrig-cpu: registered to indicate activity on CPUs
[ 0.575973] drop_monitor: Initializing network drop monitor service
[ 0.576260] NET: Registered protocol family 10
[ 0.584973] Segment Routing with IPv6
[ 0.585010] NET: Registered protocol family 17
[ 0.587102] RAS: Correctable Errors collector initialized.
--
[ 2.902602] AES CTR mode by8 optimization enabled
[ 2.930571] xen: --> pirq=51 -> irq=45 (gsi=45)
[ 2.930858] snd_hda_intel 0000:00:07.0: Force to non-snoop mode
[ 2.962935] input: HDA ATI HDMI HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:07.0/sound/card0/input7
[ 2.966432] input: HDA ATI HDMI HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:07.0/sound/card0/input8
[ 2.966477] input: HDA ATI HDMI HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:07.0/sound/card0/input9
[ 2.966509] input: HDA ATI HDMI HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:07.0/sound/card0/input10
[ 2.966539] input: HDA ATI HDMI HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:07.0/sound/card0/input11
[ 2.966568] input: HDA ATI HDMI HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:07.0/sound/card0/input12
[ 3.319131] Decoding supported only on Scalable MCA processors.
[ 3.361707] [drm] amdgpu kernel modesetting enabled.
[ 3.361868] CRAT table not found
[ 3.361872] Virtual CRAT table created for CPU
[ 3.361873] Parsing CRAT table with 1 nodes
[ 3.361875] Creating topology SYSFS entries
[ 3.361891] Topology: Add CPU node
[ 3.361891] Finished initializing topology
[ 3.361961] amdgpu 0000:00:06.0: remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff
[ 3.361962] amdgpu 0000:00:06.0: remove_conflicting_pci_framebuffers: bar 2: 0xf2000000 -> 0xf21fffff
[ 3.361963] amdgpu 0000:00:06.0: remove_conflicting_pci_framebuffers: bar 5: 0xf2200000 -> 0xf223ffff
[ 3.363082] xen: --> pirq=50 -> irq=40 (gsi=40)
[ 3.363947] [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67DF 0x1043:0x0525 0xE7).
[ 3.363957] [drm] register mmio base: 0xF2200000
[ 3.363958] [drm] register mmio size: 262144
[ 3.364313] [drm] add ip block number 0 <vi_common>
[ 3.364315] [drm] add ip block number 1 <gmc_v8_0>
[ 3.364315] [drm] add ip block number 2 <tonga_ih>
[ 3.364316] [drm] add ip block number 3 <gfx_v8_0>
[ 3.364317] [drm] add ip block number 4 <sdma_v3_0>
[ 3.364318] [drm] add ip block number 5 <powerplay>
[ 3.364319] [drm] add ip block number 6 <dm>
[ 3.364320] [drm] add ip block number 7 <uvd_v6_0>
[ 3.364321] [drm] add ip block number 8 <vce_v3_0>
[ 3.366302] amdgpu 0000:00:06.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[ 3.496838] ATOM BIOS: 115-D009PI2-101
[ 3.496882] [drm] UVD is enabled in VM mode
[ 3.496883] [drm] UVD ENC is enabled in VM mode
[ 3.496886] [drm] VCE enabled in VM mode
[ 3.496909] [drm] GPU posting now...
[ 3.633089] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[ 3.640931] amdgpu 0000:00:06.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
[ 3.640933] amdgpu 0000:00:06.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[ 3.640941] [drm] Detected VRAM RAM=4096M, BAR=256M
[ 3.640942] [drm] RAM width 256bits GDDR5
[ 3.640976] [drm] amdgpu: 4096M of VRAM memory ready
[ 3.640980] [drm] amdgpu: 4096M of GTT memory ready.
[ 3.640998] [drm] GART: num cpu pages 65536, num gpu pages 65536
[ 3.642682] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 3.650499] [drm] Chained IB support enabled!
[ 3.666488] amdgpu: [powerplay] hwmgr_sw_init smu backed is polaris10_smu
[ 3.687069] [drm] Found UVD firmware Version: 1.130 Family ID: 16
[ 3.697181] [drm] Found VCE firmware Version: 53.26 Binary ID: 3
[ 3.764057] [drm] DM_PPLIB: values for Engine clock
[ 3.764058] [drm] DM_PPLIB: 300000
[ 3.764059] [drm] DM_PPLIB: 600000
[ 3.764059] [drm] DM_PPLIB: 900000
[ 3.764060] [drm] DM_PPLIB: 1162000
[ 3.764060] [drm] DM_PPLIB: 1233000
[ 3.764060] [drm] DM_PPLIB: 1275000
[ 3.764061] [drm] DM_PPLIB: 1319000
--
[ 3.764062] [drm] DM_PPLIB: level : 8
[ 3.764063] [drm] DM_PPLIB: values for Memory clock
[ 3.764064] [drm] DM_PPLIB: 300000
[ 3.764064] [drm] DM_PPLIB: 1000000
[ 3.764064] [drm] DM_PPLIB: 1750000
[ 3.764065] [drm] DM_PPLIB: Validation clocks:
[ 3.764065] [drm] DM_PPLIB: engine_max_clock: 136000
[ 3.764065] [drm] DM_PPLIB: memory_max_clock: 175000
[ 3.764066] [drm] DM_PPLIB: level : 8
[ 3.764643] [drm] Display Core initialized with v3.2.48!
[ 3.764760] snd_hda_intel 0000:00:07.0: bound 0000:00:06.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[ 3.766124] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 3.766125] [drm] Driver supports precise vblank timestamp query.
[ 3.792520] [drm] UVD and UVD ENC initialized successfully.
[ 3.893472] [drm] VCE initialized successfully.
[ 3.894823] kfd kfd: Allocated 3969056 bytes on gart
[ 3.895542] Virtual CRAT table created for GPU
[ 3.895543] Parsing CRAT table with 1 nodes
[ 3.895550] Creating topology SYSFS entries
[ 3.896148] Topology: Add dGPU node [0x67df:0x1002]
[ 3.896154] kfd kfd: added device 1002:67df
[ 3.896214] [drm] Cannot find any crtc or sizes
[ 3.900312] [drm] Initialized amdgpu 3.35.0 20150101 for 0000:00:06.0 on minor 1
[ 3.903678] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input6
[ 3.935772] Decoding supported only on Scalable MCA processors.
[ 4.075597] Decoding supported only on Scalable MCA processors.
[ 4.165632] Decoding supported only on Scalable MCA processors.
Update
(Lot more testing required)
An interesting warning line I found during the installation of different kernel versions.
On kernels that don’t work with GPU passthrough, this warning appears during the installation: ==> WARNING: Possibly missing firmware for module: xhci_pci
Lot more testing required, but that seems to be something interesting
update 2
On my old computer, I updated all the dependencies of my Linux gaming HVM.
Installed AUR (en) - mkinitcpio-firmware to get rid of all the warnings.
Upgraded to the latest 5.4 kernel too.
Now GPU passthrough doesn’t work on any kernel version.
Error messages are:
T_T now time to downgrade things randomly until it breaks in a different way. Now at least I know that this particular issue is related to neither Xen nor the kernel.
Compiling old kernels is quite annoying. It doesn’t work on Arch Linux or any system with recent packages.
So I created a debian-11 qube, then compiled and installed pacman from source.
Then cloned this AUR package AUR (en) - linux-lts54 and adjusted the PKGBUILD file to compile the kernel version I want.
Starting to compile kernel 5.6.19.
The goal is to find the exact kernel version that introduced the bug, then switch to git bisect to find the exact problematic commit.
Going to take many, many hours. I will update this post when I start to have interesting results from my compilations.
5.6.19: Work
5.7: Don’t work
5.7-rc1: Not bootable
Compiled from tarball:
5.6.19: Work
5.7: Don’t work
Compiled from git:
5.7-rc1: Not bootable
5.7-rc7: Don’t work
5.7-rc2: Not bootable
5.7-rc4: Not bootable
5.7-rc5: Not bootable
Not bootable means: stack overflow in the kernel directly. Probably related to Xen things.
With my luck, the commit I am looking for is inside the range of release candidates I can’t use due to the stack overflow. So I would first find which commit fixes the kernel stack overflow, then backport it to continue searching for my GPU passthrough issue.
And if Murphy decides to be extra mean, both commits are related, just for extra suffering while debugging.
I hate computer.
Hahaha. Twenty years ago, in a similar situation, I stopped actively working with computers for a living, “realizing” that computers are the greatest hoax of the 20th century.
Thanks @fjdh, but in that case I am looking for the commit that changed the behavior regarding the GPU passthrough. I know it is between 5.6.19 and 5.7-rc5, so I need to search within that range.
@enmus ahah, I am not at this point yet, but I do understand. This xkcd is great: xkcd: Shouldn't Be Hard
For the kernel stack overflow, after reading some commits, it seems to be a bug related to multi-CPU support (vCPUs in my case); configuring my HVM to use only 1 vCPU seems to be a valid workaround.
5.7-rc5: Don’t work
5.7-rc1: Don’t work
So the regression was introduced with rc1. Now the bisect can start.
1cd377baa91844b9f87a2b72eabf7ff783946b5e: Different error, related to graphics (xorg refuses to start, but that is not the root issue. I can execute commands inside the VM and the error message doesn’t show, but I can’t launch anything related to X, and the qubes daemon doesn’t work)
2bcb4fd6ba9152c699d873ffa4593d5a4fe1f8d4: Work
0e1b4271078787d3408d3dd314d80b290578cc00: Work
9b06860d7c1f1f4cb7d70f92e47dfa4a91bd5007: Don’t work
So the bug was introduced between 08 April 2020 and 09 April 2020.
aa317d3351dee7cb0b27db808af0cd2340dcbaef: Work
9bb50ed7470944238ec8e30a94ef096caf9056ee: Don’t work
8 commits left
Good idea, will do that once I find the problematic commit.
(I messed up how I tried to bisect, so it is going to take more time to identify the faulty commit)
9bb715260ed4cef6948cb2e05cf670462367da71: don’t work
34183ddd13dbfa859c4b68d16a30aad2cce72b11: don’t work
5c8db3eb381745c010ba746373a279e92502bdc8: don’t work
4646de87d32526ee87b46c2e0130413367fb5362: work
f14a9532ee30c68a56ff502c382860f674cc180c: don’t work
I am having a lot of difficulty with the bisect process.
The issue is that there is a big range of commits that result in a stack overflow or other unrecoverable kernel errors.
However, the scope of potential commits is greatly reduced.
With the latest tests:
6afe6929964bca6847986d0507a555a041f07753: don’t work
ff36e78fdb251b9fa65028554689806961e011eb: work
05f3a6f5e478f622f548314471382df5b0f9dbf8: work
83794ee6c13b41c7db86ccfcaa20dc360b08fdb6: don’t work
It is a merge of many commits, and due to some git magic that I do not understand, I am unable to bisect inside this merge.
So instead I am cherry-picking the commits of this merge, recompiling after every few cherry-picks, testing, etc.
There is definitely a lot I do not understand about how git works. And I really don’t understand why “git bisect” doesn’t give an acceptable result.
cherry picking until db70e2c13983926d8d657db3e740264b75ad20a4 : work
cherry picking until c16904b0f305c5f6bc31de118d4b1e60a5da5408: don’t work
The specific problematic commit of the merge seems to be 4fdda2e66de0b7d37aa27af3c1bbe25ecb2d5408
The problematic patch is
@@ -170,10 +170,16 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
         }
         if (amdgpu_device_supports_boco(dev) &&
-            (amdgpu_runtime_pm != 0)) /* enable runpm by default */
+            (amdgpu_runtime_pm != 0)) /* enable runpm by default for boco */
                 adev->runpm = true;
         else if (amdgpu_device_supports_baco(dev) &&
-                 (amdgpu_runtime_pm > 0)) /* enable runpm if runpm=1 */
+                 (amdgpu_runtime_pm != 0) &&
+                 (adev->asic_type >= CHIP_TOPAZ) &&
+                 (adev->asic_type != CHIP_VEGA20) &&
+                 (adev->asic_type != CHIP_ARCTURUS)) /* enable runpm on VI+ */
+                adev->runpm = true;
+        else if (amdgpu_device_supports_baco(dev) &&
+                 (amdgpu_runtime_pm > 0)) /* enable runpm if runpm=1 on CI */
                 adev->runpm = true;
         /* Call ACPI methods: require modeset init
To fix my issue, I changed the integer comparison from amdgpu_runtime_pm != 0 back to amdgpu_runtime_pm > 0.
(reported to the amd driver issue tracker)
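To make the effect of that one comparison easier to see, here is a reduced model of the decision from the patch above. It is a sketch under assumptions, not driver code: the enum and function names are hypothetical, the VEGA20 and ARCTURUS exclusions are left out, and I am assuming that the parameter is left at -1 (auto) when nothing is set and that the passed-through RX580 is seen as BACO-capable but not BOCO-capable.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, heavily reduced ASIC list; only the ordering matters. */
enum asic_sketch { ASIC_CI_SKETCH, ASIC_TOPAZ_SKETCH, ASIC_POLARIS10_SKETCH };

/*
 * runtime_pm: -1 = auto (assumed default), 0 = disabled, 1 = force enable.
 * Model of the logic after the problematic commit.
 */
static bool runpm_after_commit(int runtime_pm, bool boco, bool baco,
                               enum asic_sketch asic)
{
    if (boco && runtime_pm != 0)
        return true;
    if (baco && runtime_pm != 0 && asic >= ASIC_TOPAZ_SKETCH)
        return true; /* new: enabled by default on VI and later */
    if (baco && runtime_pm > 0)
        return true;
    return false;
}

/* Same model with the second comparison changed back to "> 0". */
static bool runpm_with_revert(int runtime_pm, bool boco, bool baco,
                              enum asic_sketch asic)
{
    if (boco && runtime_pm != 0)
        return true;
    if (baco && runtime_pm > 0 && asic >= ASIC_TOPAZ_SKETCH)
        return true;
    if (baco && runtime_pm > 0)
        return true;
    return false;
}

#ifdef TSC_DEMO_MAIN
int main(void)
{
    /* Assumed situation: default parameter, BACO only, Polaris10. */
    printf("after commit: %d\n",
           runpm_after_commit(-1, false, true, ASIC_POLARIS10_SKETCH));
    printf("with revert:  %d\n",
           runpm_with_revert(-1, false, true, ASIC_POLARIS10_SKETCH));
    return 0;
}
#endif
```

Under those assumptions, the commit turns runtime power management on for the card by default, and the revert makes it opt-in again. If that reading is right, booting the guest with the runpm parameter set to 0 might avoid the problem without a rebuild, but I have not verified that here.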
Now trying to apply this modification to the 6.1 kernel