Good idea, I will do that once I have found the problematic commit.
(I messed up my first attempt at bisecting, so it is going to take more time to identify the faulty commit.)
9bb715260ed4cef6948cb2e05cf670462367da71: doesn’t work
34183ddd13dbfa859c4b68d16a30aad2cce72b11: doesn’t work
5c8db3eb381745c010ba746373a279e92502bdc8: doesn’t work
4646de87d32526ee87b46c2e0130413367fb5362: works
f14a9532ee30c68a56ff502c382860f674cc180c: doesn’t work
I am having a lot of difficulty with the bisect process.
The issue is that there is a big range of commits that result in a stack overflow or some other unrecoverable kernel error, so they cannot be tested.
However, the scope of potential commits is greatly reduced.
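For anyone hitting the same wall: git has "git bisect skip" for commits that cannot be tested at all. Below is a minimal Python sketch of the idea, not git's actual algorithm. It assumes a linear history, and the names bisect_with_skips and verdict are made up for the illustration. It shows why a block of untestable commits next to the culprit leaves a range of suspects instead of a single commit.

```python
from typing import Callable, List, Sequence


def bisect_with_skips(commits: Sequence[str],
                      test: Callable[[str], str]) -> List[str]:
    """Narrow down the first bad commit when some commits are untestable.

    commits is ordered oldest -> newest; commits[0] is known good and
    commits[-1] is known bad. test(commit) returns "good", "bad" or "skip".
    Returns every commit that could still be the first bad one.
    """
    good, bad = 0, len(commits) - 1
    skipped = set()
    while True:
        # Commits strictly between the newest good and the oldest bad
        # that have not been marked untestable yet.
        candidates = [i for i in range(good + 1, bad) if i not in skipped]
        if not candidates:
            break
        mid = (good + bad) // 2
        # Test the remaining candidate closest to the midpoint.
        pick = min(candidates, key=lambda i: abs(i - mid))
        verdict = test(commits[pick])
        if verdict == "good":
            good = pick
        elif verdict == "bad":
            bad = pick
        else:
            skipped.add(pick)
    # The first bad commit lies after the last good one, up to and
    # including the oldest commit known to be bad.
    return list(commits[good + 1:bad + 1])


if __name__ == "__main__":
    history = [f"c{i}" for i in range(10)]

    def verdict(commit: str) -> str:
        n = int(commit[1:])
        if n in (4, 5):          # hypothetical commits that crash at boot
            return "skip"
        return "bad" if n >= 6 else "good"

    print(bisect_with_skips(history, verdict))
```

When the skipped commits sit right before the first bad one, the result is the whole skipped block plus that commit, which is roughly the situation described above.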
With the latest tests:
6afe6929964bca6847986d0507a555a041f07753: doesn’t work
ff36e78fdb251b9fa65028554689806961e011eb: works
05f3a6f5e478f622f548314471382df5b0f9dbf8: works
83794ee6c13b41c7db86ccfcaa20dc360b08fdb6: doesn’t work
That last commit is a merge of many commits and, due to some git magic that I do not understand, I am unable to bisect inside this merge.
So instead I am cherry-picking the commits of this merge one after another, recompiling and testing after every few cherry-picks.
There is definitely a lot I do not understand about how git works, and I really don’t understand why “git bisect” doesn’t give an acceptable result.
Cherry-picking up to db70e2c13983926d8d657db3e740264b75ad20a4: works
Cherry-picking up to c16904b0f305c5f6bc31de118d4b1e60a5da5408: doesn’t work
The specific problematic commit of the merge seems to be 4fdda2e66de0b7d37aa27af3c1bbe25ecb2d5408
The problematic patch is
@@ -170,10 +170,16 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
     }
     if (amdgpu_device_supports_boco(dev) &&
-        (amdgpu_runtime_pm != 0)) /* enable runpm by default */
+        (amdgpu_runtime_pm != 0)) /* enable runpm by default for boco */
         adev->runpm = true;
     else if (amdgpu_device_supports_baco(dev) &&
-             (amdgpu_runtime_pm > 0)) /* enable runpm if runpm=1 */
+             (amdgpu_runtime_pm != 0) &&
+             (adev->asic_type >= CHIP_TOPAZ) &&
+             (adev->asic_type != CHIP_VEGA20) &&
+             (adev->asic_type != CHIP_ARCTURUS)) /* enable runpm on VI+ */
+        adev->runpm = true;
+    else if (amdgpu_device_supports_baco(dev) &&
+             (amdgpu_runtime_pm > 0)) /* enable runpm if runpm=1 on CI */
         adev->runpm = true;
     /* Call ACPI methods: require modeset init
To fix my issue, I changed the integer comparison from amdgpu_runtime_pm != 0 back to amdgpu_runtime_pm > 0.
(I reported this to the AMD driver issue tracker.)
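To make the effect of that patch easier to see, here is a small Python model of the condition before and after the change. It is only a sketch under assumptions that are not stated in this thread: that amdgpu_runtime_pm defaults to -1 ("auto"), that the RX 580 is a Polaris 10 part that reports BACO but not BOCO support, and that the abbreviated Chip enum below keeps the same relative order as the kernel's asic type enum. The function and enum names are made up for the illustration.

```python
from enum import IntEnum


class Chip(IntEnum):
    # Abbreviated stand-in for the kernel's asic type enum. Only the
    # relative order matters here, and that order is an assumption.
    BONAIRE = 0     # CI generation, older than TOPAZ
    TOPAZ = 1       # first VI chip
    POLARIS10 = 2   # assumed to be what the RX 580 reports
    VEGA20 = 3
    ARCTURUS = 4


def runpm_before(boco: bool, baco: bool, runtime_pm: int, chip: Chip) -> bool:
    """Condition before the patch: BACO cards need an explicit runpm > 0."""
    if boco and runtime_pm != 0:
        return True
    if baco and runtime_pm > 0:
        return True
    return False


def runpm_after(boco: bool, baco: bool, runtime_pm: int, chip: Chip) -> bool:
    """Condition after the patch: BACO cards from VI onward also get
    runtime PM when the parameter is left at its 'auto' value."""
    if boco and runtime_pm != 0:
        return True
    if (baco and runtime_pm != 0 and chip >= Chip.TOPAZ
            and chip not in (Chip.VEGA20, Chip.ARCTURUS)):
        return True
    if baco and runtime_pm > 0:
        return True
    return False


if __name__ == "__main__":
    AUTO = -1  # assumed default value of amdgpu_runtime_pm
    for label, fn in (("before", runpm_before), ("after", runpm_after)):
        enabled = fn(False, True, AUTO, Chip.POLARIS10)
        print(f"{label}: runpm enabled = {enabled}")
```

Under those assumptions, a BACO-only Polaris card goes from runtime PM off to runtime PM on with default settings, which would explain why reverting the comparison to > 0 helps. If the model is right, setting the module parameter to 0 (amdgpu.runpm=0) should take the same branch without patching the kernel, but I have not verified that on this hardware.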
Now trying to apply this modification to the 6.1 kernel.
I created a Qubes R4.2 ISO with all the needed patches.
Installed it on both my old computer and my new computer.
Configured everything the same way.
Used the exact same GPU.
Old computer: it works
New computer: it doesn’t work (same error as before).
So I have ruled out all driver and software issues. What is left is a hardware issue on my new computer (everything except the GPU; the issue is most likely related to the motherboard).
I guess I am only left with:
Randomly modifying the BIOS configuration (highly unlikely to change anything)
Being a bit more annoying with ASUS support so that they fix their shitty hardware
RMAing the motherboard and using another brand
Or if anyone has any idea, I will take anything
Update: I got an answer from ASUS. Basically, they escalated to AMD, they know there is a bug in the Linux kernel, and apparently they got the middle finger when asking about a fix. So it won’t be fixed.
Maybe they will make some effort when they release their server CPUs, but either I find a workaround myself, or I resell the computer, or I turn it into a compilation machine (since it has proven to be quite impressive at compiling).
I was a bit frustrated, but I have not yet tried a Windows VM on the new computer, which could help to really rule out driver issues. Since AMD knows there are bugs in the Linux kernel, maybe they fixed some things on the Windows side.
Update 1: The Windows VM doesn’t work either.
Update 2: My error seems linked to the BIOS parameter “Above 4G” (at least it crashes in a different way).
I swear, this BIOS is Satan’s work. Everything breaks in unexpected ways, setting some options requires hard-resetting the BIOS, and when you set a parameter, either it is not respected, or it breaks something completely unrelated, or, from time to time, it does what it is supposed to do. It takes 5 minutes after each reboot just for the BIOS page to show up. I am going crazy.
By modifying “SR-IOV”, “Above 4G”, and “Resizable BAR” in the BIOS, I can reach different errors: some buffer overflows in the Linux kernel and other things. It is going to take a lot of time to understand what is going on. The host crashes if using less than X amount of RAM, etc.
New try.
Based on what I have tested, I am left only with a motherboard / BIOS issue. I cannot see how it could be anything else.
However, there are too many possibilities to have any credible chance of success by blindly trying BIOS parameters.
With my latest configuration, everything seems to work until the guest VM tries to load the driver (Windows VM or Linux VM, it doesn’t matter).
Maybe the GPU is not in a state where it can receive instructions?
So let’s compare the state of the GPU just before doing the GPU passthrough, on my new computer and on my old computer.
To do that I will list the values of all files in /sys/bus/pci/devices/ID_OF_THE_GPU.0/ and /sys/bus/pci/devices/ID_OF_THE_GPU.1/
(If anyone has a better way / tool to display the information of a GPU PCI device without loading its driver, don’t hesitate to tell me.)
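One way to make that comparison less tedious is to script it. The sketch below is only an illustration of the approach described above; the function names snapshot and diff_snapshots are made up. It reads every regular file in a device directory, replaces binary content such as config with its md5 (same idea as using md5sum by hand), and skips the resourceN files on the assumption that it is safer not to touch the BARs from a script. "lspci -vv -s ID_OF_THE_GPU" is another option that decodes the config space without the GPU driver being loaded.

```python
import hashlib
import re
import sys
from pathlib import Path
from typing import Dict, Optional, Tuple


def snapshot(device_dir: str) -> Dict[str, str]:
    """Return {file name: value} for one sysfs PCI device directory."""
    values: Dict[str, str] = {}
    for path in sorted(Path(device_dir).iterdir()):
        # Skip directories, links to directories, and the BAR resource files.
        if not path.is_file() or re.fullmatch(r"resource\d+(_wc)?", path.name):
            continue
        try:
            raw = path.read_bytes()
        except OSError as err:
            # Write-only or otherwise unreadable attributes (e.g. "remove").
            values[path.name] = f"<unreadable: {err.strerror}>"
            continue
        try:
            if b"\0" in raw:
                raise UnicodeDecodeError("ascii", raw, 0, 1, "binary data")
            values[path.name] = raw.decode("ascii").strip()
        except UnicodeDecodeError:
            # Binary attribute: store a digest instead of the raw bytes.
            values[path.name] = "md5:" + hashlib.md5(raw).hexdigest()
    return values


def diff_snapshots(old: Dict[str, str], new: Dict[str, str]
                   ) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return only the entries whose value differs between two snapshots."""
    return {name: (old.get(name), new.get(name))
            for name in sorted(old.keys() | new.keys())
            if old.get(name) != new.get(name)}


if __name__ == "__main__":
    # Usage: python3 snapshot.py /sys/bus/pci/devices/ID_OF_THE_GPU.0
    for device in sys.argv[1:]:
        for name, value in snapshot(device).items():
            print(f"{name}: {value}")
```

Running it on both machines and feeding the two results to diff_snapshots leaves only the attributes that actually differ. Note that reading config without root only returns the first part of the config space, so the md5 is only comparable if both dumps are taken with the same privileges.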
New computer, GPU itself (.0):
aer_dev_correctable: every value is 0, TOTAL_ERR_COR 0
aer_dev_fatal: same
aer_dev_nonfatal: same
ari_enabled: 1
boot_vga: 0
broken_parity_status: 0
class: 0x030000
config: (binary data, so giving the md5sum instead: 31fcb27e49505711aae05bfdedc1b4ea)
So you believe it could be an issue with your specific PC? I could try testing the .iso on my Zephyrus; it may help narrow down the issue if you had a few more testers.
Edit: oh right, the whole point is getting passthrough working.
If I were you, I’d register an account on the level1techs forum; there are lots of people there who are interested in (and experienced with) passthrough.
I was able to pass through an Nvidia 1070 on the new computer.
(The NVIDIA driver seems not to like pci=nomsi; if I remove it, everything works correctly with NVIDIA. Setting NVreg_EnableMSI=0 does not seem to be necessary.)
So it is very strange that I am unable to pass through the RX 580.
I am going to reboot a few times to confirm that I can reliably pass through the Nvidia 1070.
@Cpotts
I will link you an R4.2 ISO later this week.
Successful GPU passthrough of the RX 580 on my new computer!!!
The last step (I don’t know if the other ones were required) was to … downgrade the Linux kernel back to 5.4 LTS.
So there is another issue in recent versions of the Linux kernel that I will need to bisect.
So now I will try to remove as many of my modifications as I can, to check whether only the kernel downgrade is required (from my previous tests, I expect that it requires more than just the kernel downgrade).
Currently the passthrough conditions are:
Using an old kernel (currently 5.9)
Not passing through the audio part of the GPU
In the BIOS: “Resizable BAR support” must be disabled
In the BIOS: “CSM Support” must be disabled
Something interesting: on my old computer, I needed to use the boot parameter pci=nomsi to pass through the RX 580 on kernels >= 5.7. This does not seem to be required on the new computer, where it works fine up to 5.10.
I now need to find the boot parameter required to use a recent kernel, and bisect the change in the amdgpu driver introduced for kernel 5.10.
Also, does anyone have a theory on why I can’t pass through the audio part of the GPU, but only on my new computer?
Wow, you have enabled an entirely new tier of hardware to work with Qubes OS. Bravo! It just goes to show how dedicated this community is to Qubes and security.
Congrats.
Is this on Xen 4.17/18? So the 1070 works and the RX 580 doesn’t, on kernel 5.10.x or any newer kernel? Again, I’d ask around on the L1T forum; specifically, the user gnif did a lot of work on getting passthrough to work in 2020/2021. He might have an idea whether there were changes relating to this in that interval.