Ryzen 7000 serie

0339eb95403fb4664219be344a9399a3fdf1fae1: don’t work

So only 2 possible commits remain. The most probable one seems to be this rewrite of vdpa: Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/gi… · torvalds/linux@9bb7152 · GitHub
Will confirm tomorrow. But if it is this commit, it is going to be hard to find the fix without understanding how vdpa works.
This line could be interesting, noting for later: Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/gi… · torvalds/linux@9bb7152 · GitHub

You could probably write to the commit author; maybe they will be willing to help.

Good idea, will do that once I have found the problematic commit.

(I messed up the way I bisected, so it is going to take more time to identify the faulty commit.)

9bb715260ed4cef6948cb2e05cf670462367da71: don’t work
34183ddd13dbfa859c4b68d16a30aad2cce72b11: don’t work
5c8db3eb381745c010ba746373a279e92502bdc8: don’t work
4646de87d32526ee87b46c2e0130413367fb5362: work
f14a9532ee30c68a56ff502c382860f674cc180c: don’t work

I am having a lot of difficulty with the bisect process.
The issue is that a large range of commits results in a stack overflow or another unrecoverable kernel error.
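
Commits that cannot boot far enough to be tested do not have to be judged as working or not: git bisect accepts a third verdict, skip, next to good and bad. A minimal Python sketch that turns notes in the "hash: verdict" format used in this thread into git bisect commands. The "crash" verdict is a label assumed here for unbootable builds; it is not used in the notes of this thread.

```python
# Map the verdicts used in the notes to git bisect subcommands.
# "crash" is a hypothetical label for builds that die before the GPU test.
VERBS = {"work": "good", "don't work": "bad", "crash": "skip"}


def bisect_commands(notes):
    """Translate 'hash: verdict' lines into a list of git bisect commands."""
    commands = ["git bisect start"]
    for line in notes.splitlines():
        line = line.strip()
        if not line:
            continue
        commit, _, verdict = line.partition(":")
        # Normalize the typographic apostrophe produced by the forum.
        verdict = verdict.strip().replace("\u2019", "'")
        commands.append("git bisect %s %s" % (VERBS[verdict], commit.strip()))
    return commands


if __name__ == "__main__":
    example = """
    6afe6929964bca6847986d0507a555a041f07753: don't work
    ff36e78fdb251b9fa65028554689806961e011eb: work
    """
    print("\n".join(bisect_commands(example)))
```

The printed commands can be pasted into a shell inside the kernel tree. Skipped commits make the final answer less precise (git may report a range of candidates instead of a single commit), so this is only worth it when the unbootable range is large.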

However, the range of candidate commits is greatly reduced.
With the latest tests:

6afe6929964bca6847986d0507a555a041f07753: don’t work
ff36e78fdb251b9fa65028554689806961e011eb: work
05f3a6f5e478f622f548314471382df5b0f9dbf8: work
83794ee6c13b41c7db86ccfcaa20dc360b08fdb6: don’t work

From the git history, it is very likely to be this commit: https://github.com/torvalds/linux/commit/83794ee6c13b41c7db86ccfcaa20dc360b08fdb6

Will try to ask for help in the drm issue.
And also try to compile kernel 6.1 with parts of this commit removed.

git bisect visualize is a nice tool.
I got a bit lost with merges that add a lot of commits in the past.
It helped me understand what is going on.

4825b61a3d39eceef7db723808103aa60fc24520: work
a2ae604da74dcf9ae674d3c03efad80574952800: don’t work

I confirm that the issue is this commit

It is a merge of many commits and, due to some git magic that I do not understand, I am unable to bisect inside this merge.
So instead I am cherry-picking the commits of this merge, recompiling after every few cherry-picks, testing, and so on.

There is definitely a lot I do not understand about how git works. And I really don’t understand why “git bisect” doesn’t give an acceptable result.

cherry picking until db70e2c13983926d8d657db3e740264b75ad20a4 : work

cherry picking until c16904b0f305c5f6bc31de118d4b1e60a5da5408: don’t work

The specific problematic commit of the merge seems to be 4fdda2e66de0b7d37aa27af3c1bbe25ecb2d5408

The problematic patch is

@@ -170,10 +170,16 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
 	}
 
 	if (amdgpu_device_supports_boco(dev) &&
-	    (amdgpu_runtime_pm != 0)) /* enable runpm by default */
+	    (amdgpu_runtime_pm != 0)) /* enable runpm by default for boco */
 		adev->runpm = true;
 	else if (amdgpu_device_supports_baco(dev) &&
-		 (amdgpu_runtime_pm > 0)) /* enable runpm if runpm=1 */
+		 (amdgpu_runtime_pm != 0) &&
+		 (adev->asic_type >= CHIP_TOPAZ) &&
+		 (adev->asic_type != CHIP_VEGA20) &&
+		 (adev->asic_type != CHIP_ARCTURUS)) /* enable runpm on VI+ */
+		adev->runpm = true;
+	else if (amdgpu_device_supports_baco(dev) &&
+		 (amdgpu_runtime_pm > 0))  /* enable runpm if runpm=1 on CI */
 		adev->runpm = true;
 
 	/* Call ACPI methods: require modeset init

To fix my issue, I modified the integer comparison from
amdgpu_runtime_pm != 0 back to amdgpu_runtime_pm > 0
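
Why this one comparison matters can be sketched with a small Python model of the branch chain in the patch. It rests on several assumptions that are not stated in the diff itself: that amdgpu_runtime_pm defaults to -1 (auto), that the RX 580 reports BACO but not BOCO support and counts as VI or newer, and that the change was made on the new BACO branch only. The function and parameter names are made up for this illustration.

```python
def runpm_enabled(variant, supports_boco, supports_baco, runtime_pm,
                  is_vi_or_newer, is_excluded=False):
    """Return True if the modelled branch chain would set adev->runpm.

    variant: "before" (pre-patch), "after" (patched) or "fixed"
    (patched, with the new BACO branch using > 0 again).
    is_vi_or_newer stands in for asic_type >= CHIP_TOPAZ.
    is_excluded stands in for CHIP_VEGA20 / CHIP_ARCTURUS.
    """
    # First branch, identical in all variants: BOCO and runpm not disabled.
    if supports_boco and runtime_pm != 0:
        return True
    # Branch added by the patch: BACO on VI+ parts.
    if variant != "before" and supports_baco and is_vi_or_newer and not is_excluded:
        wanted = runtime_pm != 0 if variant == "after" else runtime_pm > 0
        if wanted:
            return True
    # Last branch: BACO only when runpm=1 is requested explicitly.
    return bool(supports_baco and runtime_pm > 0)


if __name__ == "__main__":
    # Assumed RX 580 situation with the default (auto) module parameter.
    for variant in ("before", "after", "fixed"):
        print(variant, runpm_enabled(variant, False, True, -1, True))
```

Under these assumptions, the patch turns runtime power management on for such a card without any module parameter being set, while both the older code and the modified comparison leave it off unless runpm=1 is passed.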

(reported to the amd driver issue tracker)

Now trying to apply this modification to the 6.1 kernel

This fix is enough for some kernel versions.
Tested with kernel 5.13: all good.
With 6.1, too many things have changed for it to work.

Let’s try to bisect when/where this fix stops being enough.

First tries:
ccd1950c2f7e38ae45aeefb99a08b39407cd6c63: bad
5745d647d5563d3e9d32013ad4e5c629acff04d7: bad
2ba047855096fff551402a87272b520fe97323f5: bad

It seems that setting amdgpu.runpm=0 together with pci=nomsi fixes the issue.
Related link:

It seems to be a Xen bug related to missing MSI.

Probably related? (Definitely needs to be tested):

Anyway, with this config it seems I can use my RX 580 with kernel 6.1 on my old computer.
Now let’s try to transfer the VM to my new computer.

Transferred the VM: it does NOT work.

The error message is this one: Ryzen 7000 serie - #41 by neowutran

  • The VM is identical to the one on my old computer (used Qubes backup/restore to transfer it)
  • The GPU is identical to the one on my old computer (unplugged it from the old one to plug it into the new one)

Sooo, the only differences I see are:

  • Not the same motherboard / CPU. Can’t do anything about that
  • Not the same Xen version (on the old computer I am using the standard Xen 4.14, on the new one I am using the master branch (4.18 unstable))
  • ???

So I guess I am back to compiling Qubes ISOs with different Xen versions to check if it makes a difference.
Or if anyone has a better idea :slight_smile:

Created a Qubes R4.2 ISO with all the needed patches.
Installed it on both my old computer and new computer.
Configured everything the same way.
Used the exact same GPU.

Old computer: it works
New computer: it doesn’t work (same error as before).

So I ruled out all the driver and software issues. What is left is a hardware issue on my new computer (everything except the GPU; the issue is most likely related to the motherboard).
I guess I am only left with:

  • Randomly modifying the BIOS configuration (highly unlikely to change anything)
  • Being a bit more annoying with ASUS support so that they fix their shitty hardware
  • RMA the motherboard and use another brand

Or if anyone has any idea, I will take anything.

Update: Got an answer from ASUS. Basically, they escalated to AMD, they know there is a bug in the Linux kernel, and apparently they got the middle finger when talking about a fix. So it won’t be fixed.
Maybe they will make some effort when they release their server CPUs, but either I find a workaround myself, or resell the computer, or turn it into a compilation machine (since it has proven to be quite impressive at compilation).

Sorry if you already mentioned trying this somewhere:
Have you tried disabling resizable bar in bios?

And also patched stubdom-linux-rootfs.gz per Contents/windows-gaming-hvm.md at master · Qubes-Community/Contents · GitHub

Sadly, I already tried all of that.

I was a bit frustrated, but I have not yet tried a Windows VM on the new computer; it could help to really rule out driver issues. Since AMD knows there are bugs in the Linux kernel, maybe they fixed some things in the Windows kernel.

Update 1: The Windows VM doesn’t work either.
Update 2: my error seems linked to the BIOS parameter “Above 4G” (at least it crashes in a different way).
I swear, this BIOS is Satan’s work. Everything breaks in unexpected ways, and setting some options requires hard resetting the BIOS. If you set a parameter, either it is not respected, or it breaks something completely unrelated, or, from time to time, it does what it is supposed to do. It takes 5 minutes between each reboot just for the BIOS page to show up. I am going crazy.

By modifying “SR-IOV”, “Above 4G”, and “Resizable BAR” in the BIOS, I can reach different errors: some buffer overflows in the Linux kernel and other things. It is going to take a lot of time to understand what is going on. The host crashes when less than X amount of RAM is used, etc.

Update: Removed this patch: qemu: fix TOLUD for PCI passthrough by mati7337 · Pull Request #44 · QubesOS/qubes-vmm-xen-stubdom-linux · GitHub, and manually modified rootfs.
Now the host crashes reliably, no matter the amount of RAM provided to the VM. I don’t know whether I am closer to a solution or not. This host crash happens when the VM is trying to load the GPU drivers (same behavior for the Windows VM and the Linux VM).

New try.
Based on what I tested, I am left only with a motherboard / BIOS issue. I cannot see how it could be anything else.
However, there are too many possibilities to have any credible chance of success by blindly trying BIOS parameters.
With my latest configuration, everything seems to work until the guest VM tries to load the driver (Windows VM or Linux VM, it doesn’t matter).

Maybe the GPU is not in a state where it can receive instructions?
So let’s compare the state of the GPU just before doing the GPU passthrough, on my new computer and on my old computer.
To do that I will list the values of all the files in /sys/bus/pci/devices/ID_OF_THE_GPU.0/ and /sys/bus/pci/devices/ID_OF_THE_GPU.1/
(If anyone has a better way / tool to display the information of a GPU PCI device without loading its driver, don’t hesitate to tell me.)
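
This kind of comparison can be scripted. A minimal Python sketch, assuming it runs as root in dom0 on each machine: dump_attributes and diff_dumps are names made up for this illustration, binary attributes such as config are reduced to an md5 digest, and read errors are recorded instead of aborting. Reading every sysfs attribute is assumed to be harmless here, which may not hold for every device.

```python
import hashlib
import os


def dump_attributes(device_dir):
    """Return {relative_path: value} for the regular files under device_dir."""
    values = {}
    # followlinks=False keeps the walk inside this device's own directory.
    for root, _dirs, files in os.walk(device_dir, followlinks=False):
        for name in sorted(files):
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue
            rel = os.path.relpath(path, device_dir)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError as error:
                # Some attributes are write-only or fail with an I/O error.
                values[rel] = "<unreadable: %s>" % error.strerror
                continue
            try:
                text = data.decode("utf-8")
            except UnicodeDecodeError:
                text = None
            if text is None or "\x00" in text:
                # Binary attribute such as "config": record a digest instead.
                values[rel] = "md5:" + hashlib.md5(data).hexdigest()
            else:
                values[rel] = text.strip()
    return values


def diff_dumps(first, second):
    """Return {path: (first_value, second_value)} for entries that differ."""
    paths = sorted(set(first) | set(second))
    return {p: (first.get(p), second.get(p))
            for p in paths if first.get(p) != second.get(p)}
```

One way to use it: call dump_attributes("/sys/bus/pci/devices/ID_OF_THE_GPU.0") on each computer, save the resulting dictionaries (for example as JSON), and pass both to diff_dumps to get only the attributes whose values differ.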

New computer, GPU itself (.0):

  • aer_dev_correctable: every value is 0, TOTAL_ERR_COR 0
  • aer_dev_fatal: same
  • aer_dev_nonfatal: same
  • ari_enabled: 1
  • boot_vga: 0
  • broken_parity_status: 0
  • class: 0x030000
  • config: (it is binary data, so instead, writing md5sum: 31fcb27e49505711aae05bfdedc1b4ea )
  • consistent_dma_mask_bits: 32
  • consumer:pci:XXXXXX : (irrelevant)
  • current_link_speed: 8.0 GT/s PCIe
  • current_link_width: 16
  • d3cold_allowed: 1
  • device: 0x67df
  • dma_mask_bits: 32
  • driver: pciback
  • driver_override: (null)
  • enable: 0
  • firmware_node: (irrelevant ?)
  • irq: 24
  • link - clkpm : 0
  • link - l1_1_pcipm : 0
  • link - l1_aspm : 1
  • local_cpulist: 0-15
  • local_cpus: ffff
  • max_link_speed: 8.0 GT/s
  • max_link_width: 16
  • modalias: pci:v00001002d000067DFsv00001043sd00008877bc03sc00i00
  • msi_bus: 1
  • numa_node: -1
  • power - autosuspend_delay_ms: Input/output error
  • power - runtime_active_time: 2070000
  • power - control: on
  • power - runtime_status: active
  • power - runtime_suspended_time: 0
  • power - wakeup: disabled
  • power - wakeup_* : (everything is empty)
  • power_state: D0
  • reset_method: bus
  • revision: 0xe7
  • vendor: 0x1002
  • subsystem_device: 0x8877
  • subsystem_vendor: 0x1043

Old computer (only noting the values that differ):

  • config: (it is binary data, md5sum: 9afd7…)
  • current_link_width: 4 (nothing surprising here, plugged it in the first slot available)
  • irq: 50
  • local_cpulist: 0-7
  • local_cpus: ff
  • modalias: pci:v00001002d000067DFsv00001043sd00000525bc03sc00i00
  • power - runtime_active_time: 692657
  • subsystem_device: 0x0525

So nothing there looks interesting; I need to find another idea.

So you believe it could be an issue with your specific PC? I could try testing the .iso on my Zephyrus, it may help narrow down the issue if you had a few more testers.

edit: oh right, the whole point is getting passthrough working.

If I were you, I’d register an account on the level1techs forum; there are lots of people there who are interested in (and experienced with) passthrough.

I was able to pass through an Nvidia 1070 with the new computer.
(The NVIDIA driver seems not to like pci=nomsi; if I remove it, everything works correctly with Nvidia. Setting NVreg_EnableMSI=0 does not seem to be necessary.)

So it is very strange that I am unable to pass through the RX 580.

I am going to reboot a few times to confirm that I can always reliably pass through the Nvidia 1070.

@Cpotts
I will link you an R4.2 ISO later this week

@fjdh
Good idea. For the moment I just posted it on reddit: Zen4 - RX580 - Xen : VFIO

Update:

  • Passthrough of the 1070 works reliably

Going to try all of that: drm/amdgpu AMDgpu driver — The Linux Kernel documentation

There might also be signed weekly iso builds for 4.2 just around the corner:

/me refresh Index of /qubes/iso/ a third time … :wink:

Successful GPU passthrough of the RX 580 on my new computer!!!

The last step (I don’t know if the other ones were required) was to … downgrade the Linux kernel back to 5.4 LTS.
So there is another issue in recent versions of the Linux kernel that I will need to bisect.

So now I will try to remove as many of the modifications I made as I can, to check whether only the Linux kernel downgrade is required (from my previous tests I expect that it requires more than just the kernel downgrade).

Currently the passthrough conditions are:

  • Using an old kernel (currently 5.9)
  • Not passing through the audio part of the GPU
  • In the BIOS: “Resizable BAR support” must be disabled
  • In the BIOS: “CSM Support” must be disabled

Something interesting: on my old computer, I needed to use the boot parameter pci=nomsi to pass through the RX 580 on kernel >= 5.7. This does not seem to be required on the new computer; it works fine until 5.10.

I now need to find the boot parameters required to use a recent kernel, and bisect the change in the amdgpu driver introduced for kernel 5.10.

Also, does anyone have a theory on why I can’t pass through the audio part of the GPU, but only on my new computer?

Wow, you have enabled an entirely new tier of hardware to work with Qubes OS. Bravo! It just goes to show how dedicated this community is to Qubes and security.

Congrats. :slight_smile:
Is this on Xen 4.17/4.18? So the 1070 works and the RX 580 doesn’t, on kernel 5.10.x or any newer kernel? Again, I’d ask around on the L1T forum; specifically, the user gnif did a lot of work on getting passthrough to work in 2020/2021. He might know whether there were changes relating to this in that interval.