AMD Raven Ridge APU freezing and crashing

wind.gmbh · July 23, 2021, 1:10pm

I have an AMD Raven Ridge (-like) APU with a Radeon Vega 8 (V1605B) as one of my R4.1 systems to play around with.

Since R4.1 moved to kernel branches newer than 5.4, I have been experiencing random freezes and crashes.
I was not always able to retrieve logs, but in the cases where I did, amdgpu was involved.

Whenever a new stable-5.10 or latest release becomes available in the repositories, I try it out, with mostly similar results.
If I do not want the system to freeze or crash, I boot it with a self-compiled stable-5.4 kernel, which runs rock solid.

The issues are mainly Xorg crashes and freezes of the whole system. In the first case I was able to obtain some kernel buffer messages; in the latter I could not get any logs.

Xorg crash example

[59968.665419] amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32770, for process Xorg pid 2412 thread X:cs0 pid 2512)
[59968.665421] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x0000800107a4a000 from client 27
[59968.665422] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00101031
[59968.665423] amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[59968.665424] amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[59968.665425] amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[59968.665426] amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[59968.665427] amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[59968.665428] amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x0
[59968.911907] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
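
For anyone comparing fault logs: the individual fields printed above can be recovered from the raw VM_L2_PROTECTION_FAULT_STATUS word. Below is a minimal sketch in bash. The bit layout (MORE_FAULTS bit 0, WALKER_ERROR bits 1-3, PERMISSION_FAULTS bits 4-7, MAPPING_ERROR bit 8, client ID bits 9-17, RW bit 18, VMID bits 20-23) is an assumption based on the GFX9 register definitions, not something stated in this thread, and the function name decode_fault_status is made up for illustration. For the value 0x00101031 it does reproduce the fields shown in the log.

```shell
#!/usr/bin/env bash
# Decode an amdgpu VM_L2_PROTECTION_FAULT_STATUS value into its fields.
# ASSUMPTION: GFX9 (Vega/Raven) bit layout; other GPU generations may differ.
# decode_fault_status is a hypothetical helper, not part of any tool.
decode_fault_status() {
    local status=$(( $1 ))   # accepts hex such as 0x00101031
    printf 'MORE_FAULTS: 0x%x\n'       $(( status         & 0x1   ))
    printf 'WALKER_ERROR: 0x%x\n'      $(( (status >> 1)  & 0x7   ))
    printf 'PERMISSION_FAULTS: 0x%x\n' $(( (status >> 4)  & 0xf   ))
    printf 'MAPPING_ERROR: 0x%x\n'     $(( (status >> 8)  & 0x1   ))
    printf 'CID: 0x%x\n'               $(( (status >> 9)  & 0x1ff ))
    printf 'RW: 0x%x\n'                $(( (status >> 18) & 0x1   ))
    printf 'VMID: 0x%x\n'              $(( (status >> 20) & 0xf   ))
}

# Status word taken from the log above
decode_fault_status 0x00101031
```

The CID value 0x8 corresponds to the "Faulty UTCL2 client ID: TCP (0x8)" line, and VMID 0x1 matches the vmid:1 in the first message.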

This is most likely a driver/kernel/Xen issue, but since this is the most likely place to find people who run Xen on Raven Ridge, I wanted to ask here whether other Raven Ridge users are seeing this problem, too.

wind.gmbh · September 29, 2021, 9:39am

To anyone who is looking for a solution: I have applied the kernel parameter amdgpu.noretry=0, which appears to have solved the problem.
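
To confirm that the parameter is actually in effect after a reboot, something like the following can be run in the domain that drives the GPU (presumably dom0). This is only a sketch: cmdline_value is a hypothetical helper name, and the sysfs path assumes the amdgpu module is loaded and exposes its parameters there.

```shell
#!/usr/bin/env bash
# Print the value of a kernel command-line parameter (last occurrence wins),
# or an empty line if it is not set.
# Usage: cmdline_value NAME [CMDLINE]   (CMDLINE defaults to /proc/cmdline)
# cmdline_value is a hypothetical helper, not an existing tool.
cmdline_value() {
    local name=$1 cmdline=${2-$(cat /proc/cmdline 2>/dev/null)} word value=
    for word in $cmdline; do
        case $word in
            "$name="*) value=${word#*=} ;;
        esac
    done
    printf '%s\n' "$value"
}

# What the running kernel was booted with
cmdline_value amdgpu.noretry

# What the loaded module reports (ASSUMPTION: amdgpu is loaded)
cat /sys/module/amdgpu/parameters/noretry 2>/dev/null || true
```

If the first command prints an empty line, the parameter did not make it onto the kernel command line, and the bootloader configuration is worth rechecking.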

Xorg still throws these warnings occasionally:

[136780.958] (EE) event3  - HID 046a:0023: client bug: event processing lagging behind by 16ms, your system is too slow

Before I applied this kernel parameter, these events inevitably led to either a system freeze or an Xorg crash.
Now the system just lags for a few seconds.

olf · August 14, 2022, 8:42am

This might be related to Issue2982, where downgrading the linux-firmware package solves the problem (temporarily, until a newer stable version comes out).