[4.2] Sudden soft lockup, probably related to AMD Radeon

So far, I’ve had two total freezes in the last week. In the first case there was nothing in journalctl, IIRC. In the second case, there was some info about a soft lockup:

Nov 23 17:13:39 dom0 kernel: xen-blkback: backend/vbd/45/51712: using 2 queues, protocol 1 (x86_64-abi) persistent grants
Nov 23 17:13:39 dom0 kernel: xen-blkback: backend/vbd/45/51728: using 2 queues, protocol 1 (x86_64-abi) persistent grants
Nov 23 17:13:39 dom0 kernel: xen-blkback: backend/vbd/45/51744: using 2 queues, protocol 1 (x86_64-abi) persistent grants
Nov 23 17:14:04 dom0 kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [Xorg:3761]
Nov 23 17:14:04 dom0 kernel: Modules linked in: snd_seq_dummy snd_hrtimer nct6775 nct6775_core hwmon_vid lm83 jc42 vfat fat snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi intel_rapl_msr snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi intel_rapl_common snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device joydev snd_pcm snd_timer wmi_bmof pcspkr snd r8169 soundcore k10temp i2c_piix4 gpio_amdpt gpio_generic loop fuse xenfs dm_thin_pool dm_persistent_data dm_bio_prison dm_crypt amdgpu amdxcp iommu_v2 drm_buddy gpu_sched hid_elecom radeon crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic drm_ttm_helper ttm video i2c_algo_bit drm_suballoc_helper drm_display_helper xhci_pci ghash_clmulni_intel xhci_pci_renesas sha512_ssse3 cec ccp nvme xhci_hcd sp5100_tco nvme_core nvme_common wmi xen_acpi_processor xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath i2c_dev
Nov 23 17:14:04 dom0 kernel: CPU: 2 PID: 3761 Comm: Xorg Not tainted 6.5.10-1.qubes.fc37.x86_64 #1
Nov 23 17:14:04 dom0 kernel: Hardware name: ASUS System Product Name/TUF GAMING B550-PLUS, BIOS 3404 10/07/2023
Nov 23 17:14:04 dom0 kernel: RIP: e030:smp_call_function_many_cond+0x121/0x4f0
Nov 23 17:14:04 dom0 kernel: Code: 63 d0 e8 d2 e5 61 00 3b 05 2c 53 d1 01 73 25 48 63 d0 49 8b 37 48 03 34 d5 00 eb 9c 82 8b 56 08 83 e2 01 74 0a f3 90 8b 4e 08 <83> e1 01 75 f6 83 c0 01 eb c1 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e
Nov 23 17:14:04 dom0 kernel: RSP: e02b:ffffc9004518f828 EFLAGS: 00000202
Nov 23 17:14:04 dom0 kernel: RAX: 0000000000000000 RBX: 0000000000000208 RCX: 0000000000000011
Nov 23 17:14:04 dom0 kernel: RDX: 0000000000000001 RSI: ffff888134c3bb80 RDI: ffff888100075ee0
Nov 23 17:14:04 dom0 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
Nov 23 17:14:04 dom0 kernel: R10: 0000000000007ff0 R11: 0000000000000000 R12: ffff888134cb5100
Nov 23 17:14:04 dom0 kernel: R13: 0000000000000001 R14: 0000000000000002 R15: ffff888134cb5100
Nov 23 17:14:04 dom0 kernel: FS:  00007150d2fc4a80(0000) GS:ffff888134c80000(0000) knlGS:0000000000000000
Nov 23 17:14:04 dom0 kernel: CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 23 17:14:04 dom0 kernel: CR2: 00007c7d0a39e020 CR3: 0000000128e2c000 CR4: 0000000000050660
Nov 23 17:14:04 dom0 kernel: Call Trace:
Nov 23 17:14:04 dom0 kernel:  <IRQ>
Nov 23 17:14:04 dom0 kernel:  ? watchdog_timer_fn+0x1b8/0x220
Nov 23 17:14:04 dom0 kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
Nov 23 17:14:04 dom0 kernel:  ? __hrtimer_run_queues+0x112/0x2b0
Nov 23 17:14:04 dom0 kernel:  ? hrtimer_interrupt+0xf8/0x230
Nov 23 17:14:04 dom0 kernel:  ? xen_timer_interrupt+0x22/0x30
Nov 23 17:14:04 dom0 kernel:  ? __handle_irq_event_percpu+0x4a/0x1a0
Nov 23 17:14:04 dom0 kernel:  ? handle_irq_event_percpu+0x13/0x40
Nov 23 17:14:04 dom0 kernel:  ? handle_percpu_irq+0x3b/0x60
Nov 23 17:14:04 dom0 kernel:  ? handle_irq_desc+0x3e/0x50
Nov 23 17:14:04 dom0 kernel:  ? __evtchn_fifo_handle_events+0x1b4/0x1e0
Nov 23 17:14:04 dom0 kernel:  ? __xen_evtchn_do_upcall+0x65/0xb0
Nov 23 17:14:04 dom0 kernel:  ? __xen_pv_evtchn_do_upcall+0x21/0x30
Nov 23 17:14:04 dom0 kernel:  ? xen_pv_evtchn_do_upcall+0x85/0xb0
Nov 23 17:14:04 dom0 kernel:  </IRQ>
Nov 23 17:14:04 dom0 kernel:  <TASK>
Nov 23 17:14:04 dom0 kernel:  ? exc_xen_hypervisor_callback+0x8/0x20
Nov 23 17:14:04 dom0 kernel:  ? smp_call_function_many_cond+0x121/0x4f0
Nov 23 17:14:04 dom0 kernel:  ? smp_call_function_many_cond+0xfe/0x4f0
Nov 23 17:14:04 dom0 kernel:  ? __pfx_do_flush_tlb_all+0x10/0x10
Nov 23 17:14:04 dom0 kernel:  on_each_cpu_cond_mask+0x24/0x40
Nov 23 17:14:04 dom0 kernel:  __purge_vmap_area_lazy+0xd6/0x7d0
Nov 23 17:14:04 dom0 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Nov 23 17:14:04 dom0 kernel:  ? xa_find+0x90/0xe0
Nov 23 17:14:04 dom0 kernel:  _vm_unmap_aliases+0x264/0x2d0
Nov 23 17:14:04 dom0 kernel:  change_page_attr_set_clr+0xb4/0x1a0
Nov 23 17:14:04 dom0 kernel:  _set_pages_array+0xc3/0x110
Nov 23 17:14:04 dom0 kernel:  ttm_pool_alloc+0x410/0x540 [ttm]
Nov 23 17:14:04 dom0 kernel:  ttm_tt_populate+0xa1/0x130 [ttm]
Nov 23 17:14:04 dom0 kernel:  ttm_bo_handle_move_mem+0x162/0x170 [ttm]
Nov 23 17:14:04 dom0 kernel:  ttm_bo_validate+0xe5/0x180 [ttm]
Nov 23 17:14:04 dom0 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Nov 23 17:14:04 dom0 kernel:  ttm_bo_init_reserved+0x146/0x170 [ttm]
Nov 23 17:14:04 dom0 kernel:  ttm_bo_init_validate+0x5a/0xe0 [ttm]
Nov 23 17:14:04 dom0 kernel:  ? __pfx_radeon_ttm_bo_destroy+0x10/0x10 [radeon]
Nov 23 17:14:04 dom0 kernel:  radeon_bo_create+0x153/0x1e0 [radeon]
Nov 23 17:14:04 dom0 kernel:  ? __pfx_radeon_ttm_bo_destroy+0x10/0x10 [radeon]
Nov 23 17:14:04 dom0 kernel:  radeon_gem_object_create+0xb7/0x1c0 [radeon]
Nov 23 17:14:04 dom0 kernel:  ? ____sys_recvmsg+0xf5/0x1d0
Nov 23 17:14:04 dom0 kernel:  radeon_gem_create_ioctl+0x77/0x130 [radeon]
Nov 23 17:14:04 dom0 kernel:  ? __pfx_radeon_gem_create_ioctl+0x10/0x10 [radeon]
Nov 23 17:14:04 dom0 kernel:  drm_ioctl_kernel+0xcd/0x170
Nov 23 17:14:04 dom0 kernel:  drm_ioctl+0x267/0x4a0
Nov 23 17:14:04 dom0 kernel:  ? __pfx_radeon_gem_create_ioctl+0x10/0x10 [radeon]
Nov 23 17:14:04 dom0 kernel:  radeon_drm_ioctl+0x4d/0x80 [radeon]
Nov 23 17:14:04 dom0 kernel:  __x64_sys_ioctl+0x97/0xd0
Nov 23 17:14:04 dom0 kernel:  do_syscall_64+0x5f/0x90
Nov 23 17:14:04 dom0 kernel:  ? do_syscall_64+0x6b/0x90
Nov 23 17:14:04 dom0 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Nov 23 17:14:04 dom0 kernel:  ? do_syscall_64+0x6b/0x90
Nov 23 17:14:04 dom0 kernel:  ? exit_to_user_mode_prepare+0xa7/0xd0
Nov 23 17:14:04 dom0 kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Nov 23 17:14:04 dom0 kernel: RIP: 0033:0x7150d36b9e0f
Nov 23 17:14:04 dom0 kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Nov 23 17:14:04 dom0 kernel: RSP: 002b:00007ffe1404ad10 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 23 17:14:04 dom0 kernel: RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007150d36b9e0f
Nov 23 17:14:04 dom0 kernel: RDX: 00007ffe1404ade0 RSI: 00000000c020645d RDI: 0000000000000019
Nov 23 17:14:04 dom0 kernel: RBP: 00007ffe1404ade0 R08: 0000000000000011 R09: 0000000000000010
Nov 23 17:14:04 dom0 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 00000000c020645d
Nov 23 17:14:04 dom0 kernel: R13: 0000000000000019 R14: 0000000000080000 R15: 00007150c810a010
Nov 23 17:14:04 dom0 kernel:  </TASK>

There are more messages like this; I’m not sure whether they are relevant.

Any ideas?

EDIT: I didn’t have swap. Could this be the root of the issue?

Lack of swap won’t cause that. That error means a CPU core went missing in action and isn’t responding to anything anymore (at least, not from the kernel’s perspective).

I’d spend a bit of time validating your hardware. Run a memtest, do some stress tests off a USB boot of Linux, etc. You shouldn’t be seeing that unless you’re doing something exceedingly funky, at which point you wouldn’t be asking the question. How do CPU temperatures look during normal operation?
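For concreteness, a minimal sketch of that kind of validation, run from a live Linux USB rather than from dom0; the specific tools named here (memtester, stress-ng, lm_sensors) are only suggestions, not something anyone in this thread confirmed using:

sudo memtester 8G 3                                          # RAM test from a running system; MemTest86/memtest86+ booted from USB is more thorough
sudo stress-ng --cpu 0 --vm 2 --vm-bytes 75% --timeout 30m   # load all cores plus memory for 30 minutes
# ...while keeping an eye on CPU temperatures (e.g. with lm_sensors) in another terminal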

I also spent a while running down a similar issue (random system freezes) that turned out, after an awful lot of investigation, to be a bad power supply.

A bad PSU is one of my prime suspects from my own experience with Qubes OS freezes as well.

What did you do about the bad power supply?

Before the lockup

I was testing the SSD under heavy sequential read, specifically watching its temperatures. After concluding that the temperatures were OK, I stopped the experiment. The lockup occurred about 10 minutes later, while starting a qube.

While this scenario isn’t usual, it isn’t exceedingly funky either, and it looks unrelated to the lockup (SSD vs. GPU drivers, plus the delay).
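A test of this kind can look roughly like the following; the exact commands are illustrative (assuming an NVMe drive plus the nvme-cli tool), not a transcript of what was actually run:

sudo dd if=/dev/nvme0n1 of=/dev/null bs=1M status=progress          # heavy sequential read of the raw device (read-only)
watch -n 5 'sudo nvme smart-log /dev/nvme0 | grep -i temperature'   # in a second terminal, watch the drive temperature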

CPU temperature (looks OK)

I’ve tried an unrealistically heavy load on all cores, which got CPUTIN to 42 °C. After the load ended, the temperature dropped quickly. (I believe I would see a higher temperature after an hour of full load, but still…)
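For anyone wanting to reproduce this check: the module list above shows nct6775 and k10temp loaded in dom0, so something like the following should expose CPUTIN (board sensor) and Tctl/Tdie (CPU). This assumes lm_sensors is installed in dom0, and the exact labels vary by board:

sudo sensors-detect --auto   # one-time setup; should find the Nuvoton Super I/O chip behind nct6775
watch -n 2 sensors           # CPUTIN comes from the board sensor, Tctl/Tdie from k10temp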

RAM (ECC, hard to test, unlikely cause)

The RAM is DDR4 ECC running with a Zen 3 CPU (5800X). Doing a memtest is a tough job – the memory controller should correct memory errors, but it doesn’t necessarily tell the software it has done so. (A sketch of how to check the kernel’s EDAC counters, where available, follows the list below.)

  • Corrected memory faults shouldn’t cause any issue.
  • Detected memory faults should cause just a crash, not a lockup.
  • Non-corrected (or improperly corrected) memory faults (as if the memory weren’t ECC) tend to have more varied symptoms. (I’ve experienced this with an older computer with non-ECC DDR3.)
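On the reporting point: whether corrected errors become visible to the OS depends on the board and BIOS, but if the kernel’s EDAC support has bound to the memory controller (amd64_edac on this platform), the counters can be read from sysfs. A hedged sketch:

grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count   # corrected / uncorrected error counts, if EDAC is active
journalctl -k | grep -iE 'edac|mce|machine check'                                          # machine-check and EDAC events in the kernel log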

Swap (not sure)

I originally thought it couldn’t be related, as a typical OOM doesn’t look like that. However, it depends on what hits the memory limit: if a user-space process runs out of memory, it simply gets killed, but if the kernel (or a kernel module) fails an allocation, the situation might be worse.

Also, I asked ChatGPT and Bard to analyse the stack trace. Bard highlighted the function ttm_bo_init_validate, which is related to memory allocation. (I’ve independently checked the function’s documentation.)

Maybe the missing swap wasn’t the cause, but I’ve added swap just to be sure.
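For completeness, one way to add swap after the fact in dom0 is a swap file on the (already LUKS-encrypted) root filesystem; this is a hedged sketch of that approach, not necessarily what was done here:

sudo dd if=/dev/zero of=/swapfile bs=1M count=8192 status=progress   # 8 GiB; dd rather than fallocate, which isn't safe for swap on every filesystem
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap defaults 0 0' | sudo tee -a /etc/fstab     # make it persistent across reboots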

GPU (maybe)

Well, I have recently been experiencing a few issues that point to the GPU, most notably a few bad glyphs in dom0 (i3wm and XFCE Terminal). I’m not sure whether the GPU renders the fonts, but it sounds plausible.

My previous issues (briefly)

This might be unrelated, but I am adding it for context:

Before the lockups, I was getting random reboots. They seem to have been of two categories:

  • Sudden reboot without any warning. Nothing in the system log. The unsafe-shutdown count on the SSD probably increased. Those were happening even after a motherboard replacement, and even in the BIOS. A different PSU didn’t help. They disappeared* after replacing the SSD. It looks like the old SSD had a cold solder joint causing temporary disconnects.
  • Sudden rapid slowdown, Wi-Fi disconnected, log messages dom0 kernel: xen-blkback: Scheduled work from previous purge is still busy, cannot purge list (they were stored, so the SSD wasn’t disconnected completely), and then a reboot (probably triggered by a watchdog). It looks much like sudden PCIe throttling, but with no thermal cause. This issue persisted even after the SSD replacement. Since removing the Wi-Fi card, they haven’t reappeared. (Given the random nature of these issues, it is too early to say they are gone for good.) This is a bit WTF, as the card was attached to the NetVM, which quite limits its potential impact. It might have been EMI, a short circuit, or some other low-level issue such as a mess with rapid interrupts.

I am not 100% sure that these categories are distinct, i.e. that there hasn’t been any reboot with mixed symptoms.

After upgrading the SSD, I also upgraded Qubes to 4.2 RC4 in order to make SSD firmware updates easier (although it didn’t help). I see I am juggling too many balls at once. However, the second type of reboot was occurring even on 4.2 RC4, before I removed the Wi-Fi card.

PSU (not sure)

If the issues persist, I can try a different PSU for some time. I’m not sure there is anything else to do at the moment.


*) Well, there was one occurrence of a similar issue when no SSD was connected, but it could be attributed to a specific situation that left the power cable half-inserted. So I don’t count this single case.


My further digging points to the GPU.

  1. Issues with fonts in dom0 (not seen in domUs, which are rendered by the CPU) suggest that. i3 uses XRender, which can use the GPU. I’m still not 100% sure whether it is used for font rendering.
  2. I’ve looked through the past logs (see the command sketch after this list). There is a single occurrence of a GPU lockup, and it happened before the Qubes upgrade, i.e. with Qubes 4.1. The GPU lockup seems to have been followed by a (probably spontaneous) reboot, maybe triggered by a watchdog.
  3. In the past, the GPU sometimes showed artifacts on the right side of the screen when running at 144 Hz (the maximum refresh rate), and I had to downgrade it to 120 Hz. While I don’t have statistics for it, this was probably a predictor of the spontaneous reboots.
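The log search from point 2 can be reproduced with something along these lines (the exact filter pattern is illustrative, not what I actually typed):

journalctl --list-boots                                                          # find the boot before the reboot/lockup
journalctl -k -b -1 | grep -iE 'radeon|amdgpu|gpu lockup|soft lockup|\*ERROR\*'  # kernel messages from that boot, filtered for GPU trouble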

Again, I’ve asked Bard and ChatGPT to analyze the log. Bard highlighted VM_CONTEXT1_PROTECTION_FAULT_ADDR and provided a plausible description of it, but I was unable to verify whether that description is correct or just an AI hallucination.

Note that the usual temperature of my GPU is around 30 °C in the current weather, so it doesn’t look like overheating.

GPU vs. PSU

I still cannot exclude the possibility that the root issue is in the PSU, which would then cause the GPU to misbehave. While one could expect more diverse symptoms from a bad PSU, the GPU might simply be the component most sensitive to a brownout. The GPU is powered only by the +12 V and +3.3 V rails of the PCIe slot, with no external power connector. While my intuition suggests the GPU is more likely to blame than the PSU, I don’t seem to have a good argument supporting that.

(The SSD was apparently a separate issue, as I had the same troubles even with another PSU, and the new SSD seems to have resolved them.)

Log of the GPU lockup

Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0: ring 0 stalled for more than 10008msec
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0: GPU lockup (current fence id 0x0000000000039370 last fence id 0x0000000000039388 on ring 0)
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0: Saved 1823 dwords of commands on ring 0.
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0: GPU softreset: 0x000003ED
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0:   GRBM_STATUS               = 0xF5D04028
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0:   GRBM_STATUS_SE0           = 0xEE400000
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0:   GRBM_STATUS_SE1           = 0x00000006
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0:   SRBM_STATUS               = 0x20024FC0
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0:   SRBM_STATUS2              = 0x00000000
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0:   R_008678_CP_STALLED_STAT2 = 0x40000000
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0:   R_00867C_CP_BUSY_STAT     = 0x00408006
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0:   R_008680_CP_STAT          = 0x84228647
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0:   R_00D034_DMA_STATUS_REG   = 0x44E83566
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0:   R_00D834_DMA_STATUS_REG   = 0x60C83146
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0011AB25
Nov 15 17:03:45 dom0 kernel: radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05040020
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0: Wait for MC idle timedout !
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0: GRBM_SOFT_RESET=0x0000DDFF
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0: SRBM_SOFT_RESET=0x00128540
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0:   GRBM_STATUS               = 0xC0003028
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0:   GRBM_STATUS_SE0           = 0x80000006
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0:   GRBM_STATUS_SE1           = 0x00000006
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0:   SRBM_STATUS               = 0x20000EC0
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0:   SRBM_STATUS2              = 0x00000000
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0:   R_008680_CP_STAT          = 0x00000000
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
Nov 15 17:03:46 dom0 kernel: radeon 0000:07:00.0: GPU reset succeeded, trying to resume
Nov 15 17:03:51 dom0 kernel: [drm:atom_op_jump [radeon]] *ERROR* atombios stuck in loop for more than 5secs aborting
Nov 15 17:03:51 dom0 kernel: [drm:atom_execute_table_locked [radeon]] *ERROR* atombios stuck executing BC86 (len 302, WS 0, PS 4) @ 0xBCB0
Nov 15 17:03:51 dom0 kernel: [drm:atom_execute_table_locked [radeon]] *ERROR* atombios stuck executing B48C (len 94, WS 12, PS 8) @ 0xB4D5
Nov 15 17:03:51 dom0 kernel: radeon 0000:07:00.0: Wait for MC idle timedout !
Nov 15 17:03:52 dom0 kernel: radeon 0000:07:00.0: Wait for MC idle timedout !
Nov 15 17:03:52 dom0 kernel: [drm] PCIE GART of 2048M enabled (table at 0x0000000000165000).
Nov 15 17:03:52 dom0 kernel: radeon 0000:07:00.0: WB enabled
Nov 15 17:03:52 dom0 kernel: radeon 0000:07:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00
Nov 15 17:03:52 dom0 kernel: radeon 0000:07:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04
Nov 15 17:03:52 dom0 kernel: radeon 0000:07:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08
Nov 15 17:03:52 dom0 kernel: radeon 0000:07:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c
Nov 15 17:03:52 dom0 kernel: radeon 0000:07:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10
Nov 15 17:03:52 dom0 kernel: radeon 0000:07:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18
Nov 15 17:03:52 dom0 kernel: debugfs: File 'radeon_ring_gfx' in directory '0' already present!
Nov 15 17:03:52 dom0 kernel: debugfs: File 'radeon_ring_cp1' in directory '0' already present!
Nov 15 17:03:52 dom0 kernel: debugfs: File 'radeon_ring_cp2' in directory '0' already present!
Nov 15 17:03:52 dom0 kernel: debugfs: File 'radeon_ring_dma1' in directory '0' already present!
Nov 15 17:03:52 dom0 kernel: debugfs: File 'radeon_ring_dma2' in directory '0' already present!
Nov 15 17:03:52 dom0 kernel: [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x850C)=0xCAFEDEAD)
Nov 15 17:03:52 dom0 kernel: [drm:si_resume [radeon]] *ERROR* si startup failed on resume
Nov 15 17:03:52 dom0 kernel: [drm:si_dpm_set_power_state [radeon]] *ERROR* si_set_sw_state failed
Nov 15 17:03:53 dom0 kernel: [drm:si_dpm_set_power_state [radeon]] *ERROR* si_restrict_performance_levels_before_switch failed
Nov 15 17:03:53 dom0 kernel: [drm:si_dpm_set_power_state [radeon]] *ERROR* si_restrict_performance_levels_before_switch failed
Nov 15 17:03:54 dom0 kernel: [drm:radeon_dp_link_train_cr [radeon]] *ERROR* displayport link status failed
Nov 15 17:03:54 dom0 kernel: [drm:radeon_dp_link_train_cr [radeon]] *ERROR* clock recovery failed
Nov 15 17:04:02 dom0 kernel: radeon 0000:07:00.0: ring 0 stalled for more than 10470msec
Nov 15 17:04:02 dom0 kernel: radeon 0000:07:00.0: GPU lockup (current fence id 0x0000000000039370 last fence id 0x0000000000039388 on ring 0)

I convinced myself, by testing the suspect hardware in other systems, that it was all fine, and that left the PSU. I hadn’t generally run into “partially failed PSUs” before, but this one certainly qualified. The system was getting progressively less stable over time, and while it would previously only lock up under heavy GPU compute loads, it started locking up at the desktop.

I was dead certain I had a bad GPU until I tested it in a different system, and it ran at full compute load for days on end without problem. Everything was consistent with a bad GPU overheating and locking up, including the system not POSTing until it had cooled down for a minute or two after heavy GPU load.

I was wrong.

I don’t know for sure what your issue is, obviously, but if you have a spare PSU, it would be worth trying it with that. The system has been rock solid since I replaced the PSU.

I had two GPUs installed for about a week (maybe 5 days or so) and got one random freeze without any explanation in the logs. After removing the old (unused) GPU, I haven’t experienced those issues for months. So, probably…

I’m not sure whether the cause is in the hardware or in the drivers. While both are AMD GPUs, the old one uses the radeon driver, while the new one uses amdgpu.
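To check which kernel driver is actually bound to each card (useful when radeon and amdgpu hardware is mixed), something like this works in dom0:

lspci -k | grep -A 3 -E 'VGA|Display'   # the "Kernel driver in use:" line shows radeon vs. amdgpu per GPU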
