[4.2] Sudden soft lockup, probably related to AMD Radeon

v6ak · November 24, 2023, 1:08pm

Before the lockup

I had just been testing the SSD under heavy sequential read, specifically watching its temperatures. After concluding that the temperatures were OK, I stopped the experiment. The lockup occurred about 10 minutes later, while starting a qube.

While this scenario isn’t usual, it isn’t exceedingly funky, and it looks unrelated to the lockup (SSD vs. GPU drivers, plus the delay).

CPU temperature (looks OK)

I’ve tried an unrealistic load on all cores, which got CPUTIN to 42 °C. After the load ended, the temperature dropped quickly. (I believe I would get a higher temperature after an hour of full load, but still…)
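For anyone who wants to log such readings during a load test, here is a minimal sketch that reads temperatures from the kernel's hwmon sysfs tree (`temp*_input` files hold millidegrees Celsius, `temp*_label` is optional). The function name `read_hwmon_temps` is made up for this example, and I am assuming the sensor driver that provides CPUTIN is loaded in dom0.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {(chip_name, label): degrees_celsius} for every temp*_input found."""
    temps = {}
    base = Path(root)
    if not base.is_dir():
        return temps  # no hwmon tree visible here
    for chip in sorted(base.glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.is_file() else chip.name
        for inp in sorted(chip.glob("temp*_input")):
            # temp1_input -> temp1_label; fall back to "temp1" if there is no label
            label_file = inp.with_name(inp.name.replace("_input", "_label"))
            if label_file.is_file():
                label = label_file.read_text().strip()
            else:
                label = inp.name[: -len("_input")]
            try:
                millideg = int(inp.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now; skip it
            temps[(chip_name, label)] = millideg / 1000.0
    return temps
```

Calling `read_hwmon_temps()` in a loop with a short sleep would give a simple temperature log to compare against the moment a lockup happens.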

RAM (ECC, hard to test, unlikely cause)

The RAM is DDR4 ECC running with a Zen 3 CPU (5800X). Doing a memtest is a tough job – the memory controller should correct memory errors, but it doesn’t tell the software that it has done so.

Corrected memory faults shouldn’t cause any issue.
Detected memory faults should cause just a crash, not a lockup.
Non-corrected (or improperly corrected) memory faults (as if the memory wasn’t ECC) tend to have more varying symptoms. (I’ve experienced this with an older computer with non-ECC DDR3.)
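One thing that might be worth checking: where a Linux EDAC driver is loaded, the kernel exposes per-controller corrected and uncorrected error counters in sysfs (`ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mc*/`). Whether these are visible in dom0 under Xen on this board is an assumption I have not verified. The helper name `read_edac_counts` is made up for this sketch.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce": int|None, "ue": int|None}, ...} from the EDAC sysfs tree."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # EDAC is not exposed: no driver loaded, or hidden from this domain.
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / filename).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts
```

An empty result means the counters are simply not available, which says nothing about the RAM itself. A non-zero `ce` would hint at corrected errors; a non-zero `ue` at uncorrected ones.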

Swap (not sure)

I originally thought that it couldn’t be related, as a typical OOM doesn’t look like that. However, it depends on which process hits the memory limit. If a user space process hits the memory limit, it gets killed. However, if the kernel (or a kernel module) hits the limit, the situation might be worse.

Also, I have asked ChatGPT and Bard to analyse the stack trace. Bard has highlighted the function ttm_bo_init_validate, which is related to memory allocation. (I’ve independently checked the function documentation.)

Maybe the missing swap wasn’t the reason, but I’ve added swap just to be sure.
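To confirm that the added swap is actually active (and to see how much of it is in use around the time of a problem), the `SwapTotal` and `SwapFree` fields of `/proc/meminfo` can be parsed. This is just a small sketch; the function name `swap_summary` is made up for illustration.

```python
def swap_summary(meminfo_text):
    """Parse /proc/meminfo content and return swap totals in KiB."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return {"total_kib": total, "free_kib": free, "used_kib": total - free}
```

Typical use would be `swap_summary(open("/proc/meminfo").read())`; a `total_kib` of 0 means no swap is enabled.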

GPU (maybe)

Well, I was recently experiencing a few issues that point to the GPU, most notably a few bad glyphs in dom0 (i3wm and XFCE Terminal). Not sure if the GPU renders the fonts, but it sounds plausible.

My previous issues (briefly)

This might be unrelated, but I am adding it for context:

Before the lockups, I was getting random reboots. They seem to have been of two categories:

Sudden reboot without any warning. Nothing in the system log. They probably increased the unsafe shutdown count on the SSD. Those were happening even after a MoBo replacement and even in BIOS. A different PSU didn’t help. They have disappeared* after replacing the SSD. It looks like the old SSD has a cold joint causing temporary disconnects.
Sudden rapid slowdown, Wi-Fi disconnected, log messages dom0 kernel: xen-blkback: Scheduled work from previous purge is still busy, cannot purge list (they were stored => the SSD wasn’t disconnected completely) and then a reboot (probably by a watchdog). Looks much like sudden PCIe throttling, but with no thermal cause. This issue persisted even after the SSD replacement. After removal of the Wi-Fi card, these reboots haven’t reappeared. (Due to the nature of randomly occurring issues, it is too early to say that they have disappeared.) This is a bit WTF, as the card was connected to the NetVM, which quite limits the potential impact. It might have been EMI, a short circuit or some other low-level issue like some mess with rapid interrupts.

I am not 100% sure that these categories are distinct, i.e. I am not 100% sure that there hasn’t been any reboot with mixed symptoms.
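To help tell the categories apart after the fact, a saved kernel log can be scanned for the two markers. The first string is the xen-blkback message quoted above; the second, `BUG: soft lockup`, is the text the upstream kernel watchdog prints for soft lockups (not copied from my own logs). The names `MARKERS` and `scan_log` are made up for this sketch.

```python
MARKERS = {
    # Message quoted in this post for the second reboot category.
    "blkback_purge": "xen-blkback: Scheduled work from previous purge is still busy",
    # Generic kernel watchdog text for soft lockups (assumption: same wording here).
    "soft_lockup": "BUG: soft lockup",
}


def scan_log(lines):
    """Return {marker_name: [matching lines]} for an iterable of log lines."""
    hits = {name: [] for name in MARKERS}
    for line in lines:
        for name, needle in MARKERS.items():
            if needle in line:
                hits[name].append(line.rstrip("\n"))
    return hits
```

For example, after saving the previous boot's kernel messages with `journalctl -k -b -1 > kernel.log`, calling `scan_log(open("kernel.log"))` shows which markers appeared before the reboot. A reboot with no hits at all would fit the first category (nothing logged).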

After upgrading the SSD, I’ve also upgraded Qubes to 4.2 RC4 in order to make SSD firmware updates easier (although it didn’t help). I see, I am juggling too many balls. However, the second type of reboot was occurring even in 4.2 RC4 before I removed the Wi-Fi card.

PSU (not sure)

If the issues persist, I can try a different PSU for some time. Not sure if there is anything to do at the moment.

*) Well, there was one occurrence of a similar issue when no SSD was connected, but it could be attributed to a specific situation that caused the power cable to be half-inserted. So I don’t count this single case.