Hard crashes with kernel-latest (5.8.16)

Hardware:
ASUS ROG Strix B460-F
Corsair Vengeance LPX DDR4 3600MHz 64GB
ASUS GeForce RTX 2060 Super Dual Evo OC2
Intel Core i7 10700K 3.8 GHz

Qubes version: 4.1

The only reason I’m on Qubes 4.1 is that I’m not even able to get the stable release’s installer running on this hardware.
The 4.1-alpha images work fine if I use the integrated GPU of the CPU.
However, the RTX 2060 doesn’t work (it gives only weird colored lines on the screen) unless I run kernel-latest (the 5.8 series).
This ran fine until a few days ago, when I noticed what was probably the second kernel update since my last reboot, so I rebooted to pick up the new kernel.

After this my system crashes, hard, all the time. It can happen during boot, 10 minutes into using the computer, or even a couple of hours in. Eventually, though, the system will crash, exactly as if someone had pulled out the power cord.
Because of errors on the filesystem I assumed the disk was toast, but it turns out that even after a reinstall everything works fine until I install kernel-latest.

So the question becomes: has anyone else seen this kind of behaviour?
What would be the recommended approach going forward?
I’ve considered ditching the RTX 2060 and going for an AMD card, for example a Vega 64, but I’m not 100% sure in which kernel version those start to work properly (the current kernel version on 4.1 is 5.4.83).
Maybe someone has other suggestions?

Yeah, for me the 5.9.x kernels have also been highly unstable, crashing within 2 hours of boot at best. I have no idea what causes it, because the system crashes so badly that no logs I can find get written.
I highly doubt it’s the NVIDIA graphics card, though; 5.8.16 is very stable for me. But you could try disabling the NVIDIA GPU for now if you want to test this.

Your thread title is inaccurate, though: the most recent two builds of kernel-latest are 5.9.12-1 and 5.9.14-1, and if I understand you right, those are the ones that crash for you as well. I’d suggest just not installing those for now, and uninstalling them if you already have.
As an aside, my system is an X370 Prime Pro paired with a Ryzen 2700X.
Vega cards will work fine with kernels like 5.4.x, but that isn’t the issue.

Second question: if you run 4.1 with an up-to-date Xen and kernel 5.8.16, is your system stable?

Yeah, it’s not related to the RTX 2060, as the crashes happen with or without the card installed.
Though without kernel-latest the GPU just shows some funky colored lines on the screen and is thus unusable.

Regarding the kernel version, it seems I pulled it from the 4.1 stable repo:
Installing:
kernel-latest x86_64 1000:5.8.16-1.qubes qubes-dom0-current 62 M
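For anyone checking the same thing: dom0 is Fedora-based, so dnf can tell you which repo an installed package came from (a sketch; run it in a dom0 terminal).

```shell
# The "From repo" field in the output names the repository
# the installed package was pulled from:
dnf info kernel-latest
```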

Not installing the above-mentioned kernel indeed gives me a solid and stable system; uname -a in dom0 shows I’m running 5.4.83-1.qubes.x86_64.
However, I’m stuck using the integrated graphics of my 10700K while my RTX 2060 just collects dust in the case, and that’s not why I bought it :slight_smile:
So the idea of a GPU change comes from the fact that the 5.4.83-1 kernel I’m on now works very well, except that no external GPU works.
Switching to, for example, an AMD Vega 56 (or something else supported) would give me just that:
GPU power for Qubes, while also avoiding having to move the monitor cable from the motherboard to the GPU when switching to Windows to game, for example.

As to your last question: as mentioned above, the combination of 4.1 and 5.8.16 is what causes issues for me.
The RTX 2060 doesn’t work, and the system crashes anywhere from while I’m typing the LUKS decryption passphrase to a couple of hours of uptime.

I’ll do a full backup of the system, then try to pull kernel-latest from the testing repo (I assume that’s where you got the 5.9.x kernels from).
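In case it helps anyone following along, the usual way to do this from a dom0 terminal is something like the below (a sketch, assuming the standard Qubes 4.1 repo names):

```shell
# Pull kernel-latest from the current-testing repo in dom0:
sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing kernel-latest

# To get rid of it again, remove the package and reboot
# into the stable kernel:
sudo dnf remove kernel-latest
```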

Ah, okay. I’d assumed you were running with current-testing enabled, but I see 5.8.16 was moved to the stable repo recently. My own problems are only with the 5.9 kernels, which are indeed found in current-testing.

It took just about 2 hours and I crashed on kernel-latest from the testing repo (5.9.14).
So something is very wrong here…

Guess I just need to stick to the integrated GPU in the 10700K CPU for now.
Makes me worried for the future though…

I will try to let a Fedora live system with a 5.9 kernel run overnight to see if the same happens there.

Just to report back: I’ve now installed FC33 onto a USB drive and updated the entire system, including the 5.9 kernel.
Not a hiccup of any sort; it’s been stable since yesterday. The same kernel on Qubes = crash over and over.

So this is above my skill level to figure out, but if someone more senior has some tests or suggestions of things to check, I’m more than willing to do whatever it takes to figure this one out.

Hi @helge,

How did your system crash? With a frozen display, or with a black screen?

If it’s a frozen display, you can see the last kernel messages with the procedure below.

Open a dom0 terminal, run the sudo dmesg -Tw command, and keep this window always on top (window bar, right click, Always On Top).

When it crashes, watch this terminal window for any special output (like a driver crash stack trace or an error message, …).
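Since a power-cut-style crash can take the in-memory log with it, a variant of this (a rough sketch; the function name and log path are just examples) is to mirror the stream to a file and flush after every line:

```shell
# Append each input line to the file given as $1 and force it to disk
# immediately, so the final messages survive a sudden power-off.
log_lines() {
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$1"
        sync    # flush the appended line out to disk
    done
}

# In a dom0 terminal you would then run:
#   sudo dmesg -Tw | log_lines /home/user/dmesg-crash.log
```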

Another idea: test your RAM with a dedicated live USB (e.g. SystemRescue) using memtester (in the boot loader menu).

@ludovic it crashes hard, just as if you pulled the power cord out. Maybe 1 in 10 times I get a 1–2 second freeze before the machine shuts down and boots back up.

The RAM seems OK; if I stick to the stable repo’s 5.4 kernel I have no issues whatsoever, though then the RTX 2060 isn’t usable either.
I had the RTX 2060 working with kernel-latest until last week.
Something has changed in the recent few dom0 software updates (the kernels, more than likely) to cause this.

I’ll install Qubes onto a temporary disk tomorrow and try to see if I can manage to catch anything with your suggestion.
Usually it’s just “click” and the machine is off.
That usually means hardware, but again, on the 5.4 kernel I have absolutely zero issues, and the same goes for the Windows 10 I dual-boot into for gaming sometimes.

Other ideas:

  1. Monitor the temperatures of the motherboard and CPU. A kernel version can include a regression in the temperature-management subsystem.
  2. Post-mortem analysis: see the -b -1 option of the journalctl command for retrieving the previous boot’s messages (see man journalctl): sudo journalctl -b -1

Also keep open the terminal I suggested; it could help you see/understand the problem just before the hard crash…
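For point 2, a typical post-mortem session looks like this (a sketch; note that after a power-cut-style crash the last moments may be missing, and dom0’s journal must be persistent for -b -1 to return anything):

```shell
journalctl --list-boots        # enumerate the boots the journal knows about
sudo journalctl -b -1 -p err   # only error-priority messages from the previous boot
sudo journalctl -b -1 -k       # kernel messages from the previous boot
```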

@ludovic
Well well, it seems I’m getting somewhere with this issue.
First of all, it’s very hard to pinpoint exactly what the issue is, but I see one common factor:
whenever I boot the PC and the GPU fans aren’t spinning, the crashes WILL happen.
So this points to something overheating and a “self-protect” mechanism kicking in and shutting down the PC.
I never saw the GPU temperature above 70°C, which isn’t too much, but of course the heat from the GPU could overheat other components.

The Intel B460 chipset on the motherboard doesn’t seem to have a lot of support in the kernel currently, so I’m not able to get the CPU temperature out of lm_sensors.
It could be that when the GPU fan isn’t spinning, the CPU fan speed doesn’t adjust to the CPU heat either.
There could be other factors too; the only temperatures I get from lm_sensors (even after sensors-detect) are the GPU, the NVMe disk, and acpitz-acpi-0 (not sure what this refers to).

It could of course also be that sensors is picking up a different temperature sensor on the GPU, and that with the fans not spinning another part of the GPU overheats and shuts down the system.

Anyhow, whenever I boot up I check whether the fans are spinning; if they aren’t, I shut down and boot again.
Whenever “sensors” shows a fan speed, the system stays stable.
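That post-boot check can be scripted; here is a minimal sketch, assuming lm_sensors reports the fans as “fanN:” lines (the labels and layout will differ per board, so the pattern is a hypothetical example):

```shell
# Warn when any fan reported by `sensors` is at 0 RPM.
check_fans() {
    awk '/^fan[0-9]+:/ {
        if ($2 + 0 == 0)
            print "WARNING: " $1 " reports 0 RPM - consider rebooting"
        else
            print $1 " OK at " $2 " RPM"
    }'
}

# In dom0 you would run:  sensors | check_fans
```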

Often the BIOS settings allow different fan profiles (performance, silent, ultra-quiet, …); that’s a possibility for you…

For your NVMe SSD temperature, the smartctl command reports a lot of information (including the temperature); see the man page.
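For example (a sketch, assuming the drive shows up as /dev/nvme0; check with lsblk first):

```shell
# Print the NVMe SMART/health attributes and pick out the temperature lines:
sudo smartctl -A /dev/nvme0 | grep -i temperature
```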

If your motherboard includes an internal GPU, remove your PCI GPU (the GeForce) until you find the origin of the computer crashes.

I’m aware, but something tells me software is overriding it somehow, since on 5.8.16 the GPU fans sometimes don’t spin up no matter the GPU temperature.
Whenever the GPU fans are not spinning, the crash happens, a lot. Whether it’s the GPU causing the failure or something else that also isn’t working is unclear.
The problem is that lm-sensors doesn’t give me a CPU temperature, and I’m not even sure the GPU temperature it reports is the correct one.

To add to the confusion, the crashes happen even if the GPU is not installed, which tells me more than just the GPU fans are having an issue.
Everything works 100% on every boot with the 5.4.x kernel, but on 5.8.16 one in four boots leaves me with the GPU fans not spinning (regardless of temperature).
On those boots the system crashes a lot; otherwise it works perfectly fine (as in, when the fans spin right from boot-up).

For now I’ll leave it as is, just checking “sensors” to see that the fans are spinning after boot; then all is good and stable :slight_smile:
I’ll pursue the matter again if this continues to be an issue with the next kernel upgrades.

OK, just thought I’d update the thread on my issue in case someone else runs into a similar situation.
I wouldn’t be surprised if you end up here if you’re running a newer NVIDIA GPU with an Intel B460 chipset motherboard.
I see several posts around about poor support for the 460 chipset altogether (in the Linux kernel), not to mention the Xen kernel and newer NVIDIA GPUs.

Anyhow, my procedure, which is quite a laugh, is as follows…
Until I rebooted today I had 12 days of uptime, so I know I can get the system stable if I just do these steps.
And here comes the crazy/laughable part: the GPU needs to be primed temperature-wise.
The fan speed will not adjust at all after boot; in other words, the GPU sets the fan speed based on the temperature at boot time, and there it stays stuck.

So what I need to do is boot up with fan speed = 0 RPM, wait until the GPU temperature gets up to about 55–58°C, then shut down and immediately boot up again.
The fan speed will then be set to about ~900 RPM, which isn’t too noisy and keeps the temperature around 35–40°C depending on what I do.
If I do the same procedure but let the GPU temperature get up to 70°C before I reboot, the fan speed will be around 2000 RPM, which is quite noisy, and as mentioned above, the fan speed will stay there even when the GPU temperature is down at 30°C.

Hopefully this gets resolved in an update one day, because rebooting sure is a bit of a pain :stuck_out_tongue: