opened 04:47PM - 16 Aug 24 UTC
C: other
P: major
hardware support
needs diagnosis
affects-4.2
### Qubes OS release
4.2.2
### Brief summary
GPU hangs occur at seemingly random, rare moments (e.g. once every 2 months) and can only be recovered from with a hard reset.
The HCL report for the affected hardware follows, with one exception (explained after the report):
```
---
layout:
  'hcl'
type:
  'Notebook'
hvm:
  'yes'
iommu:
  'yes'
slat:
  'yes'
tpm:
  '2.0'
remap:
  'yes'
brand: |
  LENOVO
model: |
  20L6SBJW00
bios: |
  N24ET60W (1.35 )
cpu: |
  Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
cpu-short: |
  FIXME
chipset: |
  Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [8086:5914] (rev 08)
chipset-short: |
  FIXME
gpu: |
  Intel Corporation UHD Graphics 620 [8086:5917] (rev 07) (prog-if 00 [VGA controller])
gpu-short: |
  FIXME
network: |
  Intel Corporation Ethernet Connection (4) I219-LM [8086:15d7] (rev 21)
  Intel Corporation Wireless 8265 / 8275 [8086:24fd] (rev 78)
memory: |
  65406
scsi: |
usb: |
  2
certified:
  'no'
versions:
  - works:
      'FIXME:yes|no|partial'
    qubes: |
      R4.2.2
    xen: |
      4.17.4
    kernel: |
      6.6.42-1
    remark: |
      FIXME
    credit: |
      FIXAUTHOR
    link: |
      FIXLINK
```
The exception is the kernel version: the hang happened under 6.6.36, not 6.6.42, as the logs show:
```
Aug 11 17:54:17 dom0 kernel: Linux version 6.6.36-1.qubes.fc37.x86_64 (mockbuild@01d867aa44b046b59b72b56f0f81e904) (gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1), GNU ld version 2.38-27.fc37) #1 SMP PREEMPT_DYNAMIC Tue Jul 2 03:51:16 GMT 2024
```
The logs below cover the period from when I closed the laptop lid to when I came back and noticed the unresponsiveness. They contain a lot of noise, which I kept as evidence that I couldn't see anything else suspicious on my end.
(Note the lines containing "i915"; more on them later.)
```
[user@dom0 ~]$ sudo journalctl --since="2024-08-12 12:39:00" --until="2024-08-12 13:19:00" --no-pager
Aug 12 12:39:21 dom0 systemd-logind[1758]: Lid closed.
Aug 12 12:39:32 dom0 xscreensaver-auth[14684]: PAM unable to dlopen(/usr/lib64/security/pam_sss.so): /usr/lib64/security/pam_sss.so: cannot open shared object file: No such file or directory
Aug 12 12:39:32 dom0 xscreensaver-auth[14684]: PAM adding faulty module: /usr/lib64/security/pam_sss.so
Aug 12 12:40:30 dom0 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Aug 12 12:40:30 dom0 kernel: i915 0000:00:02.0: [drm] Xorg[5133] context reset due to GPU hang
Aug 12 12:40:30 dom0 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffffb, in Xorg [5133]
Aug 12 13:00:01 dom0 CROND[14739]: (root) CMD (/usr/bin/qvm-sync-clock > /dev/null 2>&1 || true)
Aug 12 13:00:02 dom0 audit[14743]: USYS_CONFIG pid=14743 uid=0 auid=4294967295 ses=4294967295 msg='op=change-system-time exe="/usr/sbin/hwclock" hostname=? addr=? terminal=? res=success'
Aug 12 13:00:02 dom0 kernel: audit: type=1111 audit(1723460402.499:615): pid=14743 uid=0 auid=4294967295 ses=4294967295 msg='op=change-system-time exe="/usr/sbin/hwclock" hostname=? addr=? terminal=? res=success'
Aug 12 13:00:02 dom0 CROND[14738]: (root) CMDEND (/usr/bin/qvm-sync-clock > /dev/null 2>&1 || true)
Aug 12 13:01:01 dom0 CROND[14747]: (root) CMD (run-parts /etc/cron.hourly)
Aug 12 13:01:01 dom0 run-parts[14750]: (/etc/cron.hourly) starting 0anacron
Aug 12 13:01:01 dom0 run-parts[14756]: (/etc/cron.hourly) finished 0anacron
Aug 12 13:01:01 dom0 CROND[14746]: (root) CMDEND (run-parts /etc/cron.hourly)
Aug 12 13:04:50 dom0 qrexec-policy-daemon[2809]: qrexec: qubes.GetDate+nanoseconds: social-media -> @default: allowed to dom0
Aug 12 13:04:50 dom0 audit: BPF prog-id=101 op=LOAD
Aug 12 13:04:50 dom0 kernel: audit: type=1334 audit(1723460690.049:616): prog-id=101 op=LOAD
Aug 12 13:04:50 dom0 kernel: audit: type=1334 audit(1723460690.049:617): prog-id=102 op=LOAD
Aug 12 13:04:50 dom0 kernel: audit: type=1334 audit(1723460690.049:618): prog-id=103 op=LOAD
Aug 12 13:04:50 dom0 audit: BPF prog-id=102 op=LOAD
Aug 12 13:04:50 dom0 audit: BPF prog-id=103 op=LOAD
Aug 12 13:04:50 dom0 systemd[1]: Starting systemd-hostnamed.service - Hostname Service...
Aug 12 13:04:50 dom0 systemd[1]: Started systemd-hostnamed.service - Hostname Service.
Aug 12 13:04:50 dom0 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 12 13:04:50 dom0 kernel: audit: type=1130 audit(1723460690.150:619): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 12 13:05:20 dom0 systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Aug 12 13:05:20 dom0 kernel: audit: type=1131 audit(1723460720.186:620): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 12 13:05:20 dom0 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 12 13:05:20 dom0 audit: BPF prog-id=103 op=UNLOAD
Aug 12 13:05:20 dom0 kernel: audit: type=1334 audit(1723460720.231:621): prog-id=103 op=UNLOAD
Aug 12 13:05:20 dom0 kernel: audit: type=1334 audit(1723460720.231:622): prog-id=102 op=UNLOAD
Aug 12 13:05:20 dom0 kernel: audit: type=1334 audit(1723460720.231:623): prog-id=101 op=UNLOAD
Aug 12 13:05:20 dom0 audit: BPF prog-id=102 op=UNLOAD
Aug 12 13:05:20 dom0 audit: BPF prog-id=101 op=UNLOAD
Aug 12 13:18:16 dom0 systemd-logind[1758]: Lid opened.
```
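To make the relevant lines easier to spot among the noise, here is a minimal Python sketch that filters journal text for i915 "GPU HANG" messages and splits out their fields. The function name `find_gpu_hangs` and the field labels are illustrative; in particular, reading the three `ecode` parts as graphics generation, engine mask, and error code is an assumption about the driver's message format, not something confirmed in this report.

```python
import re

# Matches e.g. "GPU HANG: ecode 9:1:85dffffb, in Xorg [5133]".
# Field meanings (generation : engine mask : error code) are an assumption.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engines>[0-9a-f]+):(?P<ecode>[0-9a-f]{8}), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\]"
)

def find_gpu_hangs(journal_text):
    """Return one dict per i915 GPU HANG line found in journalctl output."""
    hangs = []
    for line in journal_text.splitlines():
        if "i915" not in line:
            continue
        match = HANG_RE.search(line)
        if match is None:
            continue
        hangs.append({
            # First three whitespace-separated tokens: "Aug 12 12:40:30"
            "timestamp": " ".join(line.split()[:3]),
            "gen": int(match.group("gen")),
            "engine_mask": int(match.group("engines"), 16),
            "ecode": match.group("ecode"),
            "comm": match.group("comm"),
            "pid": int(match.group("pid")),
        })
    return hangs

# The three i915 lines from the log above.
SAMPLE = """\
Aug 12 12:40:30 dom0 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Aug 12 12:40:30 dom0 kernel: i915 0000:00:02.0: [drm] Xorg[5133] context reset due to GPU hang
Aug 12 12:40:30 dom0 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffffb, in Xorg [5133]
"""

if __name__ == "__main__":
    for hang in find_gpu_hangs(SAMPLE):
        print(hang)
```

The same function could be fed the full output of `journalctl --no-pager` to check whether earlier hangs went unnoticed.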
After the GPU hang, the Alt+SysRq+K sequence did nothing. I tried it multiple times and confirmed that the hotkey was enabled:
```
[user@dom0 ~]$ sysctl kernel.sysrq
kernel.sysrq = 4
```
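For reference, `kernel.sysrq` is a bitmask, and the value 4 enables keyboard control, which covers SAK (the K key). The sketch below decodes a value using the bit meanings from the kernel's SysRq documentation; the helper names `decode_sysrq` and `sak_enabled` are illustrative only.

```python
# Bit meanings as listed in the Linux kernel's SysRq documentation.
SYSRQ_BITS = {
    2: "console logging level control",
    4: "keyboard control (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}

def decode_sysrq(value):
    """List the SysRq function groups enabled by a kernel.sysrq value."""
    if value == 0:
        return ["SysRq disabled"]
    if value == 1:
        return ["all SysRq functions enabled"]
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]

def sak_enabled(value):
    """True if Alt+SysRq+K (SAK) is permitted by this kernel.sysrq value."""
    return value == 1 or bool(value & 4)

if __name__ == "__main__":
    print(decode_sysrq(4))
    print(sak_enabled(4))
```

With `kernel.sysrq = 4`, only the keyboard-control group is enabled, so SAK should have been accepted, while other sequences such as sync or reboot would not be.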
So the problem might lie deeper than a faulty X11 configuration/installation, as I initially thought. For reference, my configuration is:
```
[user@dom0 ~]$ cat /etc/X11/xorg.conf.d/20-gfx.conf
Section "device"
Identifier "intel"
Driver "modesetting"
Option "AccelMethod" "glamor"
EndSection
```
There's the (closed) issue https://github.com/QubesOS/qubes-issues/issues/7813, whose logs contain the same "i915" lines, but that case differs from mine: I have no visual artifacts, only unresponsiveness. That issue links to the comment https://github.com/QubesOS/qubes-issues/issues/7785#issuecomment-1254095362, which describes my case more precisely.
However, the issue https://github.com/QubesOS/qubes-issues/issues/7785 itself was about Qubes OS 4.1 with 5.15/5.18 kernels, and it was closed due to a bug about Xorg pages, as the comment at https://github.com/QubesOS/qubes-issues/issues/7785#issuecomment-1320171556 says. Since neither characteristic matches my case, I decided that opening a new ticket was wiser than asking to reopen the linked one and change its title.
This might be related to https://github.com/QubesOS/qubes-issues/issues/7902, but that ticket is also about Qubes OS 4.1 (where I don't recall having any of these GPU hangs) and about the i3 window manager. If it's more appropriate to raise this in that ticket, please close this one and let me know that I should paste my report there.
This might also be related to the fixes described in https://wiki.archlinux.org/index.php?title=Intel_graphics&oldid=814542#Crash/freeze_on_low_power_Intel_CPUs, but it is hard to tell whether any of them work, considering how rare and random the GPU hangs are (e.g. once every 3 months).
I could provide more information, such as a kernel dump/backtrace, the next time this hang happens, but I'd appreciate guidance on how to prepare for it (should I just use kdump, or set up something else beforehand?) and on where exactly to find the relevant information without verbose noise obscuring it.
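One possible preparation, sketched here under assumptions: the i915 driver can expose a GPU error state at `/sys/class/drm/card0/error`, and since the journal kept logging after the hang, a script run from cron or a systemd timer in dom0 could copy that state to disk before the hard reset. Whether the Qubes dom0 kernel has error capture enabled, and whether the card is `card0`, is unverified; the function name `save_gpu_error_state` and the output directory `/var/log/gpu-hangs` are hypothetical.

```python
import time
from pathlib import Path

def save_gpu_error_state(error_path="/sys/class/drm/card0/error",
                         out_dir="/var/log/gpu-hangs"):
    """Copy the i915 error state to out_dir; return the new path or None.

    Both default paths are assumptions: the sysfs node may be absent or
    live under a different card number, and out_dir is a hypothetical choice.
    """
    src = Path(error_path)
    if not src.exists():
        return None
    data = src.read_bytes()
    # With no captured hang the driver is expected to report a short
    # placeholder message instead of a dump; skip that case.
    if not data.strip() or data.startswith(b"No error state collected"):
        return None
    dest_dir = Path(out_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / time.strftime("i915-error-%Y%m%d-%H%M%S.txt")
    dest.write_bytes(data)
    return dest

if __name__ == "__main__":
    saved = save_gpu_error_state()
    print(saved if saved else "no GPU error state to save")
```

This would have to run as root and does not replace kdump; it only preserves the driver's own hang report, if one exists.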
### Steps to reproduce
Unknown at the time of writing. Trying to list any would be fortune telling, providing unrelated noise at best and misleading information at worst. The "Brief summary" section is more appropriate in this case.
### Expected behavior
The system works without random GPU hangs and without forcing the user to perform a hard reset.
### Actual behavior
GPU hangs occur at seemingly random, rare moments (e.g. once every 2 months) and can only be recovered from with a hard reset.