I have an i5-1245U (the laptop was used). I've swapped the heatpipe/fan module for the GPU version and replaced the thermal paste. I've also changed the keyboard from Nordic to international and replaced the NVMe drive with a FireCuda 530. Those are all the modifications.
As for the system, the default xorg.conf used the i915 driver, and playback was choppy in any fullscreen mode, with around 50-60% CPU utilization and a 500-1000 ms frozen frame every 3-4 s.
After changing xorg.conf to:
Section "Module"
Load "glamoregl"
EndSection
Section "Device"
Identifier "Intel Graphics"
Driver "modesetting"
EndSection
It's much better, but I get a hard crash of dom0 once a day.
It's the fault of dri = iris, since that's forced by the modesetting driver, but I've tested it with i915 and the result is the same.
Any other options in xorg.conf are either deprecated or have become modesetting defaults, so there's no need to include them.
opened 04:47PM - 16 Aug 24 UTC
C: other
P: major
hardware support
needs diagnosis
affects-4.2
## Qubes OS release
4.2.2
### Brief summary
Random GPU hangs during seemingly random and rare times, e.g. once per 2 months, worked-around only with hard resets.
The HCL report of the hardware, on which these anomalies happen is the following, with one exception (more on this later):
```
---
layout:
  'hcl'
type:
  'Notebook'
hvm:
  'yes'
iommu:
  'yes'
slat:
  'yes'
tpm:
  '2.0'
remap:
  'yes'
brand: |
  LENOVO
model: |
  20L6SBJW00
bios: |
  N24ET60W (1.35 )
cpu: |
  Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
cpu-short: |
  FIXME
chipset: |
  Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [8086:5914] (rev 08)
chipset-short: |
  FIXME
gpu: |
  Intel Corporation UHD Graphics 620 [8086:5917] (rev 07) (prog-if 00 [VGA controller])
gpu-short: |
  FIXME
network: |
  Intel Corporation Ethernet Connection (4) I219-LM [8086:15d7] (rev 21)
  Intel Corporation Wireless 8265 / 8275 [8086:24fd] (rev 78)
memory: |
  65406
scsi: |
usb: |
  2
certified:
  'no'
versions:
  - works:
      'FIXME:yes|no|partial'
    qubes: |
      R4.2.2
    xen: |
      4.17.4
    kernel: |
      6.6.42-1
    remark: |
      FIXME
    credit: |
      FIXAUTHOR
    link: |
      FIXLINK
```
The exception is that the kernel used where the hang happened was actually 6.6.36, as the logs say:
```
Aug 11 17:54:17 dom0 kernel: Linux version 6.6.36-1.qubes.fc37.x86_64 (mockbuild@01d867aa44b046b59b72b56f0f81e904) (gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1), GNU ld version 2.38-27.fc37) #1 SMP PREEMPT_DYNAMIC Tue Jul 2 03:51:16 GMT 2024
```
The logs from the moment I closed the lid of my laptop to the moment I came back and noticed the unresponsiveness (they contain a lot of noise, which I preserved as evidence that I couldn't see anything suspicious on my end):
(notice the lines with "i915" - more on them later)
```
[user@dom0 ~]$ sudo journalctl --since="2024-08-12 12:39:00" --until="2024-08-12 13:19:00" --no-pager
Aug 12 12:39:21 dom0 systemd-logind[1758]: Lid closed.
Aug 12 12:39:32 dom0 xscreensaver-auth[14684]: PAM unable to dlopen(/usr/lib64/security/pam_sss.so): /usr/lib64/security/pam_sss.so: cannot open shared object file: No such file or directory
Aug 12 12:39:32 dom0 xscreensaver-auth[14684]: PAM adding faulty module: /usr/lib64/security/pam_sss.so
Aug 12 12:40:30 dom0 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Aug 12 12:40:30 dom0 kernel: i915 0000:00:02.0: [drm] Xorg[5133] context reset due to GPU hang
Aug 12 12:40:30 dom0 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffffb, in Xorg [5133]
Aug 12 13:00:01 dom0 CROND[14739]: (root) CMD (/usr/bin/qvm-sync-clock > /dev/null 2>&1 || true)
Aug 12 13:00:02 dom0 audit[14743]: USYS_CONFIG pid=14743 uid=0 auid=4294967295 ses=4294967295 msg='op=change-system-time exe="/usr/sbin/hwclock" hostname=? addr=? terminal=? res=success'
Aug 12 13:00:02 dom0 kernel: audit: type=1111 audit(1723460402.499:615): pid=14743 uid=0 auid=4294967295 ses=4294967295 msg='op=change-system-time exe="/usr/sbin/hwclock" hostname=? addr=? terminal=? res=success'
Aug 12 13:00:02 dom0 CROND[14738]: (root) CMDEND (/usr/bin/qvm-sync-clock > /dev/null 2>&1 || true)
Aug 12 13:01:01 dom0 CROND[14747]: (root) CMD (run-parts /etc/cron.hourly)
Aug 12 13:01:01 dom0 run-parts[14750]: (/etc/cron.hourly) starting 0anacron
Aug 12 13:01:01 dom0 run-parts[14756]: (/etc/cron.hourly) finished 0anacron
Aug 12 13:01:01 dom0 CROND[14746]: (root) CMDEND (run-parts /etc/cron.hourly)
Aug 12 13:04:50 dom0 qrexec-policy-daemon[2809]: qrexec: qubes.GetDate+nanoseconds: social-media -> @default: allowed to dom0
Aug 12 13:04:50 dom0 audit: BPF prog-id=101 op=LOAD
Aug 12 13:04:50 dom0 kernel: audit: type=1334 audit(1723460690.049:616): prog-id=101 op=LOAD
Aug 12 13:04:50 dom0 kernel: audit: type=1334 audit(1723460690.049:617): prog-id=102 op=LOAD
Aug 12 13:04:50 dom0 kernel: audit: type=1334 audit(1723460690.049:618): prog-id=103 op=LOAD
Aug 12 13:04:50 dom0 audit: BPF prog-id=102 op=LOAD
Aug 12 13:04:50 dom0 audit: BPF prog-id=103 op=LOAD
Aug 12 13:04:50 dom0 systemd[1]: Starting systemd-hostnamed.service - Hostname Service...
Aug 12 13:04:50 dom0 systemd[1]: Started systemd-hostnamed.service - Hostname Service.
Aug 12 13:04:50 dom0 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 12 13:04:50 dom0 kernel: audit: type=1130 audit(1723460690.150:619): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 12 13:05:20 dom0 systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Aug 12 13:05:20 dom0 kernel: audit: type=1131 audit(1723460720.186:620): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 12 13:05:20 dom0 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 12 13:05:20 dom0 audit: BPF prog-id=103 op=UNLOAD
Aug 12 13:05:20 dom0 kernel: audit: type=1334 audit(1723460720.231:621): prog-id=103 op=UNLOAD
Aug 12 13:05:20 dom0 kernel: audit: type=1334 audit(1723460720.231:622): prog-id=102 op=UNLOAD
Aug 12 13:05:20 dom0 kernel: audit: type=1334 audit(1723460720.231:623): prog-id=101 op=UNLOAD
Aug 12 13:05:20 dom0 audit: BPF prog-id=102 op=UNLOAD
Aug 12 13:05:20 dom0 audit: BPF prog-id=101 op=UNLOAD
Aug 12 13:18:16 dom0 systemd-logind[1758]: Lid opened.
```
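Since only three lines in that journal excerpt come from i915, it can help to filter them out of the noise mechanically. Below is a minimal Python sketch that does so; the function names `i915_lines` and `parse_hangs` are made up for illustration, and the regular expression assumes the exact "GPU HANG" line format seen in the excerpt above, which may differ on other kernel versions.

```python
import re

# Pattern for the i915 "GPU HANG" line format seen in the journal excerpt.
# Assumption: other kernels may word this line differently.
HANG_RE = re.compile(
    r"i915 (?P<pci>[0-9a-f:.]+): \[drm\] GPU HANG: "
    r"ecode (?P<ecode>[0-9a-f:]+), in (?P<proc>\S+) \[(?P<pid>\d+)\]"
)


def i915_lines(journal_text):
    """Return only the journal lines that mention the i915 driver."""
    return [line for line in journal_text.splitlines() if " i915 " in line]


def parse_hangs(journal_text):
    """Extract (PCI address, ecode, process name, pid) from GPU HANG lines."""
    hangs = []
    for line in journal_text.splitlines():
        m = HANG_RE.search(line)
        if m:
            hangs.append((m["pci"], m["ecode"], m["proc"], int(m["pid"])))
    return hangs


# A few lines copied from the journal excerpt above.
SAMPLE = """\
Aug 12 12:39:21 dom0 systemd-logind[1758]: Lid closed.
Aug 12 12:40:30 dom0 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Aug 12 12:40:30 dom0 kernel: i915 0000:00:02.0: [drm] Xorg[5133] context reset due to GPU hang
Aug 12 12:40:30 dom0 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffffb, in Xorg [5133]
Aug 12 13:18:16 dom0 systemd-logind[1758]: Lid opened.
"""

if __name__ == "__main__":
    for line in i915_lines(SAMPLE):
        print(line)
    print(parse_hangs(SAMPLE))
```

The same functions could be fed the full output of the `journalctl` command shown above, saved to a file and read as text.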
Then, after the GPU hang happened, the Alt+SysRq+K sequence did nothing. Tried multiple times and also made sure that the hotkeys were indeed active:
```
[user@dom0 ~]$ sysctl kernel.sysrq
kernel.sysrq = 4
```
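As a cross-check of what `kernel.sysrq = 4` permits, the value can be decoded as a bitmask. The following Python sketch uses the bit meanings as listed in the kernel's sysrq documentation; the helper name `decode_sysrq` is hypothetical. Under that reading, a value of 4 enables keyboard control (SAK, unraw), which is the group Alt+SysRq+K belongs to.

```python
# Bit meanings as listed in the Linux kernel's sysrq documentation.
SYSRQ_BITS = {
    0x2: "control of console logging level",
    0x4: "control of keyboard (SAK, unraw)",
    0x8: "debugging dumps of processes etc.",
    0x10: "sync command",
    0x20: "remount read-only",
    0x40: "signalling of processes (term, kill, oom-kill)",
    0x80: "reboot/poweroff",
    0x100: "nicing of all RT tasks",
}


def decode_sysrq(value):
    """Return the list of SysRq function groups enabled by kernel.sysrq."""
    if value == 0:
        return []  # SysRq disabled completely
    if value == 1:
        return list(SYSRQ_BITS.values())  # all functions enabled
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


if __name__ == "__main__":
    print(decode_sysrq(4))
```

So the hotkey was allowed by the sysctl setting, which is consistent with the conclusion that the lack of response had a deeper cause.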
So the problem might be somewhere deeper than just, as I initially thought, a faulty X11 configuration/installation in my case, which happens to be:
```
[user@dom0 ~]$ cat /etc/X11/xorg.conf.d/20-gfx.conf
Section "device"
Identifier "intel"
Driver "modesetting"
Option "AccelMethod" "glamor"
EndSection
```
There's the (closed) issue https://github.com/QubesOS/qubes-issues/issues/7813, which contains the aforementioned lines containing "i915" in them, but this case is different than mine - I have no visual artifacts, only the mere unresponsiveness. Then, in that issue there's the linked comment https://github.com/QubesOS/qubes-issues/issues/7785#issuecomment-1254095362, which describes my case more precisely.
However, the issue https://github.com/QubesOS/qubes-issues/issues/7785 itself was about Qubes OS 4.1 and the 5.15/5.18 kernels, and it was closed due to a bug about Xorg pages, as the comment at https://github.com/QubesOS/qubes-issues/issues/7785#issuecomment-1320171556 says. Since these characteristics don't match my case, I found opening a new ticket a wiser decision than requesting to reopen the linked one and modifying its title.
Might be related to https://github.com/QubesOS/qubes-issues/issues/7902, but that ticket is also about Qubes OS 4.1 (where I don't recall having any of these GPU hangs) and about the i3 window manager. If it's more appropriate to raise the issue in that ticket, please close this one and let me know that I should paste my report there.
Might be related to the fixes described in https://wiki.archlinux.org/index.php?title=Intel_graphics&oldid=814542#Crash/freeze_on_low_power_Intel_CPUs, but it can be hard to tell if any of these fixes work, considering how rare and random the GPU hangs can get (e.g. once per 3 months).
I could provide more information, like a kernel dump/backtrace, the next time this hang happens, but I'd need guidance on how to prepare for it (should I just use kdump or do something else beforehand?) and on where exactly I can read the specific information (without verbose noise obscuring the valuable parts) that might shed some light on this issue.
### Steps to reproduce
Unknown at the time of writing this ticket; trying to list any would be fortune telling or unrelated noise at best, and misleading information at worst. The "Brief summary" paragraph should be more appropriate in this case.
### Expected behavior
The system works fine, without random GPU hangs that force the user to perform a hard reset.
### Actual behavior
Random GPU hangs during seemingly random and rare times, e.g. once per 2 months, worked-around only with hard resets.
I've also played with CPU power management, but making it use higher power states makes the CPU power-throttle and drop to a lower frequency than with balanced settings.
udev with SUBSYSTEM=power_supply shows various events, even USB-C connect and disconnect.
So maybe it would be better if 90-on-battery.rules monitored AC mains, not the battery charge state.
Your syslog shows that after successfully setting the parameters for AC, it sets the parameters for battery at the end.
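To illustrate the idea of checking AC mains rather than the battery, here is a minimal Python sketch that reads the standard sysfs attributes `type` and `online` under `/sys/class/power_supply`. The function name `mains_online` is made up, and the sketch only looks at supplies whose type is "Mains"; a USB-C charger may be reported under a different type on some machines, so treat this as an assumption to verify on your hardware. It is not what 90-on-battery.rules does.

```python
from pathlib import Path


def mains_online(base="/sys/class/power_supply"):
    """Return True if any Mains supply is online, False if all are offline,
    and None if no Mains supply was found at all."""
    base_path = Path(base)
    if not base_path.is_dir():
        return None
    found = None
    for supply in sorted(base_path.iterdir()):
        type_file = supply / "type"
        online_file = supply / "online"
        # Batteries have no "online" attribute, so they are skipped here.
        if not (type_file.is_file() and online_file.is_file()):
            continue
        if type_file.read_text().strip() != "Mains":
            continue
        if online_file.read_text().strip() == "1":
            return True
        found = False
    return found


if __name__ == "__main__":
    print(mains_online())
```

A rule or script keyed on this state would not be confused by the battery's own events (charging, full, discharging) arriving after the AC event.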
Another thing: according to my tests and some articles on APM/ACPI, it's not a good idea to force the CPU to a low frequency, because when it can't raise its frequency to do the job, it stays longer in a high power state and consumes …