Qubes OS freezes, crashes and reboots

AFAIK the “kernel used is qubes” setting in Global Settings corresponds to the installed kernel-qubes-vm packages; kernel-latest is a kernel for dom0.
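
If it helps, here is a quick way to see that distinction on a running system (a minimal dom0 sketch, using only standard Qubes tools):

# dom0 kernel currently running (installed by the kernel / kernel-latest packages):
uname -r

# kernels available to qubes (installed by kernel-qubes-vm; this is what Global Settings lists):
ls /var/lib/qubes/vm-kernels/
qubes-prefs default_kernel

# all kernel packages present in dom0:
rpm -qa 'kernel*' | sort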

How did the Retbleed mitigation impact Haswell/Sandy Bridge?

I think the patch was released the same week as 4.1.1, and it did have a noticeable impact on some systems.

I don’t think it’s crashing your computer, but it could be part of the reason why it’s getting hotter.
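
Not authoritative, but you can at least see what the dom0 kernel itself reports about Retbleed and the other mitigations (the sysfs entries only exist on kernels that carry the corresponding mitigation code, and Xen applies its own mitigations for the hypervisor separately):

# in dom0: print each vulnerability entry together with its mitigation status
grep . /sys/devices/system/cpu/vulnerabilities/*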

Today another freeze, once again during an update. This time, however, the log yielded what I hope is a hint, which I posted to the existing issue #7693.

Glitches again for the whole day (intentionally not restarting Qubes), but no freezing so far. Will try to update dom0 now to provoke a freeze.
If I don’t come back, tell the devs I loved them anyway. :rofl:

Had the same experience over the summer. I run testing and kernel-latest because my machine is newer hardware, which is now 2 years old, but that doesn’t mean anything in the Xen world.

  • sys-usb-dvm does not always find the mouse and I have to restart the DVM (see the sketch after this list)
  • graphical artifacts (“fractals”) in the tray sometimes
  • updating VMs right when the star icon lights up seemed to be a bad idea; it would crash to the login screen with AppVMs still running
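
For the first point, this is roughly what “restart the dvm” means in practice (a minimal dom0 sketch, assuming the qube really is named sys-usb-dvm as above):

# shut the USB qube down cleanly, start it again, then re-check whether the mouse is found
qvm-shutdown --wait sys-usb-dvm
qvm-start sys-usb-dvm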

dom0 widget-wrapper[14221]: python3: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
dom0 systemd-coredump[28361]: Process 10756 (xss-lock) of user 1000 dumped core.
dom0 qrexec-policy-e[28371]: error calling qrexec-policy-agent in dom0
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/qrexec/tools/qrexec_policy_exec.py", line 133, in execute
    await super().execute(caller_ident)
  File "/usr/lib/python3.8/site-packages/qrexec/policy/parser.py", line 556, in execute
    raise ExecutionFailed('qrexec-client failed: {}'.format(command))
qrexec.exc.ExecutionFailed: qrexec-client failed: ['/usr/lib/qubes/qrexec-client', '-d', 'dom0', '-c', 'SOCKET12,sys-net,1', '-E', 'QUBESRPC qubes.WindowIconUpdater+ sys-net keyword adminvm']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/qrexec/tools/qrexec_policy_exec.py", line 151, in notify
    await call_socket_service(guivm, service, source_domain, params)
  File "/usr/lib/python3.8/site-packages/qrexec/server.py", line 105, in call_socket_service_local
    reader, writer = await asyncio.open_unix_connection(path)
  File "/usr/lib64/python3.8/asyncio/streams.py", line 111, in open_unix_connection
    transport, _ = await loop.create_unix_connection(
  File "/usr/lib64/python3.8/asyncio/unix_events.py", line 244, in create_unix_connection
    await self.sock_connect(sock, path)
  File "/usr/lib64/python3.8/asyncio/selector_events.py", line 496, in sock_connect
    return await fut
  File "/usr/lib64/python3.8/asyncio/selector_events.py", line 501, in _sock_connect
    sock.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

dom0 qrexec[28371]: qubes.WindowIconUpdater: sys-net → dom0: error while executing: qrexec-client failed: ['/usr/lib/qubes

dom0 lvm[4480]: No longer monitoring thin pool qubes_dom0-vm–pool-tpool.
dom0 lvm[4480]: Monitoring thin pool qubes_dom0-vm–pool-tpool.
dom0 systemd-coredump[15815]: Process 10276 (Xorg) of user 0 dumped core.

Stack trace of thread 10276:

Just a few significant-looking red parts of journalctl.

My conclusion was to wait before updating the system until it goes into a more idle state, and that either heat spikes or some sort of bottleneck pulling data from my NVMe causes the crash to the login screen.
Not sure how to test the performance of my NVMe or its current condition (NVMe drives do get slower over time, and there might or might not be a firmware patch that actually works).
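
If anyone wants to sanity-check their NVMe the same way, this is roughly what I’d run (generic Linux tools, nothing Qubes-specific; assumes the drive is /dev/nvme0n1 and that smartmontools, nvme-cli and hdparm are installed wherever you run them):

# SMART health, error counters and wear indicators:
sudo smartctl -a /dev/nvme0n1
sudo nvme smart-log /dev/nvme0

# crude sequential read benchmark to compare against the drive's spec sheet:
sudo hdparm -t /dev/nvme0n1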

Waiting instead of instantly firing up updates seemed to do the trick.
“Seemed”, because there were times when it did not, though rather seldom.

Maybe I should post the journalctl errors on GitHub, even though I fear my mobo might be too new. Is there already a thread?

Upgraded a Lenovo X1 Carbon Gen 8 from 4.0 to 4.1 via qubes-dist-upgrade. Like other people, I rapidly encountered serious failures (hangs/reboots) under mild load (plus the graphics problems reported elsewhere, apparently fixed with the intel driver and i915.force_probe=* as in https://github.com/Qubes-Community/Contents/blob/master/docs/troubleshooting/intel-igfx-troubleshooting.md). I tried a few random combinations of workarounds suggested in various threads, and so far I haven’t been able to trigger a failure with the following combination: dom0 kernel 5.4 (which at the moment means 5.4.203) + turning off swap (swapoff -a) in dom0, as suggested in https://forum.qubes-os.org/t/experiencing-frequent-kernel-hangs-on-qube-4-1-with-5-4-80-1-qubes-kernel/2187.
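
In case anyone wants to replicate the swap part of that combination, this is roughly what it looks like in dom0 (the LV name below is the one from a default install; check yours with swapon --show before editing anything):

# disable swap for the running session:
sudo swapoff -a
swapon --show    # should print nothing afterwards

# to keep it off across reboots, comment out the swap entry in /etc/fstab, e.g.:
sudo sed -i 's|^/dev/mapper/qubes_dom0-swap|#&|' /etc/fstab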

I had a few days and nights without freezes (which really impressed and surprised me).
However, today I had two freezes, both within a short time after system boot (previously that would generally happen only after a few hours of idling).

Now had a freeze, no obvious trigger, after a few days with 5.4 without swap. Will try 5.18 for a bit.

No freezes for more than 14 days, since that update.

The 5.19 line is known for “numerous issues” according to qubes-devel.

Anyway, the 5.10 LTS kernel line (qubes-dom0-update kernel-510) runs fairly stably for me. The newer ones just crash whenever sys-usb is started.
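
For anyone wanting to try the same, the steps are roughly as follows (kernel-510 being the package name from the command above; pick the matching entry in the GRUB menu after installing if it doesn’t become the default):

# in dom0: install the 5.10 LTS dom0 kernel line
sudo qubes-dom0-update kernel-510

# after rebooting into it, confirm which kernel dom0 is actually running:
uname -r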

I’m not aware of one, and what you say might actually confirm that, as I suspected, the issues aren’t related to the kernels but rather to Qubes and Xen.
The simple fact is that the issues began this summer: all kernels >= 5.10 up to that point (I can’t remember which was the last, let’s say some 5.15.64) never produced the issues of this topic, but now even <= 5.15 kernels aren’t stable.

Well, I guess the issues are related to a combination of the Xen version and the Linux kernel version. I’m not sure whether it’s any particular fault of the Qubes OS devs, apart from them adopting new dom0 kernel and Xen versions pretty quickly. This helps with newer hardware, but may well break older hardware.
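
For what it’s worth, the exact combination is easy to read off in dom0 when reporting these issues (standard tools only):

# Xen version as seen by the hypervisor, and the dom0 kernel version:
sudo xl info | grep -E 'xen_version|xen_extra'
uname -r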

On my T530, Xen 4.14.5 and dom0 Linux kernel 5.10.136.1 work relatively stably, even if performance is still way worse than it was with 4.0 (but that’s another topic). Newer kernel versions don’t work for me, but I’m experiencing a different issue than you (dom0 crash on sys-usb start, apparently due to a very specific piece of hardware attached to sys-usb).

From all the topics on the forum and qubes-issues, I currently feel that with 4.1 every user may have to experiment to find a working dom0 kernel for their hardware… it certainly shouldn’t be that way, but I fear these are upstream Xen or Linux regressions.
I’m also under the impression that the Qubes devs have little time for these very hardware-specific topics, as they are very time-intensive to debug (plus hardware costs). Both time and money are apparently scarce on the Qubes OS project, unfortunately. E.g. I’m not sure who of the devs still has funding to work full time on the project… Considering the relatively small dev base of the project, it’s actually amazing that it’s still alive, and I’m very grateful for that.

Besides everything else you wrote, I particularly agree with this perspective.

(posting here after seeing your recent post on qubes-devel)

I have a T450s that I bought brand new in 2016 with the best model specs available, exclusively for use with Qubes OS - I haven’t used anything other than Qubes OS since then. With 3.x and 4.0 I seldom had issues (maybe the occasional hard freeze), but for a few months now (unfortunately I can’t really tell since when - probably since Feb/March) I get display corruption/glitches when the laptop is undocked, so bad that it’s impractical to write in a VM’s terminal (no corruption in dom0 though) - it reminds me of writing blindly, trying to anticipate return packets, when using a 1200 baud modem back in the day. I tried switching to the intel driver, which fixed the corruption, but I’d then get hard freezes so often that I reverted back to fbdev. The level of screen corruption varies and I couldn’t find any pattern, but it seems a bit better after a reboot. These days my laptop is docked 99% of the time, so I haven’t tried to really investigate this - but it’s clearly an issue.
Also - for a while now I can’t work for more than 5-15 minutes with LibreOffice Writer: all the VM’s windows disappear (yet the VM is still functional with qvm-* commands in dom0). I’ve lost hours of work because of that, and as a stupid workaround I’ve set LibreOffice’s autosave to 1 minute, which kills my laptop when working with large docs.
I haven’t found a pattern, but it seems that anything graphically intensive crashes the VM’s GUI - or worse, triggers a hard freeze. When a VM’s GUI crashes, guid.*.log usually shows “XshmAttach failed for window […]”.
I’m also getting more random hard freezes than before - not only with graphics-intensive apps - but nothing reproducible. There’s indeed an overall feeling that the current Qubes OS version isn’t as stable as before. I took the “lazy” approach of waiting for someone else to report/fix this - given that T450s laptops are common (I think @adw has/had one) - but it didn’t happen (plus I don’t have much time these days to spend on debugging). I’d be happy to help if needed.
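
In case it helps anyone hitting the same disappearing-windows problem, this is what I look at in dom0 (paths as on a default 4.1 install; adjust the qube name to yours):

# one GUI daemon log per qube; search them for the errors mentioned above:
ls /var/log/qubes/guid.*.log
grep -i "failed for window" /var/log/qubes/guid.*.log

# confirm the affected qube is in fact still running even though its windows are gone:
qvm-ls --running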

Just had dom0 updates now:

Summary
Updating dom0

local:
    ----------
    kernel:
        ----------
        new:
            1000:5.15.64-1.fc32.qubes,1000:5.15.68-1.fc32.qubes,1000:5.15.74-1.fc32.qubes
        old:
            1000:5.15.63-1.fc32.qubes,1000:5.15.64-1.fc32.qubes,1000:5.15.68-1.fc32.qubes
    kernel-qubes-vm:
        ----------
        new:
            1000:5.15.64-1.fc32.qubes,1000:5.15.68-1.fc32.qubes,1000:5.15.74-1.fc32.qubes
        old:
            1000:5.15.63-1.fc32.qubes,1000:5.15.64-1.fc32.qubes,1000:5.15.68-1.fc32.qubes
    python3-xen:
        ----------
        new:
            2001:4.14.5-9.fc32
        old:
            2001:4.14.5-8.fc32
    xen:
        ----------
        new:
            2001:4.14.5-9.fc32
        old:
            2001:4.14.5-8.fc32
    xen-hypervisor:
        ----------
        new:
            2001:4.14.5-9.fc32
        old:
            2001:4.14.5-8.fc32
    xen-libs:
        ----------
        new:
            2001:4.14.5-9.fc32
        old:
            2001:4.14.5-8.fc32
    xen-licenses:
        ----------
        new:
            2001:4.14.5-9.fc32
        old:
            2001:4.14.5-8.fc32
    xen-runtime:
        ----------
        new:
            2001:4.14.5-9.fc32
        old:
            2001:4.14.5-8.fc32

Worth following whether anything changes with the crashes now…

Weird, I am not experiencing any of the problems listed above on my Librem 15 with R4.1 and 32 GB of RAM. No crashes, freezes, or reboots. Only slowness like this.

Sorry, I haven’t experienced any freezing or crashes on my T450s with Qubes 4.1. Sometimes it feels a bit slower than 4.0. (But that’s just a feeling. I haven’t done any scientific testing.) My usage seems somewhat light compared to a lot of power users and developers around here, though.

Hello everyone,
I started having very similar issues/crashes to yours on my ThinkPad P1 Gen 3 laptop:

$ lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation CometLake-H GT2 [UHD Graphics] (rev 05)
$ uname -r
5.18.16.1.fc32.qubes.x86_64

The trace I am getting is the following:

Oct 17 15:47:53 dom0 kernel: BUG: Bad page map: 992 messages suppressed
Oct 17 15:47:53 dom0 kernel: BUG: Bad page map in process Xorg  pte:8000000adaf14365 pmd:135f77067
Oct 17 15:47:53 dom0 kernel: page:0000000074dfd1dd refcount:1025 mapcount:-1 mapping:0000000000000000 index:0x0 pfn:0x195d94
Oct 17 15:47:53 dom0 kernel: flags: 0x27ffffc0003408(dirty|owner_priv_1|reserved|private|node=0|zone=4|lastcpupid=0x1fffff)
Oct 17 15:47:53 dom0 kernel: raw: 0027ffffc0003408 ffff88810376d300 ffffea0006576540 0000000000000000
Oct 17 15:47:53 dom0 kernel: raw: 0000000000000000 0000134500000007 00000401fffffffe 0000000000000000
Oct 17 15:47:53 dom0 kernel: page dumped because: bad pte
Oct 17 15:47:53 dom0 kernel: addr:00007ed79a637000 vm_flags:1c0600f9 anon_vma:0000000000000000 mapping:ffff888100290508 index:7af
Oct 17 15:47:53 dom0 kernel: file:gntdev fault:0x0 mmap:gntdev_mmap [xen_gntdev] readpage:0x0
Oct 17 15:47:53 dom0 kernel: CPU: 2 PID: 6715 Comm: Xorg Tainted: G    B   W         5.18.16-1.fc32.qubes.x86_64 #1
Oct 17 15:47:53 dom0 kernel: Hardware name: LENOVO 20TJS2F44A/20TJS2F44A, BIOS N2VET37W (1.22 ) 01/18/2022
Oct 17 15:47:53 dom0 kernel: Call Trace:
Oct 17 15:47:53 dom0 kernel:  <TASK>
Oct 17 15:47:53 dom0 kernel:  dump_stack_lvl+0x45/0x5e
Oct 17 15:47:53 dom0 kernel:  print_bad_pte.cold+0x6a/0xc5
Oct 17 15:47:53 dom0 kernel:  zap_pte_range+0x430/0x8b0
Oct 17 15:47:53 dom0 kernel:  ? __raw_callee_save_xen_pmd_val+0x11/0x22
Oct 17 15:47:53 dom0 kernel:  zap_pmd_range.isra.0+0x1b8/0x2f0
Oct 17 15:47:53 dom0 kernel:  zap_pud_range.isra.0+0xa9/0x1e0
Oct 17 15:47:53 dom0 kernel:  unmap_page_range+0x16c/0x200
Oct 17 15:47:53 dom0 kernel:  unmap_vmas+0x83/0x100
Oct 17 15:47:53 dom0 kernel:  unmap_region+0xbd/0x120
Oct 17 15:47:53 dom0 kernel:  __do_munmap+0x177/0x350
Oct 17 15:47:53 dom0 kernel:  __vm_munmap+0x75/0x120
Oct 17 15:47:53 dom0 kernel:  __x64_sys_munmap+0x17/0x20
Oct 17 15:47:53 dom0 kernel:  do_syscall_64+0x59/0x90
Oct 17 15:47:53 dom0 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Oct 17 15:47:53 dom0 kernel: RIP: 0033:0x7ed7a34e237b
Oct 17 15:47:53 dom0 kernel: Code: 8b 15 21 6b 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb 89 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ed 6a 0c 00 f7 d8 64 89 01 48
Oct 17 15:47:53 dom0 kernel: RSP: 002b:00007fff098b4488 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
Oct 17 15:47:53 dom0 kernel: RAX: ffffffffffffffda RBX: 0000000000000055 RCX: 00007ed7a34e237b
Oct 17 15:47:53 dom0 kernel: RDX: 00007fff098b44a0 RSI: 0000000000055000 RDI: 00007ed79a637000
Oct 17 15:47:53 dom0 kernel: RBP: 00007ed79a637000 R08: 0000000000000008 R09: 0000000000000000
Oct 17 15:47:53 dom0 kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000009
Oct 17 15:47:53 dom0 kernel: R13: 00006080c7ff49d8 R14: 000000000000005f R15: 00006080c6bc9e00
Oct 17 15:47:53 dom0 kernel:  </TASK>

These problems started to pop up when:

  1. I switched my window manager from xfce4 to i3
  2. Then I started to see small random graphical artifacts, so I switched the Xorg driver from modesetting to intel according to the following issue: Use generic modesetting driver instead i915/i965 as default · Issue #4782 · QubesOS/qubes-issues · GitHub
  3. The artifacts were gone, but my system started crashing.

Right now I have switched back to xfce4 with the modesetting driver and everything seems to be back to normal.
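
For reference, the driver switch in step 2 boils down to a small Xorg snippet along these lines (a sketch only - the file name is arbitrary, and reverting means deleting the file or setting the driver back to modesetting):

# /etc/X11/xorg.conf.d/20-intel.conf
Section "Device"
    Identifier "Intel Graphics"
    Driver     "intel"
EndSection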

I think that is not the same issue that affects other users, although it may reflect some common underlying problem.

So where exactly are we at with this? Has anyone got official word from any of the developers? To me the only thing more unacceptable than foolishly releasing an operating system that has blatant stability issues is to leave users completely in the dark while doing so. Perhaps I am just not active in the right channels, in which case can somebody please point me to where I can get updates on actual work that is being done to solve this problem?

The frustrating thing is that I’ve spent so much time adjusting my workflow to fit in with Qubes that I now struggle very much to use a “traditional” operating system. I can’t just accept that crashing every few hours is fine, though. Stuck between a rock and a hard place.

You have to wonder how it would be possible for Qubes to release 4.1.1 without testing it. These aren’t the usual hardware issues that people face with Qubes; this is certified hardware regressing to the point of being unusable.

Who provides funding to Qubes? Who are the investors? Surely anyone who has an interest in the success of this operating system would be appalled at what is going on.

Can anyone who previously experienced crashes and is no longer experiencing them post here? I think I will need to purchase a new laptop and would like to get some hardware recommendations.
