QubesOS freeze, crash and reboots

@0x9060 instead of moving your post into its own thread, let me try and
say this:

  1. your post has nothing to do with the topic of this thread
  2. your sentiments about RAM are echoed in countless other threads which
    you can easily find and participate in using the forum’s search
  3. if you are serious about “walled garden” please start a new thread
    under ‘General Discussion’ and try to make an actual argument

Others: if you feel the need to respond, please do so in a separate
thread and keep this one focused on the topic at hand. Thank you!


For the record, I just had another major freeze and reboot on kernel 5.15.64. I was simply moving a window (I use the i3 tiling window manager). The logs show the same as reported above, a massive Xorg crash/bug:

-- Reboot --
Oct 01 09:25:10 dom0 kernel: addr:000072ae93871000 vm_flags:1c0600f9 anon_vma:0000000000000000 mapping:ffff88810186dca8 index:1c54
Oct 01 09:25:10 dom0 kernel: page dumped because: bad pte
Oct 01 09:25:10 dom0 kernel: raw: 0000000000000000 0000540000000007 00000001fffffffe 0000000000000000
Oct 01 09:25:10 dom0 kernel: raw: 0027ffffc0003408 ffff8881156f5180 ffffea0004a4e000 0000000000000000
Oct 01 09:25:10 dom0 kernel: flags: 0x27ffffc0003408(dirty|owner_priv_1|reserved|private|node=0|zone=4|lastcpupid=0x1fffff)
Oct 01 09:25:10 dom0 kernel: page:00000000e5385bb0 refcount:1 mapcount:-1 mapping:0000000000000000 index:0x0 pfn:0x12937f
Oct 01 09:25:10 dom0 kernel: BUG: Bad page map in process Xorg pte:80000003445e0365 pmd:10b9fe067
Oct 01 09:25:10 dom0 kernel:
Oct 01 09:25:10 dom0 kernel: R13: 0000563f5af5f058 R14: 0000000000000069 R15: 0000563f59749e00
Oct 01 09:25:10 dom0 kernel: R10: 000072aeab02e000 R11: 0000000000000206 R12: 0000000000000009
Oct 01 09:25:10 dom0 kernel: RBP: 000072ae93870000 R08: 0000000000000008 R09: 0000000000000000
Oct 01 09:25:10 dom0 kernel: RDX: 00007fff4ea73850 RSI: 00000000003c8000 RDI: 000072ae93870000
Oct 01 09:25:10 dom0 kernel: RSP: 002b:00007fff4ea73838 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
Oct 01 09:25:10 dom0 kernel: Code: 8b 15 21 6b 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb 89 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 0b 00 00 00 0f>
Oct 01 09:25:10 dom0 kernel: RIP: 0033:0x72aeaf35d37b
Oct 01 09:25:10 dom0 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Oct 01 09:25:10 dom0 kernel: do_syscall_64+0x38/0x90
Oct 01 09:25:10 dom0 kernel: __x64_sys_munmap+0x28/0x40
Oct 01 09:25:10 dom0 kernel: __vm_munmap+0x75/0x120
Oct 01 09:25:10 dom0 kernel: __do_munmap+0x1f5/0x4e0
Oct 01 09:25:10 dom0 kernel: unmap_region+0xbd/0x120
Oct 01 09:25:10 dom0 kernel: unmap_vmas+0x83/0x100
Oct 01 09:25:10 dom0 kernel: unmap_page_range+0x17a/0x210
Oct 01 09:25:10 dom0 kernel: zap_pud_range.isra.0+0xaa/0x1e0
Oct 01 09:25:10 dom0 kernel: zap_pmd_range.isra.0+0x1cc/0x2d0
Oct 01 09:25:10 dom0 kernel: ? __raw_callee_save_xen_pmd_val+0x11/0x22
Oct 01 09:25:10 dom0 kernel: zap_pte_range+0x388/0x7d0
Oct 01 09:25:10 dom0 kernel: print_bad_pte.cold+0x6a/0xc5
Oct 01 09:25:10 dom0 kernel: dump_stack_lvl+0x46/0x5e
Oct 01 09:25:10 dom0 kernel:

…etc

As I said, it’s most probably not related to the kernel, but to Qubes and/or Xen and something GUI-related.

I also saw these errors around the time of the freezes:

[user@dom0 ~]$ sudo journalctl --reverse | grep "GPU hang"
Sep 29 xx:xx:29 dom0 kernel: i915 0000:00:0xx.0: [drm] Xorg[4220] context reset due to GPU hang

But after a dom0 update later that day (stubdom-linux, xen and other major updates), I haven’t faced it so far while on kernel 5.19.9-1.
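For anyone who wants to check the previous boot for the same signatures, here is a rough sketch. The function name `crash_lines` is made up, and the patterns are just the strings quoted in this thread, so treat it as a starting point rather than a complete list.

```shell
# Hypothetical helper: keep only log lines that match the crash
# signatures quoted in this thread (GPU hang, bad page map, core dumps).
crash_lines() {
    grep -E 'GPU hang|Bad page map|bad pte|dumped core'
}

# Usage in a dom0 terminal (-b -1 selects the previous boot):
#   sudo journalctl -b -1 --no-pager | crash_lines
```

If nothing matches, that only means none of these particular strings were logged before the reboot, not that the boot was clean.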

I am on a Librem 14 and it’s crashing. The keyboard always stops responding first, and then it goes downhill: the task bar disappears, etc., and I have to power off and on. It worked reliably for over a year; this just started happening in the past few days.

This problem is really bad: my Librem is crashing 10 times a day!!!

My Librem is now even freezing during boot: I get halfway through entering my disk password and it freezes.

Well on the weekend it happened again, this time not during an update but when starting a qube …

-- Reboot --
Oct 01 23:51:03 dom0 qubesd[1393]: Renamed file: '/var/lib/qubes/appvms/social/root-dirty.img~g5cwbnid' -> '/var>
Oct 01 23:51:03 dom0 qubesd[1393]: Reflinked file: '/var/lib/qubes/appvms/social/root.img' -> '/var/lib/qubes/ap>
Oct 01 23:51:02 dom0 qubesd[1393]: Renamed file: '/var/lib/qubes/appvms/social/volatile-dirty.img~ku9xxxni' -> '>
Oct 01 23:51:02 dom0 qubesd[1393]: Hardlinked file: '/var/lib/qubes/vm-templates/debian-11-web/root.img' -> '/va>
Oct 01 23:51:02 dom0 qubesd[1393]: Renamed file: '/var/lib/qubes/appvms/social/private-precache.img' -> '/var/li>
Oct 01 23:51:02 dom0 qubesd[1393]: Created sparse file: '/var/lib/qubes/appvms/social/volatile-dirty.img~ku9xxxn>
Oct 01 23:51:02 dom0 qubesd[1393]: vm.social: Starting social
Oct 01 23:51:02 dom0 qrexec-policy-daemon[1490]: qrexec: qubes.OpenURL+: mail -> @dispvm: allowed to social
Oct 01 23:47:41 dom0 qrexec-policy-daemon[1490]: qrexec: qubes.OpenInVM+: brain -> @dispvm: allowed to offline

Maybe one of the issues we are seeing here (I am convinced that there are several issues mixed together in this thread) has to do with starting qubes/VMs which obviously happens several times when running an update.

There is that script for testing startup speed that starts the minimal VM 10 times; you could try something similar to confirm that starting a qube triggers the error.
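A rough sketch of such a loop, to be run from a dom0 terminal. This is not the startup-speed script itself; `cycle_qube` and the `QVM_START` / `QVM_SHUTDOWN` variables are names made up for illustration, and the only real tools assumed are `qvm-start` and `qvm-shutdown --wait`. Pick a qube you can afford to restart repeatedly.

```shell
# Hypothetical helper: start and stop one qube repeatedly.
# The two variables default to the real Qubes tools and can be
# overridden (e.g. with "true") for a dry run.
QVM_START="${QVM_START:-qvm-start}"
QVM_SHUTDOWN="${QVM_SHUTDOWN:-qvm-shutdown}"

cycle_qube() {
    qube="$1"          # name of the qube to cycle
    count="${2:-10}"   # number of start/stop cycles, 10 by default
    i=1
    while [ "$i" -le "$count" ]; do
        echo "cycle $i/$count: starting $qube"
        "$QVM_START" "$qube" || { echo "start failed on cycle $i" >&2; return 1; }
        "$QVM_SHUTDOWN" --wait "$qube" || { echo "shutdown failed on cycle $i" >&2; return 1; }
        i=$((i + 1))
    done
    echo "completed $count cycles without a start/stop error"
}

# Usage in dom0 (the qube name is a placeholder):
#   cycle_qube my-test-qube 10
```

If the machine freezes partway through, the last "cycle N" line printed (or the journal from the previous boot) shows how many starts it survived.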


I experienced freezes and reboots on my Q4.1 a few times a day with dom0 kernel 5.15.x. Such a buggy OS is becoming unusable from the user’s point of view. So far it seems to be fixed for me by installing the latest dom0 kernel, 5.18.

qubes-dom0-update kernel-latest


I am unable to run my Librem for more than 30 minutes without a crash. Sometimes it does not even get past entering the FDE password.

I noticed a correlation between the keyboard backlight turning on and the crashes. I never turn it on; it comes on by itself.

I ran this command on 4.1.1 and it downloaded something, and it says the kernel is installed, but when I go to Qubes Global Settings the only kernel available is still 5.15, not 5.18.

AFAIK the kernel shown in Global Settings corresponds to the installed kernel-qubes-vm packages. kernel-latest is a kernel for dom0.
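To see the two sets of kernels side by side, something like the sketch below may help. It assumes the default Qubes 4.x location for VM kernels, `/var/lib/qubes/vm-kernels`; the function name `list_vm_kernels` is made up.

```shell
# Hypothetical helper: list the VM kernels installed in dom0.
# The default directory is an assumption (Qubes 4.x default location);
# pass another path as the first argument to override it.
list_vm_kernels() {
    dir="${1:-/var/lib/qubes/vm-kernels}"
    ls -1 "$dir"
}

# Usage in dom0:
#   uname -r           # kernel dom0 itself is running
#   list_vm_kernels    # kernels that qubes can be set to use
```

If 5.18 shows up in `uname -r` after a reboot but not in the VM kernel list, that matches the explanation above: kernel-latest only changed dom0. As far as I know the VM counterpart is the separate `kernel-latest-qubes-vm` package, but check what your repositories actually offer before relying on that name.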

How did the retbleed mitigation impact haswell/sandy?

I think the patch was released the same week as 4.1.1, and it did have a noticeable impact on some systems.

I don’t think it’s crashing your computer, but it could be part of the reason why it’s getting hotter.

Today another freeze, once again during update. This time however the log yielded a hint I hope, which I posted into the existing issue #7693.

Glitches again for the whole day (intentionally not restarting Qubes), but no freezing so far. Will try to update dom0 now to provoke freeze.
If I don’t come back tell the devs I loved them anyways. :rofl:

Had the same experience over the summer. I run testing and kernel-latest because the machine is newer hardware, which is now 2 years old, but that doesn’t mean anything in the Xen world.

  • sys-usb-dvm does not always find the mouse and I have to restart the dvm
  • fractals in the tray sometimes
  • it seemed that updating VMs right when the star sign lights up was a bad idea; it would crash to the login screen with appVMs still running

dom0 widget-wrapper[14221]: python3: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
dom0 systemd-coredump[28361]: Process 10756 (xss-lock) of user 1000 dumped core.
dom0 qrexec-policy-e[28371]: error calling qrexec-policy-agent in dom0
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/qrexec/tools/qrexec_policy_exec.py", line 133, in execute
    await super().execute(caller_ident)
  File "/usr/lib/python3.8/site-packages/qrexec/policy/parser.py", line 556, in execute
    raise ExecutionFailed('qrexec-client failed: {}'.format(command))
qrexec.exc.ExecutionFailed: qrexec-client failed: ['/usr/lib/qubes/qrexec-client', '-d', 'dom0', '-c', 'SOCKET12,sys-net,1', '-E', 'QUBESRPC qubes.WindowIconUpdater+ sys-net keyword adminvm']

                                         During handling of the above exception, another exception occurred:
                                         
                                         Traceback (most recent call last):
                                           File "/usr/lib/python3.8/site-packages/qrexec/tools/qrexec_policy_exec.py", line 151, in notify
                                             await call_socket_service(guivm, service, source_domain, params)
                                           File "/usr/lib/python3.8/site-packages/qrexec/server.py", line 105, in call_socket_service_local
                                             reader, writer = await asyncio.open_unix_connection(path)
                                           File "/usr/lib64/python3.8/asyncio/streams.py", line 111, in open_unix_connection
                                             transport, _ = await loop.create_unix_connection(
                                           File "/usr/lib64/python3.8/asyncio/unix_events.py", line 244, in create_unix_connection
                                             await self.sock_connect(sock, path)
                                           File "/usr/lib64/python3.8/asyncio/selector_events.py", line 496, in sock_connect
                                             return await fut
                                           File "/usr/lib64/python3.8/asyncio/selector_events.py", line 501, in _sock_connect
                                             sock.connect(address)
                                         ConnectionRefusedError: [Errno 111] Connection refused

dom0 qrexec[28371]: qubes.WindowIconUpdater: sys-net -> dom0: error while executing: qrexec-client failed: ['/usr/lib/qubes

dom0 lvm[4480]: No longer monitoring thin pool qubes_dom0-vm--pool-tpool.
dom0 lvm[4480]: Monitoring thin pool qubes_dom0-vm--pool-tpool.
dom0 systemd-coredump[15815]: Process 10276 (Xorg) of user 0 dumped core.

                                          Stack trace of thread 10276:

Just a few significant-looking red parts of journalctl.

My conclusion was to wait until the system settles into a more idle state before updating, and that either heat spikes or some sort of bottleneck reading data from my NVMe causes the crash to the login screen.
I’m not sure how to test my NVMe, or whether NVMe drives getting slower over time plays a role (there may or may not be a firmware patch that actually works).

Waiting before firing up updates seemed to do the trick.
I say “seemed” because there were times when it did not, though rarely.

Maybe I should post the journalctl errors on GitHub, even though I fear my mobo might be too new. Is there already a thread?

Upgraded a Lenovo X1 Carbon Gen8 from 4.0 to 4.1 via qubes-dist-upgrade. Like other people, rapidly encountered serious failures (hangs/reboots) under mild load (plus graphics problems reported elsewhere, apparently fixed with the intel driver and i915.force_probe=* as in https://github.com/Qubes-Community/Contents/blob/master/docs/troubleshooting/intel-igfx-troubleshooting.md). Tried a few random combinations of workarounds suggested in various threads, and so far haven’t been able to trigger a failure with the following combination: dom0 kernel 5.4 (which at the moment means 5.4.203) + turning off swap (swapoff -a) in dom0 as suggested in https://forum.qubes-os.org/t/experiencing-frequent-kernel-hangs-on-qube-4-1-with-5-4-80-1-qubes-kernel/2187.
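A small sketch for confirming that swap really is off in dom0 after `swapoff -a`. The function name `swap_is_active` is made up; it only reads `/proc/swaps`, whose first line is a header. Note that `swapoff -a` on its own only lasts until the next boot.

```shell
# Hypothetical helper: exit 0 if at least one swap device is listed.
# Reads /proc/swaps by default; pass another file as the first argument.
swap_is_active() {
    swaps_file="${1:-/proc/swaps}"
    # Drop the header line, then count the remaining non-empty lines.
    [ "$(sed 1d "$swaps_file" | grep -c .)" -gt 0 ]
}

# Usage in dom0:
#   sudo swapoff -a
#   if swap_is_active; then echo "swap still on"; else echo "swap off"; fi
```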


I had a few days and overnights without freezes (which really impressed and surprised me).
However, today I had two freezes and both within some short amount of time after system boot (generally that would happen after a few hours of idling).

Now had a freeze, no obvious trigger, after a few days with 5.4 without swap. Will try 5.18 for a bit.


No freezes for more than 14 days, since that update.