QubesOS freeze, crash and reboots

The 5.19 line is known for “numerous issues” according to qubes-devel.

Anyway, the 5.10 LTS kernel line (qubes-dom0-update kernel-510) runs fairly stably for me. The newer ones just crash whenever sys-usb is started.

I’m not aware of that, and what you say might actually confirm what I suspected: that the issues aren’t related to kernels, but probably to Qubes and Xen?
The simple fact is that the issues began this summer. Until then, no kernel >= 5.10 (I can’t remember which was the last, let’s say some 5.15.64) ever produced the issues discussed in this topic, but now even kernels <= 5.15 aren’t stable?

Well, I guess the issues are related to a combination of Xen version & Linux kernel version. I’m not sure whether it’s any particular fault with the Qubes OS devs apart from them adopting new dom0 kernel & Xen versions pretty quickly. This helps with newer hardware, but may well break older.

On my T530, Xen 4.14.5 and dom0 Linux kernel 5.10.136.1 are relatively stable, even if performance is still way worse than it was with 4.0 (but that’s another topic). Newer kernel versions don’t work for me, but I’m experiencing a different issue than you (dom0 crashes on sys-usb start, apparently due to a very specific piece of hardware attached to sys-usb).

From all the topics on the forum and qubes-issues I currently feel that with 4.1 every user may have to experiment to find a working dom0 kernel for his/her hardware… it certainly shouldn’t be that way, but I fear these are upstream Xen or Linux regressions.
I’m also under the impression that the Qubes devs have little time for these very hardware specific topics as they are very time intensive to debug (+ hardware costs). Both time and money are apparently scarce on the Qubes OS project unfortunately. E.g. I’m not sure who of the devs still has funding to work full time on the project… Considering the relatively small dev base of the project, it’s actually amazing that it’s still alive and I’m very grateful for that.

Besides everything else you wrote, I particularly agree with this perspective, actually.

(posting here after seeing your recent post on qubes-devel)

I have a T450s that I bought brand new in 2016 with the best model specs available, exclusively for use with Qubes OS - I haven’t used anything other than Qubes OS since then. With 3.x and 4.0 I seldom had issues (maybe the occasional hard freeze), but for the past few months (unfortunately I can’t really tell since when - probably Feb/March) I have been getting display corruption/glitches when the laptop is undocked, so bad that it’s impractical to write in a vm’s terminal (no corruption in dom0 though) - it reminds me of typing blindly, trying to anticipate return packets, on a 1200 baud modem back in the day. I tried switching to the intel driver, which fixed the corruption, but I then got hard freezes so often that I reverted to fbdev. The level of screen corruption varies and I couldn’t find any pattern, but it seems to be a bit better after a reboot. These days my laptop is docked 99% of the time so I haven’t really investigated this - but it’s clearly an issue.
Also, for some time now I haven’t been able to work for more than 5-15 minutes with LibreOffice Writer: all the vm’s windows disappear (yet the vm is still functional with qvm-* commands in dom0). I’ve lost hours of work because of that, and as a stupid workaround I’ve set LibreOffice’s autosave to 1 minute, which kills my laptop when working with large docs.
I haven’t found a pattern, but it seems that anything graphically intensive crashes the vm’s GUI - or worse, triggers a hard freeze. When a vm’s GUI crashes, guid.*.log usually shows “XshmAttach failed for window […]”.
I’m also getting more random hard freezes than before - not only with graphics-intensive apps - but nothing reproducible. Overall, the current Qubes OS version doesn’t feel as stable as before. I took the “lazy” approach of waiting for someone else to report/fix this - given that T450s’ are common (I think @adw has/had one) - but it didn’t happen (+ I don’t have much time these days to spend on debugging stuff). I’d be happy to help if needed.
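
For anyone who wants to check whether their own GUI crashes leave the same message, here is a minimal sketch that counts “XshmAttach failed” lines per qube. It assumes the GUI daemon logs sit in dom0 under /var/log/qubes/ and are named guid.<qube>.log; the path and the helper names are assumptions for illustration, not something confirmed in this thread.

```python
import glob
import os
import re
from collections import Counter

# Assumed dom0 location of the per-qube GUI daemon logs.
LOG_GLOB = "/var/log/qubes/guid.*.log"
MARKER = "XshmAttach failed"


def count_marker(lines, marker=MARKER):
    """Return how many of the given log lines contain the marker text."""
    return sum(1 for line in lines if marker in line)


def scan_guid_logs(pattern=LOG_GLOB):
    """Map qube name -> number of marker lines found in its guid log."""
    hits = Counter()
    for path in sorted(glob.glob(pattern)):
        match = re.fullmatch(r"guid\.(.+)\.log", os.path.basename(path))
        if not match:
            continue
        with open(path, errors="replace") as handle:
            n = count_marker(handle)
        if n:
            hits[match.group(1)] = n
    return hits


if __name__ == "__main__":
    # Print the qubes with the most failures first; prints nothing if no log matches.
    for qube, n in scan_guid_logs().most_common():
        print(f"{qube}: {n} XshmAttach failures")
```

A count that keeps growing for one particular qube would at least narrow down which windows trigger the problem before filing a report.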

Just had dom0 updates now:

Summary
Updating dom0

local:
    ----------
    kernel:
        ----------
        new:
            1000:5.15.64-1.fc32.qubes,1000:5.15.68-1.fc32.qubes,1000:5.15.74-1.fc32.qubes
        old:
            1000:5.15.63-1.fc32.qubes,1000:5.15.64-1.fc32.qubes,1000:5.15.68-1.fc32.qubes
    kernel-qubes-vm:
        ----------
        new:
            1000:5.15.64-1.fc32.qubes,1000:5.15.68-1.fc32.qubes,1000:5.15.74-1.fc32.qubes
        old:
            1000:5.15.63-1.fc32.qubes,1000:5.15.64-1.fc32.qubes,1000:5.15.68-1.fc32.qubes
    python3-xen:
        ----------
        new:
            2001:4.14.5-9.fc32
        old:
            2001:4.14.5-8.fc32
    xen:
        ----------
        new:
            2001:4.14.5-9.fc32
        old:
            2001:4.14.5-8.fc32
    xen-hypervisor:
        ----------
        new:
            2001:4.14.5-9.fc32
        old:
            2001:4.14.5-8.fc32
    xen-libs:
        ----------
        new:
            2001:4.14.5-9.fc32
        old:
            2001:4.14.5-8.fc32
    xen-licenses:
        ----------
        new:
            2001:4.14.5-9.fc32
        old:
            2001:4.14.5-8.fc32
    xen-runtime:
        ----------
        new:
            2001:4.14.5-9.fc32
        old:
            2001:4.14.5-8.fc32

Worth watching whether anything changes with the crashes now…
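
To read a summary like this more quickly, the old and new lists can be compared mechanically. This is a small sketch, assuming the old/new fields hold the comma-separated package versions installed before and after the update, each prefixed with an epoch such as 1000: (the function names are made up for illustration).

```python
def parse_versions(field):
    """Split a comma-separated 'epoch:version' list into plain version strings."""
    return [item.split(":", 1)[-1] for item in field.split(",") if item]


def diff_versions(old_field, new_field):
    """Return (added, removed) versions between the old and new lists."""
    old, new = parse_versions(old_field), parse_versions(new_field)
    added = [v for v in new if v not in old]
    removed = [v for v in old if v not in new]
    return added, removed


# Values copied from the kernel entry in the summary above.
old = "1000:5.15.63-1.fc32.qubes,1000:5.15.64-1.fc32.qubes,1000:5.15.68-1.fc32.qubes"
new = "1000:5.15.64-1.fc32.qubes,1000:5.15.68-1.fc32.qubes,1000:5.15.74-1.fc32.qubes"

added, removed = diff_versions(old, new)
print("installed:", added)
print("removed:  ", removed)
```

For the kernel entry this reports 5.15.74-1.fc32.qubes as newly installed and 5.15.63-1.fc32.qubes as removed; the Xen packages simply move from 4.14.5-8 to 4.14.5-9.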

Weird, I am not experiencing any of the problems listed above on my Librem 15 with R4.1 and 32 GB of RAM. No crashes, freezes, reboots. Only slowness like this.

Sorry, I haven’t experienced any freezing or crashes on my T450s with Qubes 4.1. Sometimes it feels a bit slower than 4.0. (But that’s just a feeling. I haven’t done any scientific testing.) My usage seems somewhat light compared to a lot of power users and developers around here, though.

Hello everyone,
I started to have issues/crashes very similar to yours on my ThinkPad P1 Gen 3 laptop:

$ lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation CometLake-H GT2 [UHD Graphics] (rev 05)
$ uname -r
5.18.16-1.fc32.qubes.x86_64

The trace I am getting is the following:

Oct 17 15:47:53 dom0 kernel: BUG: Bad page map: 992 messages suppressed
Oct 17 15:47:53 dom0 kernel: BUG: Bad page map in process Xorg  pte:8000000adaf14365 pmd:135f77067
Oct 17 15:47:53 dom0 kernel: page:0000000074dfd1dd refcount:1025 mapcount:-1 mapping:0000000000000000 index:0x0 pfn:0x195d94
Oct 17 15:47:53 dom0 kernel: flags: 0x27ffffc0003408(dirty|owner_priv_1|reserved|private|node=0|zone=4|lastcpupid=0x1fffff)
Oct 17 15:47:53 dom0 kernel: raw: 0027ffffc0003408 ffff88810376d300 ffffea0006576540 0000000000000000
Oct 17 15:47:53 dom0 kernel: raw: 0000000000000000 0000134500000007 00000401fffffffe 0000000000000000
Oct 17 15:47:53 dom0 kernel: page dumped because: bad pte
Oct 17 15:47:53 dom0 kernel: addr:00007ed79a637000 vm_flags:1c0600f9 anon_vma:0000000000000000 mapping:ffff888100290508 index:7af
Oct 17 15:47:53 dom0 kernel: file:gntdev fault:0x0 mmap:gntdev_mmap [xen_gntdev] readpage:0x0
Oct 17 15:47:53 dom0 kernel: CPU: 2 PID: 6715 Comm: Xorg Tainted: G    B   W         5.18.16-1.fc32.qubes.x86_64 #1
Oct 17 15:47:53 dom0 kernel: Hardware name: LENOVO 20TJS2F44A/20TJS2F44A, BIOS N2VET37W (1.22 ) 01/18/2022
Oct 17 15:47:53 dom0 kernel: Call Trace:
Oct 17 15:47:53 dom0 kernel:  <TASK>
Oct 17 15:47:53 dom0 kernel:  dump_stack_lvl+0x45/0x5e
Oct 17 15:47:53 dom0 kernel:  print_bad_pte.cold+0x6a/0xc5
Oct 17 15:47:53 dom0 kernel:  zap_pte_range+0x430/0x8b0
Oct 17 15:47:53 dom0 kernel:  ? __raw_callee_save_xen_pmd_val+0x11/0x22
Oct 17 15:47:53 dom0 kernel:  zap_pmd_range.isra.0+0x1b8/0x2f0
Oct 17 15:47:53 dom0 kernel:  zap_pud_range.isra.0+0xa9/0x1e0
Oct 17 15:47:53 dom0 kernel:  unmap_page_range+0x16c/0x200
Oct 17 15:47:53 dom0 kernel:  unmap_vmas+0x83/0x100
Oct 17 15:47:53 dom0 kernel:  unmap_region+0xbd/0x120
Oct 17 15:47:53 dom0 kernel:  __do_munmap+0x177/0x350
Oct 17 15:47:53 dom0 kernel:  __vm_munmap+0x75/0x120
Oct 17 15:47:53 dom0 kernel:  __x64_sys_munmap+0x17/0x20
Oct 17 15:47:53 dom0 kernel:  do_syscall_64+0x59/0x90
Oct 17 15:47:53 dom0 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Oct 17 15:47:53 dom0 kernel: RIP: 0033:0x7ed7a34e237b
Oct 17 15:47:53 dom0 kernel: Code: 8b 15 21 6b 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb 89 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ed 6a 0c 00 f7 d8 64 89 01 48
Oct 17 15:47:53 dom0 kernel: RSP: 002b:00007fff098b4488 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
Oct 17 15:47:53 dom0 kernel: RAX: ffffffffffffffda RBX: 0000000000000055 RCX: 00007ed7a34e237b
Oct 17 15:47:53 dom0 kernel: RDX: 00007fff098b44a0 RSI: 0000000000055000 RDI: 00007ed79a637000
Oct 17 15:47:53 dom0 kernel: RBP: 00007ed79a637000 R08: 0000000000000008 R09: 0000000000000000
Oct 17 15:47:53 dom0 kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000009
Oct 17 15:47:53 dom0 kernel: R13: 00006080c7ff49d8 R14: 000000000000005f R15: 00006080c6bc9e00
Oct 17 15:47:53 dom0 kernel:  </TASK>
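
In the trace above, the faulting mapping belongs to Xorg and is backed by gntdev, the Xen grant-mapping device. With 992 similar messages suppressed, it can help to pull just those fields out of the kernel log. The following is a rough sketch that assumes log lines shaped like the ones above (for example, saved from `journalctl -k`); the function name is hypothetical.

```python
import re

# "BUG: Bad page map in process <name>  pte:<hex> ..." as seen in the trace above.
BAD_PTE = re.compile(r"BUG: Bad page map in process (?P<process>\S+)\s+pte:(?P<pte>[0-9a-f]+)")
# "file:<name> fault:<x> mmap:<handler> ..." line that follows in the same report.
MAPPING = re.compile(r"file:(?P<file>\S+) fault:\S+ mmap:(?P<mmap>\S+)")


def summarize_bad_page_maps(lines):
    """Collect process, backing file and mmap handler for each 'Bad page map' report."""
    reports = []
    current = None
    for line in lines:
        bug = BAD_PTE.search(line)
        if bug:
            current = {"process": bug["process"], "file": None, "mmap": None}
            reports.append(current)
            continue
        mapping = MAPPING.search(line)
        if mapping and current is not None:
            current["file"] = mapping["file"]
            current["mmap"] = mapping["mmap"]
    return reports


# Two lines copied from the trace above.
sample = [
    "Oct 17 15:47:53 dom0 kernel: BUG: Bad page map in process Xorg  pte:8000000adaf14365 pmd:135f77067",
    "Oct 17 15:47:53 dom0 kernel: file:gntdev fault:0x0 mmap:gntdev_mmap [xen_gntdev] readpage:0x0",
]
print(summarize_bad_page_maps(sample))
```

If every report names the same process and the same gntdev mapping, that is worth stating explicitly in a qubes-issues report.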

These problems started to pop up when:

  1. I switched my window manager from xfce4 to i3
  2. Then I started to see small random graphical artifacts, so I switched the Xorg driver from modesetting to intel, following this issue: Use generic modesetting driver instead i915/i965 as default · Issue #4782 · QubesOS/qubes-issues · GitHub
  3. The artifacts were gone, but my system started crashing.

Right now I have switched back to xfce4 with the modesetting driver and everything seems to be back to normal.
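
To double-check which Xorg driver is actually in use after a switch like the one described above, one option is to count the driver tags in the X server log. This is only a sketch: it assumes the log is at /var/log/Xorg.0.log in dom0 and that driver messages carry tags such as modeset(0) or intel(0); both are assumptions worth verifying on your own system.

```python
import os
import re
from collections import Counter

XORG_LOG = "/var/log/Xorg.0.log"  # assumed location in dom0

# Matches driver-tagged lines such as "(II) modeset(0): ..." or "(II) intel(0): ...".
DRIVER_TAG = re.compile(r"\((?:II|WW|EE|==|--|\*\*)\) (\w+)\(\d+\):")


def driver_tags(lines):
    """Count how often each driver tag appears in the given Xorg log lines."""
    tags = Counter()
    for line in lines:
        match = DRIVER_TAG.search(line)
        if match:
            tags[match.group(1)] += 1
    return tags


if __name__ == "__main__":
    # Only read the log if it exists at the assumed path.
    if os.path.exists(XORG_LOG):
        with open(XORG_LOG, errors="replace") as handle:
            for tag, n in driver_tags(handle).most_common():
                print(f"{tag}: {n} lines")
```

The tag with by far the most lines is most likely the driver handling the screen, which makes it easier to confirm that a configuration change really took effect before comparing stability.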

I think that is not the same issue that affects other users, although it
may reflect some common underlying problem.

So where exactly are we at with this? Has anyone got official word from any of the developers? To me the only thing more unacceptable than foolishly releasing an operating system that has blatant stability issues is to leave users completely in the dark while doing so. Perhaps I am just not active in the right channels, in which case can somebody please point me to where I can get updates on actual work that is being done to solve this problem?

The frustrating thing is that I’ve spent so much time adjusting my workflow to fit in with Qubes that I now struggle very much to use a “traditional” operating system. I can’t just accept that crashing every few hours is acceptable though. Stuck between a rock and a hard place.

You have to wonder how it would be possible for Qubes to release 4.1.1 without testing it? It’s not like these are the usual hardware issues that people face with Qubes; this is certified hardware regressing to the point of being unusable.

Who provides funding to Qubes? Who are the investors? Surely anyone who has an interest in the success of this operating system would be appalled at what is going on.

Can anyone who previously experienced crashes and is no longer experiencing them post here? I think I will need to purchase a new laptop and would like to get some hardware recommendations.

Great to see that work is being done, but that doesn’t seem like the issue most people are experiencing here? I don’t crash to the login screen, I either get power off or reboot.

I too am affected by this issue, @howfuniscrashing, and I am seeing community members reporting and interacting, and developers being involved, in the appropriate places (developer mailing list and qubes-issues).

I also agree with your point about this being a big deal and certified hardware being affected. But that’s where our agreement ends.

Everything else about your post makes me think you are entirely mistaken about your place in the community. Have you contributed anything? Helped anyone? Is this your very first post? Does anyone here owe you anything?

Or are you in fact benefiting from a breathtakingly awesome project that is provided to you without any strings attached, open source and without any charge?

Normally I like to say “welcome to the community” … I’ll say it to you too. But please reconsider your manners if in fact you are interested in getting a response.

You think this is my first post? Or perhaps I made an alias to express my frustration.

Perhaps I’ve made many contributions to this project. Perhaps I was involved in a PR that was very recently merged that has the potential to save people in a very big way.

I give as well as take. I’ve analysed significant portions of the qrexec codebase and provided help to many people.

I won’t be reconsidering anything.

Perhaps you don’t understand that many people use Qubes because of its Whonix integration. Perhaps you don’t understand that many people rely on Whonix to stay alive. Perhaps you don’t understand that issuing updates that cause frequent crashing could potentially endanger those that use Qubes solely for Whonix. This is why my tone is the way it is.

There are a number of issues with 4.1.1, some of which have developed over
the past few months. Some of these issues did not become apparent during
the rc phase, or have developed after the release with updated packages.
I raised exactly this issue (certified hardware becoming unusable) in the
testing forum and on the dev mailing list. We are making efforts to
identify the root cause and resolve the problems.
Qubes is funded by its members, and substantially underwritten by ITL.
There are no “investors” as such, but we all have an interest in resolving
these issues and making Qubes a success.

I never presume to speak for the Qubes team.
When I comment in the Forum or in the mailing lists I speak for myself.

That’s all understandable, but after many months of crashing one starts to wonder why you wouldn’t revert back to the latest known good state and only apply the absolutely critical security patches? Or at least make some sort of PSA about what is going on.

As it stands, anyone who downloads Qubes from the website and performs an update is likely to run into crashing issues. They will then either give up on the OS or spend hours and hours and hours trying to find a solution, just to eventually figure out that Qubes are shipping them a known-buggy OS. Is either of these outcomes acceptable?

Still no crashes, and now kernel is 5.19.14-1 with Xen 4.14.5

I make burners for most of my posts. I browse exclusively using disposables and only save passwords when necessary. Why does my identity matter? I think what’s more important is fixing persistent crashing that’s been occurring for 3 months.
