Multiple Qubes crashing at the same time in 4.1-rc2

ddevz · December 18, 2021, 8:37pm

Today I had several qubes lock up to the point that they wouldn’t shut down. It generated kernel messages, but I’ll start at the beginning:

I installed Qubes 4.1-rc2, but when I installed qubes-url-redirector and tried to use it by initiating an “open in disposable qube”, the disposable qube would launch but would never open a web browser. The qube would also never close on its own, but that’s not what I mean by “crash”. It just creates a running qube that does nothing but consume 4 gigs every time I try it. (Note that I didn’t manually close the dead disposables, since I’d have to figure out which disposables were already in use from before I started with qubes-url-redirector.) I locked the screen for the night.

Anyway, I came back the next day, logged in, and about 80% of my windows from various qubes were gone, including all non-disposable web browsers.
I checked, and the system uptime for dom0 was 11 days, so it was not a power failure.
I checked the list of running qubes. All 33 qubes were still running, but most no longer had any windows on the screen. Trying to launch a web browser or a terminal in a running qube did not seem to do anything.
I started sending the shutdown signal to all of the 33 running qubes (except the sys-* qubes).

Most shut down, but a few did not. The hung ones seemed to be just those on which I had tried to launch a new web browser or terminal.

Now, for those that were hung, I looked at the console log on dom0 and got messages like:

blk_update_request: I/O error, dev xvda, sector 415744 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
…
EXT4-fs (xvda3): I/O error while writing superblock
…
Failed to write entry (22 items, 759 bytes) despite vacuuming, ignoring: Input/output error
…

This implies a problem with the dom0 disk, but I checked dmesg on dom0 and it did not have any disk-related errors.
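One way to double-check a log for this kind of message, rather than eyeballing it, is to scan the text for I/O error patterns. The following is a minimal Python sketch, not a Qubes tool: the function name `find_io_errors` and both regular expressions are assumptions modeled only on the three messages quoted above, so other kernels or filesystems may word their errors differently.

```python
import re

# Patterns modeled on the messages quoted in this post (assumption:
# other kernel versions may format these lines differently).
BLK_ERROR = re.compile(r"I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)")
GENERIC_ERROR = re.compile(r"I/O error|Input/output error")


def find_io_errors(lines):
    """Return (line, device, sector) tuples for lines mentioning an I/O error.

    device and sector are None when the line does not name them.
    """
    hits = []
    for line in lines:
        match = BLK_ERROR.search(line)
        if match:
            # Block-layer message: it names the device and the sector.
            hits.append((line.strip(), match.group("dev"), int(match.group("sector"))))
        elif GENERIC_ERROR.search(line):
            # Filesystem or journal message: no device/sector to extract.
            hits.append((line.strip(), None, None))
    return hits
```

To use it, save the log text to a file and pass the open file object (or any list of lines) to `find_io_errors`. An empty result only means none of these particular patterns appeared.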

As further evidence of the dom0 disk being ok, the output of running:
sudo smartctl -a /dev/nvme0
in dom0 includes:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
…

…
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
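If you want to check these two fields across several saved reports, the lines above can be pulled out programmatically. This is a minimal Python sketch under stated assumptions: `summarize_smart` is a hypothetical helper, and it only looks for the two strings shown in the output above, not the full smartctl report format.

```python
import re


def summarize_smart(text):
    """Extract the health verdict and the 'No Errors Logged' marker.

    Assumes the wording shown in the smartctl output quoted in this post.
    """
    health = re.search(r"self-assessment test result:\s*(\S+)", text)
    return {
        # None when the health line is missing from the text.
        "health": health.group(1) if health else None,
        # True only if the exact phrase appears in the report.
        "no_errors_logged": "No Errors Logged" in text,
    }
```

The input would be the saved text of a smartctl report; the function does not run smartctl itself.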

Also note that it did not run out of system memory. It only consumed 110 GB (of 128 GB of RAM).

Any thoughts?

If relevant, the “disk” is an M.2 module.

ppc · December 19, 2021, 2:49am

assuming this is a workstation

seems like a disk error

definitely not a cable problem

ddevz · December 19, 2021, 10:59pm

Not sure. The current definition of what counts as a workstation vs. what counts as a desktop escapes me. But if you’re just asking whether it’s a non-laptop, then yes, it’s a non-laptop.

I suspect you missed this line:

ppc · December 20, 2021, 12:36am

A workstation is just a desktop with much more processing power than is possible in a normal desktop (for example, 128 GB of RAM is not possible in a normal desktop, since most desktop CPUs are limited to 64 GB of RAM).

really, dmesg log?

ddevz · December 20, 2021, 4:07pm

I don’t understand the question.

If it helps, the output of:
[user@dom0 ~]$ sudo smartctl -a /dev/nvme0
performed in dom0 includes:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
…

…
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged