Dom0 Update Bricked All Qubes

dom0 updates came through today, so I installed them and rebooted. The updates installed successfully (at least as reported by the Qubes Update tool). I saw that there were Salt-related updates, but I didn’t note anything else.

After rebooting, I used Qubes Update to force-check for updates for debian-11, whonix-gw-16, and whonix-ws-16, which yielded nothing. I rebooted again and ran into my first issue: my AppVMs would no longer start, each failing with an error about the private storage pool already existing, or something like that (my apologies for not capturing the exact error; I see so many errors in Qubes that redoing something or rebooting almost always works, so it’s usually not worth documenting these kinds of errors).

When I rebooted again, I was disturbed to see that all of my qubes had lost their private storage. They all report a “Disk Usage” of 0 in Qube Manager, and none of them will start. When I try to start any qube (sys-net in this example), I get this error:
- Title: [Dom0] Error starting Qube!
- Message: volume qubes_dom0/vm-sys-net-private missing

I rebooted a few more times, hoping this would resolve itself (as many past errors have), but this is a persistent issue. I can’t start any qube, which means no networking and no updates to fix this. Any ideas?

Your issue seems related to vm-pool exhaustion, meaning some template update’s root volumes might have consumed more space than was actually available in the pool.

When I came across this years ago (before the dom0 pool and vm-pool were separated in 4.1), even dom0 could not boot. Now dom0 should boot, but app qubes might not. The free space widget should have warned you before exhaustion, though, so your report seems to be missing important details about what led to your issue.

Might this be related to “Logical volume "vm-sys-net-volatile" already exists in volume group "qubes_dom0"”?

In a dom0 terminal, the output of the following should give an overview of what is happening with the LVM pools, assuming you installed Qubes with the defaults; otherwise, more details about your setup are needed:

sudo vgs
sudo lvs
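
If it’s the default thin pool, the following (my usual check, assuming the stock qubes_dom0/vm-pool names) shows the fill level of the pool’s data and metadata directly; anything near 100% would point to the exhaustion I mean:

# show how full the thin pool's data and metadata are (default pool name assumed)
sudo lvs -o lv_name,lv_size,data_percent,metadata_percent qubes_dom0/vm-pool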

I haven’t received any warnings about space issues. I have a 2 TB drive and wasn’t using anywhere near that much space leading up to this issue (maybe ~100-200 GB), so I’m not sure how this could happen. I don’t have any details about what caused the issue other than what I’ve described. The only change I made was updating dom0.

I’ve seen several other threads with similar issues, but haven’t yet found a solution for my case. One interesting thing I have noticed: when I run “lvscan -a” in the terminal, “/dev/qubes_dom0/vm-pool” and almost all of the others are marked “inactive”. The only “ACTIVE” ones under /dev/qubes_dom0/ are:

  1. root-pool
  2. root
  3. swap
  4. root-pool_tmeta
  5. root-pool_tdata

This doesn’t seem right - why would all of the others be inactive? I certainly didn’t do this. When I try to activate them with “lvchange -ay qubes_dom0”, I get the following responses:

  1. Check of pool qubes_dom0/vm-pool failed (status:1). Manual repair required!
  2. Child 4439 exited abnormally
  3. Check of pool qubes_dom0/vm-pool failed (status:-1). Manual repair required!
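
For reference, from what I can tell the “manual repair” it is asking for would be a thin-pool metadata repair, presumably something along these lines (I haven’t run this yet, and my understanding is that it should only be attempted with the pool deactivated and ideally with a backup of the disk first):

# attempt to rebuild the thin pool's metadata; the pool must be inactive
sudo lvconvert --repair qubes_dom0/vm-pool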

When I type “vgs”, it lists the following for qubes_dom0:

  1. #PV: 1
  2. #LV: 81
  3. #SN: 0
  4. Attr: wz--n-
  5. VSize: <1.82t
  6. VFree: <98.26g

I’m assuming VFree is referring to there being less than 100 GB of free space in my 2 TB volume, but I don’t see how that could be possible. I don’t believe I’ve ever exceeded 20% of my disk space - at least not with my personal files. I do give some of my qubes a rather large “Private storage max size” (e.g. 200-250 GB) for temporary purposes (e.g. downloading large files, then moving them to NAS). Perhaps I’m not understanding what this does - does this actually consume this much space even if the qube isn’t using it? If so, perhaps this could be close to 2 TB.
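
If it helps, I assume the per-volume allocation (as opposed to the configured max size) can be checked with something like the following, given the default thin-provisioned vm-pool; LSize would be the virtual max size and Data% how much of it is actually written:

# list private volumes with their virtual size and actual usage
sudo lvs -o lv_name,lv_size,data_percent,pool_lv qubes_dom0 | grep private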

Interesting update:

A few weeks ago, I created a VM for torrenting and gave it a “Private storage max size” of 250 GB. Since I never ended up using it, I went ahead and deleted it to see whether this would alleviate a potential space issue. When I rebooted, everything seems to be back to normal. All LVs are now listed as “ACTIVE” via “lvscan -a”.

However, this doesn’t make any sense to me. First of all, that qube had no data in it. Even if it really was consuming 250 GB of space, this doesn’t explain why Qubes reports the following disk usage statistics:

  1. Total disk usage: 9.7%
  2. varlibqubes
    2.1 data: 33.9% (6.6 GiB/19.5 GiB)
  3. vm-pool
    3.1 data: 9.4% (163.6 GiB/1734.6 GiB)
    3.2 metadata: 18.5%

What am I missing here? How can the system be out of space until I delete a qube with a max size of 250 GB, and then report that only 9.7% of a 2 TB drive is in use once that qube is gone?

VFree is space in the volume group that hasn’t been allocated to any logical volume, kept for purposes like over-provisioning. You should actually leave at least 10% of your disk unallocated when installing Qubes, i.e. claim no more than 90% of the disk space at install time.

People are often confused by this, and I suggest temporarily installing kde-partitionmanager in dom0. There, and only there, it becomes visually clear what is actually going on with the space used by the disks, dom0, and the qubes.
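
If you’d rather not install anything in dom0, roughly the same picture is available from the command line (assuming the default layout):

# physical volumes and how much of the volume group is still unallocated
sudo pvs
sudo vgs -o vg_name,vg_size,vg_free qubes_dom0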

Can you replicate this? If so, details like the pool options selected at installation and the OS originally installed (4.1rcX, which might have affected the LVM creation that happens only at install time) would most probably be welcome in a Qubes GitHub issue for tracking, rather than here on the forum.

What I read here is a pool problem. Was the pool thin LVM or plain LVM? Was the torrent qube a standalone qube? All of those details would matter, along with the same vgscan and lvscan output from before and after the weird behavior, as well as the versions of the relevant tools.
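
For the pool type question, something like this should answer it (with the default layout, segtype reads thin-pool for vm-pool and thin for the qube volumes; plain LVM would show linear):

# show the segment type and parent pool of each logical volume
sudo lvs -o lv_name,segtype,pool_lv qubes_dom0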

All of this would get better traction in a Qubes GitHub issue, where other people could confirm that they can replicate your issue and move it forward.

I was eventually able to get this working again (I’d have to dig through my notes to say exactly how). However, this issue should probably be disregarded: I’ve since discovered that my RAM is faulty, which may have been the source of the problem.

Thanks everyone for your help.