Unable to start any qube after forcefully canceling a wyng backup

Hi guys

I recently tried to start using wyng to back up some qubes.
I tried to send with sudo wyng --dedup --tag=init send.
I initially got a Python error because of the tag:

  Traceback (most recent call last):
    File "/usr/local/bin/wyng", line 2977, in <module>
      monitor_send(datavols, selected_vols, monitor_only=False)
    File "/usr/local/bin/wyng", line 1786, in monitor_send
      tag = aset.Volume.Ses.tag_parse(tag_opt, delim=",")
  TypeError: tag_parse() missing 1 required positional argument: 'tag'

I tried to send again without the tag flag.
I foolishly canceled the send because I wanted to do something else first.

When I started sudo wyng --dedup send again, I got this error:
  Preparing snapshots...
  Pairing snapshot for vm-and-dvm-private
  Traceback (most recent call last):
    File "/usr/local/bin/wyng", line 2977, in <module>
      monitor_send(datavols, selected_vols, monitor_only=False)
    File "/usr/local/bin/wyng", line 1761, in monitor_send
      = prepare_snapshots(selected if len(selected) >0 else datavols)
    File "/usr/local/bin/wyng", line 1283, in prepare_snapshots
      do_exec([[CP.lvm, "lvcreate", "-pr", "-kn", "-ay", "--addtag=wyng"] + tags
    File "/usr/local/bin/wyng", line 1154, in do_exec
      raise subprocess.CalledProcessError(err.returncode, err.args)
  subprocess.CalledProcessError: Command '['/sbin/lvm', 'lvcreate', '-pr', '-kn', '-ay', '--addtag=wyng', '--addtag=arch-71cb2590-3f10-40', '-s', 'qubes_dom0/vm-and-dvm-private', '-n', 'vm-and-dvm-private.tock']' returned non-zero exit status 5.

I checked /dev/qubes_dom0, and there seem to be some .tock volumes that I assume wyng created.
/tmp/wyng/err.log also says 'vm-blablabla-private.tock' already exists.

Since the .tock volume names seemed deterministic, I just commented out this part of wyng:
do_exec([[CP.lvm, "lvcreate", "-pr", "-kn", "-ay", "--addtag=wyng"] + tags + ["--addtag=arch-"+aset.uuid, "-s", vgname+"/"+datavol, "-n",snap2vol]])
I cleaned up my wyng: sudo wyng --clean delete
Then I ran this command, figuring that wyng would fix everything:
ls /dev/qubes_dom0/ | grep -vPe '-(back|snap)$|^(root|swap)$|.tock$' | xargs -L1 sudo wyng add
I then uncommented the line in wyng again and added the rest of the volumes.
Adding went fine, but when I tried to send with sudo wyng --dedup send, I got this error:
  Pairing snapshot for vm-wws16-dl-dvm-private
  Pairing snapshot for vm-wws16-dl-private
  Pairing snapshot for vm-wws16-dl-root

  Sending backup session 20221028-093515 to qubes://usb-guest-disp
  Send volume | vm-andoff-dvm-private
  Traceback (most recent call last):
    File "/usr/local/bin/wyng", line 2977, in <module>
      monitor_send(datavols, selected_vols, monitor_only=False)
    File "/usr/local/bin/wyng", line 1822, in monitor_send
      dnew, dfreed = send_volume(datavol, localtime, ses_tags, send_all=datavol in send_alls)
    File "/usr/local/bin/wyng", line 1482, in send_volume
      with open(pjoin("/dev",aset.vgname,snap2vol),"rb", buffering=chunksize) as vf, \
  FileNotFoundError: [Errno 2] No such file or directory: '/dev/qubes_dom0/vm-andoff-dvm-private.tock'

The .tock files are no longer in /dev/qubes_dom0/, and sudo lvscan shows a bunch of inactive .tock volumes, one for every regular -root/-private volume.
I don't know when this happened, but I can't start any qube anymore; I always get this error:

  Thin pool qubes_dom0-vm--pool-tpool (253:8) transaction_id is 9379, while expected 9380.
  Failed to suspend qubes_dom0/vm-pool with queued messages.

I haven't tried any lv* commands yet out of fear. I also haven't restarted or shut any qubes down.
Can I fix this? If not, will a regular qvm-backup backup created after the issue be fine to restore from?


I ended up restoring a backup taken after the issue appeared, which worked.

If it needs to be said, a backup tool should never potentially compromise data.

I never presume to speak for the Qubes team.
When I comment in the Forum or in the mailing lists I speak for myself.

@tasket

I switched to Wyng a couple of days ago and ran into the same issue. This broke my system.

What happened: I was running wyng ... send. It was doing a full rescan even though I didn't specify "--remap" and had successfully backed up minutes before. I killed it with Ctrl+C without thinking.

Following that, qubes can't be started anymore:

  Thin pool qubes_dom0-vm--pool-tpool (253:8) transaction_id is 16242, while expected 16243.
  Failed to suspend qubes_dom0/vm-pool with queued messages.

The same error prevents further wyng send operations on vm-pool. It would be nice if you could help fix this: that's my daily machine and I need it for work.

Just to be sure: did you just wyng ... receive a previous snapshot? I assume you meant "restoring a backup taken before the issue appeared".

@fujack @FudgingDonut

This type of failure seems endemic on Qubes and stems from Thin LVM storage metadata becoming overtaxed and/or corrupt. It can be triggered by using Qubes or Wyng, since they both create/delete snapshots as part of their normal operation.

The severity of the problem can also vary, depending on whether any of the thin volumes become stuck in a read-only or 'inactive' state… if this happens to the dom0 'root' volume then the system may become un-bootable.

A couple months back I was helping a Qubes (but not Wyng) user I knew from social media with this issue, as I've experienced it a number of times myself. I will say that deleting snapshots sometimes helps and may even resolve the problem, but I've seen cases where running lvconvert --repair and lvextend --poolmetadatasize was necessary, as well as cases where the LVM pool was too far gone to recover. When you delete snapshots, keep in mind there are Wyng snapshots (which end in '.tick' and '.tock') in addition to the Qubes snapshots, so you'll want to delete those as well.

On Edit: For completeness, I used a different workaround back in 2019, but I'd recommend trying that only after trying snapshot deletion and other methods in the issue.
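For reference, a minimal sketch of the repair command mentioned above, assuming the default Qubes volume group and pool names (qubes_dom0/vm-pool); treat it as illustrative, not a recipe:

# The pool must be inactive for --repair, so this is usually run from a
# rescue/live environment, or with all qubes shut down and the pool deactivated.
sudo lvconvert --repair qubes_dom0/vm-pool
# A successful repair swaps in rebuilt metadata and leaves the old copy behind
# as vm-pool_meta0; remove that spare copy only after everything checks out.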

There is also the question of something called a "metadata snapshot" still being present, and you would want to delete that, too. Simply running Wyng again can remove it automatically, but the specific Linux command to remove the metadata snapshot for the Qubes default pool is: sudo dmsetup message qubes_dom0-vm--pool-tpool 0 release_metadata_snap. (If there was no meta snap to remove it will report "failed: Invalid argument", which is fine.)
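A way to check whether a metadata snapshot is actually being held before (or after) running that command, assuming the same default pool device name:

# The thin-pool status line includes a "held metadata root" field,
# which reads "-" when no metadata snapshot is being held.
sudo dmsetup status qubes_dom0-vm--pool-tpool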


My own practice for avoiding these errors in the first place has been to make sure plenty of space is available in the Thin LVM pool, including space for metadata beyond what Qubes provides in its initial configuration. Another possible tactic would be to keep running wyng monitor on a regular basis, as doing so reduces both data and metadata use. Lately, I've also been recommending installing Qubes with Btrfs instead of Thin LVM to improve overall reliability; my daily driver system has been running on Btrfs since March without problems.
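A quick way to watch those numbers, assuming the default qubes_dom0/vm-pool names:

# Shows data and metadata usage percentages for the thin pool
sudo lvs -o lv_name,lv_size,data_percent,metadata_percent qubes_dom0/vm-pool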

off-topic, personal response to unman's post

I would like to know why someone would not mention resources for resolving an issue they themselves experienced just weeks earlier. It's clear that situation had nothing to do with Wyng. @marmarek

Ok, so that was a weird ride. First of all thank you for the fast & complete answer: immensely helpful.

I was not able to remove any snapshot or run lvconvert --repair because of that same error:

  Thin pool qubes_dom0-vm--pool-tpool (253:8) transaction_id is 16242, while expected 16243.
  Failed to suspend qubes_dom0/vm-pool with queued messages.

What I did was close everything and reboot. Then I had *-snap and *-volatile volumes, as well as dispXXXX ones, for everything that was open when the error first happened. Because I couldn't do anything else, I tried the risky way:

    # vgcfgbackup -f /tmp/poolcfg qubes_dom0
    # cp /tmp/poolcfg /tmp/poolcfg.bak # backup the file before making changes
    # vim /tmp/poolcfg # remove the message1 section
    # vgcfgrestore -f /tmp/poolcfg qubes_dom0 --force

However, removing the message1 section wasn't enough: I had to tweak the transaction ID too. After that I could remove the snapshots, reboot, and everything worked fine, however…
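For anyone attempting the same thing: both transaction IDs can be read back before editing, so you know what value LVM expects (a sketch assuming the default Qubes names; /tmp/poolcfg is the file from the commands above):

# Value recorded in the exported VG metadata
grep -n transaction_id /tmp/poolcfg
# Value the kernel target currently reports (first field after "thin-pool"
# in the status line)
sudo dmsetup status qubes_dom0-vm--pool-tpool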


When I tried to back up again, the wyng send -a command seemingly failed on the same partition I had Ctrl-C'd on. I tried to launch a VM to ensure the system was not broken again: all good.

Now for the weird part: I tried wyng send -a again, and I was back to the wrong transaction ID error, unable to launch VMs, and so on. So I fixed it once more, the same way. Also, this time I tried to remove the metadata snapshot, which I forgot the first time, but there was nothing to remove (I got the "failed: Invalid argument" error).

Now I have a functional system and all VMs seem to boot, but I didn't try wyng again because I guess it'd break the system once more. Any clue what I should do from here?

Wyng creates a series of volume snapshots along with a metadata snapshot, and it tends to do this in the same pattern each time you run wyng send, since it sorts the volume list by name. So there is likely a lingering inconsistency/corruption in your LVM thin pool that is triggered by this specific stress pattern.

What is happening under the hood is Wyng checking whether certain snapshots exist and whether they are fresh or stale. It then deletes stale snapshots and creates fresh snapshots for any volumes that don't already have one. It was during this procedure that the OP hit Ctrl-C, the effect of which is to stop Wyng, which then deletes all of its tmp files before exiting; the Ctrl-C SIGINT might also propagate to the lvcreate or lvremove subprocess, in which case LVM should handle the event gracefully. The metadata snapshot doesn't enter the picture until the next 'acquiring deltas' phase, which is very brief, so it probably isn't a factor.
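In terms of plain LVM commands, that snapshot-refresh phase boils down to roughly the following (the volume names here are hypothetical; the flags mirror the lvcreate call quoted earlier in the thread):

# Drop a stale wyng snapshot...
sudo lvremove -y qubes_dom0/vm-work-private.tock
# ...and re-create it: read-only (-pr), activation-skip off (-kn), activated (-ay),
# and tagged so wyng can find it again later
sudo lvcreate -pr -kn -ay --addtag=wyng -s qubes_dom0/vm-work-private -n vm-work-private.tock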

The steps I would take next… Check the amount of free space in the pool metadata volume (possibly expand it so there is 80% or more free), check again that the metadata snapshot is removed, and then delete all of wyng's tick/tock snapshots (this poses no risk of data loss, it just makes the next backup session somewhat slow). Finally, I would take the LVM pool offline and use thin_check to see if it detects any problems.
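A sketch of the snapshot-cleanup step, assuming the default qubes_dom0 volume group (double-check the list before piping it into lvremove):

# Wyng tags its snapshots "wyng", so they can be listed by tag or by name suffix
sudo lvs @wyng
sudo lvs --noheadings -o lv_name qubes_dom0 | awk '/\.(tick|tock)$/ {print $1}' \
    | xargs -r -I{} sudo lvremove -y qubes_dom0/{}
# Note: thin_check itself must run with the pool inactive, since it works
# directly on the pool's metadata device.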

In the long term, moving to Btrfs is recommended. A storage system should not rely on workarounds like "delete some volumes then reboot" to retrieve it from a quasi-state that it itself doesn't recognize as a failure mode. There has been no improvement over time; it's always been like this AFAICT. No major OS installs on Thin LVM by default in my experience, and Qubes seems to be the odd one out in this respect. As much as I liked the idea of Thin LVM when Qubes switched to it, I couldn't recommend it at this stage.


So, what I ended up doing was deleting my previous wyng backups and the /var/lib/wyng directory, and starting over. This went fine, so I suspect wyng was doing something specific that triggered the LVM failure.

However, I can't tell whether the issue was related to a lack of space. IIRC the highest usage ratio I've seen for the pool metadata was 79%. I've also read somewhere that Qubes OS keeps empty space on the disk to extend partitions on the fly whenever needed, and I still have plenty of that. It could be that this specific Qubes OS mechanism didn't work as intended and triggered the LVM failure.

Anyway, I plan to switch to Btrfs along with 4.2. I agree that something like LVM should not fail that way. If I understand correctly, I'll be able to load my LVM wyng backup on the Btrfs partition?

Yes, I would say 79% usage for the metadata volume is very concerning/risky. The basic issue here is that metadata use can fluctuate much more quickly than data use when complex operations are causing fragmentation, even if only momentarily. Therefore, I stopped following my data use % as an indicator and got better results by keeping metadata use well below 50%. You can do this with lvextend --poolmetadatasize <size> qubes_dom0/vm-pool. In your case I would at least double the current size. You might need to free up space in the volume group before you can lvextend; in that case what I've done is temporarily turn off dom0 swap with swapoff, then use lvresize to shrink the swap lv a bit, and finally use mkswap to reformat the swap lv and swapon -a.
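Put together as commands, with the default Qubes names assumed and the sizes as placeholders to adjust:

# Grow the pool's metadata LV (relative sizes work)
sudo lvextend --poolmetadatasize +512M qubes_dom0/vm-pool

# If the volume group is full, reclaim a little space from dom0 swap first
sudo swapoff -a
sudo lvresize -L -1G qubes_dom0/swap    # confirm the shrink prompt
sudo mkswap /dev/qubes_dom0/swap
sudo swapon -a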

I'm not aware of an auto-resize in Qubes. I do recall they were very concerned that some people were filling up their pool data, leaving the dom0 root fs unable to function and often making Qubes un-bootable. What they did in 4.1 was move dom0 root to a separate dom0-only pool so that failure can't happen again.

If I understand correctly, I'll be able to load my LVM wyng backup on the Btrfs partition?

Wyng will restore volumes that originated on LVM just fine on Btrfs or another fs, but it will be up to you to rename them appropriately under the new setup, because you will have pathless LVM names like "vm-work-private". There are different ways you can do this, such as renaming the archive vols to the right path before doing the receive. For example: wyng <dest> rename vm-work-private appvms/work/private.img, and then: wyng <dest> --local=/var/lib/qubes receive appvms/work/private.img. (The wyng-util-qubes wrapper currently in development will be able to handle this pathname difference automatically.)
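Laid out as separate steps (the archive URL <dest> and the volume names are just the placeholders from the example above):

wyng <dest> rename vm-work-private appvms/work/private.img
wyng <dest> --local=/var/lib/qubes receive appvms/work/private.img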

Also note that in order to be able to wyng send from Btrfs the local path should be a subvolume. This is how I set it up (using the Qubes default pool as an example):

cd /var/lib
qvm-shutdown --all --force --wait
sudo mv qubes qubes-old
sudo btrfs subvolume create qubes
sudo mv qubes-old/* qubes

Of course, there are different ways to set up Qubes itself, so your main VM pool may have a different path than '/var/lib/qubes'.
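If unsure, the pool configuration can be queried from dom0; a sketch, assuming the stock file-reflink pool is named varlibqubes (the name and available subcommands may differ between Qubes releases):

qvm-pool                     # list the storage pools and their drivers
qvm-pool info varlibqubes    # for file-based pools this includes a dir_path entry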


Hello,
Thank you again for this precise answer. I will do this once I upgrade to Qubes 4.2.
Here's where I read about the auto-resize feature: Why doesn't Qubes 4.1 use full disk space (LVM)? - #3 by marmarek