[qubes-users] R4.0.4 RC1 Unable to delete or backup certain qubes

I installed R4.0.4 RC1 and have been having some very odd issues with just a few of the VMs I restored from backups, so I have been restoring, cloning, testing, and deleting clones while trying to figure a few things out.

The first problem I was originally chasing was why one VM in particular never completes a backup: it just hangs at 0% while the backup file grows only to about 40 KB (the header info?). That specific VM starts and runs just fine but won’t complete a backup. A clone of it runs fine as well.

At present, I now have a few VMs that I am unable to delete for some reason. Both the command-line tool and Qube Manager fail to remove these VMs. From the command line I get a “qvm-remove: error: Got empty response from qubesd.” message. In the journalctl output I see a “domain not found” error, after which it hits a second exception and terminates.

I’m guessing that these two problems are likely related, but I’m not sure how. I’m guessing there is something wrong with qubesd. I have attached the relevant logs for the delete problem.

Any ideas?

thanks,

Steve Coleman

(Attachment journalctl.txt is missing)

The backup not completing can occur if the VM is online, and you are not
using LVM.

Mike.

Well, the thin pool is LVM, but if the VM is offline, there should not be a problem. Guess you'll have to investigate all the logs you can
find.

Mike.

Mike_Keehan, November 14:

Well, the thin pool is LVM, but if the VM is offline, there should not be a problem. Guess you’ll have to investigate all the logs you can
find.

I finally have the answer!

Thankfully this problem has nothing to do with R4.0.4; it was a brand-new disk drive failing (MTBF <= 5.2 days, likely earlier) in a rather odd way. What had me stumped was why the VMs would seem to run fine yet completely hang the backup process while it read the exact same volumes. It turns out that all the VMs that were acting oddly were allocated on the same physical drive, but nothing ever reported any kind of error when reading it. It was likely the per-VM metadata needed by the backup system that failed first.

Fortunately, the drive’s built-in SMART log holds the records for the last four errors, which can be easily checked, and this allowed me to identify which physical drive needed to be yanked and replaced. Being a brand-new system, I did not yet know which logical drive mapped to which physical drive. To analyse the problem I used a “smartctl” tool variant on another system to read the logs that are stored physically within the drive.
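For anyone chasing the same symptom, this kind of check is easy to script, assuming smartmontools is installed. A minimal sketch; the device name in the comments and the attribute table below are illustrative samples, not output from a real drive:

```shell
#!/bin/sh
# The drive's on-board SMART state can be read directly (example device):
#   smartctl -H /dev/sda        # overall health self-assessment
#   smartctl -l error /dev/sda  # the drive's internal error log
#   smartctl -A /dev/sda        # vendor attribute table
#
# Parsing the attribute table lets a script flag trouble early.
# The table below is a made-up sample of `smartctl -A` output.
sample='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8'

# A nonzero Current_Pending_Sector raw value means sectors the drive
# could not read reliably -- a classic early-failure sign.
pending=$(printf '%s\n' "$sample" | awk '$2 == "Current_Pending_Sector" { print $NF }')
if [ "${pending:-0}" -gt 0 ]; then
    echo "WARNING: $pending pending (unreadable) sectors"
fi
```

Run against a live system, the `sample=` assignment would be replaced by `smartctl -A` output for each drive.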

Since checking each drive in this way is relatively efficient and easy, it seems to me that there must be an automated way to check these error logs and notify the user when a drive is starting to fail. My Qubes system was completely silent, and it was only the odd behaviour of the backup system that forced me to investigate. If the backup process didn’t just hang, then all my future backups could have been trash, and I would not even have noticed the issue until it was too late. Why wait until the system is completely unusable?

So, my question to the Qubes community is: has anyone out there set up this kind of SMART disk checkup on Qubes? What are the best tools for a quick check, say upon each boot, or one that could easily be put in cron for a periodic/daily go/no-go health check?
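For reference, smartmontools ships a daemon, smartd, that appears designed for exactly this kind of unattended monitoring. A minimal /etc/smartd.conf sketch; the mail target is just the default root account, adjust to taste:

```
# /etc/smartd.conf -- sketch; scan all drives and, on each poll,
# check the health status (-H), new error-log entries (-l error),
# and failing usage attributes (-f); mail root if anything trips.
DEVICESCAN -H -l error -f -m root
```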

Thanks,

Steve

Hi

Since checking each drive in this way is relatively efficient and easy
it seems to me that there must be an automated way to check these error
logs and notify the user when a drive is starting to fail. My Qubes
system was completely silent and it was only because of the odd
behaviour of the backup system that I was forced to investigate. If the
backup process didn't just hang then all my future backups could have
been trash, and I would have not even noticed the issue until it was too
late. Why wait until the system is completely unusable?

So, my question to the Qubes community is, has anyone out there set up
this kind of "smart" disk check up on Qubes? What are the best tools for
a quick check, say upon each boot, or one that could easily be put in
cron for a periodic/daily go-no-go health check?

I would personally recommend btrfs, especially if you have an SSD.
Although it costs some performance, you get more reliable data
consistency, and you can check all your data just by running
"btrfs scrub start /" (or "btrfs scrub start / -c Idle" if you don't
want it to slow your system down too much while it works).
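If you want the scrub to run unattended, a root crontab entry along these lines would do it (the schedule here is just an example):

```
# crontab -e (as root): scrub every Sunday at 03:00; -B keeps the
# scrub in the foreground so the job finishes with its real result,
# then a summary of errors found is printed to the cron mail.
0 3 * * 0  /usr/bin/btrfs scrub start -B / && /usr/bin/btrfs scrub status /
```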

I was also using btrfs send/receive, but ultimately there seems to be
some problem that causes a CPU bottleneck with my big non-SSD hard
disk. The main SSD still works fine, but I am thinking about another
approach for incremental backups.

I would like to experiment with borg so I could do backups at the file
level, gaining some nice features like ignoring certain paths (e.g.
'~/.cache'), restoring a single file without decompressing/decrypting
the whole image, and deduplication.
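A sketch of what that borg workflow could look like, assuming borg is installed; the repository path and archive name are hypothetical, and the actual borg invocations are left commented out so nothing touches a real repo:

```shell
#!/bin/sh
# Hypothetical repository location -- pick your own path.
REPO=/mnt/backup/borg-repo

# One-time setup: an encrypted repository (deduplication is built in):
#   borg init --encryption=repokey "$REPO"
#
# Daily archive of $HOME, skipping caches:
#   borg create --stats --exclude "$HOME/.cache" \
#       "$REPO::{hostname}-{now}" "$HOME"
#
# Restore one file without unpacking the whole archive:
#   borg extract "$REPO::myhost-2020-11-14" home/user/somefile
#
# Thin out old archives on a schedule:
#   borg prune --keep-daily 7 --keep-weekly 4 "$REPO"

echo "borg repo would live at $REPO"
```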

Regards.