[qubes-users] Systemd terminating qubesd during backup?

slcoleman · October 11, 2021, 3:52pm

I seem to have an intermittent problem when my backup scripts are running late at night.

My qubesd is apparently being shut down (sent a SIGTERM signal) by systemd during my long-running backup sessions. This causes an EOF pipe-close exception; qvm-backup then receives a socket exception and immediately hits a second exception while still handling the first, so the qvm-backup process is forcibly terminated midstream. Just before the qubesd shutdown, I can clearly see that systemd had also shut down and restarted the Qubes memory manager (qubes-qmemman).

Q: What kind of background maintenance processing would actually require qubesd or the memory manager to be restarted?

Q: Could this processing be put on hold during backups?

Q: Or, how could I at least know when this maintenance is scheduled to happen so I could defer my own processing?

My scripts can certainly trap this error, erase the incomplete backup file, then loop and check for qubesd to complete its restart, and then finally restart my own backup processing, but why should this even be necessary?
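The trap-and-retry workaround described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `retry_backup`, `wait_for_unit`, and `cleanup_partial_backup` are hypothetical names, the `SYSTEMCTL` and `POLL_SECONDS` variables exist only to make the sketch adjustable, and the unit name `qubesd.service` is assumed. The cleanup hook is left empty because the location of the incomplete backup file depends on the script.

```shell
#!/bin/sh
# Sketch only: all function and variable names here are illustrative.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
POLL_SECONDS="${POLL_SECONDS:-2}"

# Hypothetical hook: replace the body with code that deletes the
# incomplete backup file left behind by a failed attempt.
cleanup_partial_backup() { :; }

# Poll until unit $1 reports active; give up after $2 checks (default 30).
wait_for_unit() {
    unit=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        "$SYSTEMCTL" is-active --quiet "$unit" && return 0
        sleep "$POLL_SECONDS"
        tries=$((tries - 1))
    done
    return 1
}

# Run the command given after $1 up to $1 times. After each failure,
# clean up, then wait for qubesd (assumed unit name) before retrying.
retry_backup() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        "$@" && return 0
        cleanup_partial_backup
        attempts=$((attempts - 1))
        if [ "$attempts" -gt 0 ]; then
            wait_for_unit qubesd.service || return 1
        fi
    done
    return 1
}
```

It would be called as, for example, `retry_backup 3 ./nightly-backup.sh`, where the script name is a placeholder for whatever invokes `qvm-backup`.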

When this happens, it's almost always during the backup of my largest VM, which can take well over 90 minutes to complete. If I can somehow block or defer this kind of system maintenance until after my backups are complete, that would be better than having to deal with trapping random restarts.

thanks,

Steve

(Attachment backup_error.txt is missing)

(Attachment journalctl.txt is missing)

marmarek · October 12, 2021, 7:31pm

> I seem to have an intermittent problem when my backup scripts are running
> late at night.

> My qubesd is apparently being shutdown (sent a sigterm signal) by systemd
> during my long running backup sessions which then causes an eof pipe close
> exception and qvm-backup then receives a socket exception and immediately
> receives a second exception while still handling the first exception, thus
> the qvm-backup process gets forcibly terminated mid stream. Just prior to
> the qubesd shutdown I can clearly see that systemd had also
> shutdown/restarted the qubes memory manager (qubes-qmemman) too.

> Q: What kind of background maintenance processing would actually require
> qubesd or the memory manager to be restarted?

I guess that's logrotate (but it isn't clear to me why qubesd too, not
just the qubes-qmemman service...).

> Q: Could this processing be put on hold during backups?

> Q: Or, how could I at least know when this maintenance is scheduled to
> happen so I could defer my own processing?

If that's indeed logrotate, see `systemctl status logrotate.timer`
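Besides `systemctl status`, the next scheduled run can also be read from `systemctl list-timers`. A minimal sketch, assuming the timer unit is named `logrotate.timer` as above; the `next_timer_run` function and the `SYSTEMCTL` and `TIMER` variables are illustrative names, not part of any Qubes tooling:

```shell
#!/bin/sh
# Sketch only: names below are illustrative.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
TIMER="${TIMER:-logrotate.timer}"

# Print the timer's next and last trigger times (the NEXT and LAST
# columns of list-timers), so a backup script can log them or decide
# whether to start now.
next_timer_run() {
    "$SYSTEMCTL" list-timers --no-pager "$TIMER"
}
```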

> My scripts can certainly trap this error, erase the incomplete backup file,
> then loop and check for qubesd to complete its restart, and then finally
> restart my own backup processing, but why should this even be necessary?

> When this happens its almost always during the backup of my largest VM which
> can take well over 90 minutes to complete. If I can somehow block/defer
> this kind of system maintenance until after my backups are complete that
> would be better than having to deal with trapping random restarts.

Again, if that's logrotate, you can stop the timer before, and restart it
afterwards.
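A rough sketch of that stop-then-restart approach, assuming logrotate really is the cause and that the timer unit is `logrotate.timer`. The `with_timer_paused` function and the `SYSTEMCTL` and `TIMER` variables are hypothetical names used only for illustration, and the sketch would need to run with enough privileges to stop and start system units:

```shell
#!/bin/sh
# Sketch only: names below are illustrative.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
TIMER="${TIMER:-logrotate.timer}"

# Stop the timer, run the given command, then start the timer again
# regardless of whether the command succeeded. Returns the command's
# exit status.
with_timer_paused() {
    "$SYSTEMCTL" stop "$TIMER" || return 1
    rc=0
    "$@" || rc=$?
    "$SYSTEMCTL" start "$TIMER"
    return "$rc"
}
```

For example, `with_timer_paused ./nightly-backup.sh`, where the script name is a placeholder. Stopping the timer only prevents new scheduled runs; it would not interrupt a logrotate run that has already started.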

--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

adw · October 12, 2021, 8:00pm

This sounds like:

github.com/QubesOS/qubes-issues

qvm-backup error: Connection to qubesd terminated, followed by traceback

opened 08:28PM - 28 Apr 19 UTC

andrewdavidwong

T: bug C: core P: major r4.1-dom0-stable

**Qubes OS version**
`R4.0`

**Affected component(s) or functionality**
…`qvm-backup`

**Brief summary**
While making a backup, `qvm-backup` fails with the message `Connection to qubesd terminated, reconnecting in 1.0 seconds`, followed by the usual VM list, followed by a traceback (see below).

**To Reproduce**
```
$ qvm-backup [...]
Making a backup... 0%
[...]
Making a backup... 63.58%app: Connection to qubesd terminated, reconnecting in 1.0 seconds
Backup error: Got empty response from qubesd. See journalctl in dom0 for details.
------------------+--------------+--------------+
               VM |         type |         size |
------------------+--------------+--------------+
[...]
------------------+--------------+--------------+
      Total size: |                    20.0 GiB |
------------------+--------------+--------------+
VMs not selected for backup:
[...]
Traceback (most recent call last):
  File "/usr/lib/python3.5/site-packages/qubesadmin/app.py", line 540, in qubesd_call
    client_socket.connect(qubesadmin.config.QUBESD_SOCKET)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/qvm-run", line 5, in <module>
    sys.exit(main())
  File "/usr/lib/python3.5/site-packages/qubesadmin/tools/qvm_run.py", line 199, in main
    args = parser.parse_args(args, app=app)
  File "/usr/lib/python3.5/site-packages/qubesadmin/tools/__init__.py", line 396, in parse_args
    action.parse_qubes_app(self, namespace)
  File "/usr/lib/python3.5/site-packages/qubesadmin/tools/__init__.py", line 170, in parse_qubes_app
    namespace.domains += [app.domains[vm_name]]
  File "/usr/lib/python3.5/site-packages/qubesadmin/app.py", line 86, in __getitem__
    if not self.app.blind_mode and item not in self:
  File "/usr/lib/python3.5/site-packages/qubesadmin/app.py", line 107, in __contains__
    self.refresh_cache()
  File "/usr/lib/python3.5/site-packages/qubesadmin/app.py", line 60, in refresh_cache
    'admin.vm.List'
  File "/usr/lib/python3.5/site-packages/qubesadmin/app.py", line 543, in qubesd_call
    'Failed to connect to qubesd service: %s', str(e))
qubesadmin.exc.QubesDaemonCommunicationError: Failed to connect to qubesd service: [Errno 2] No such file or directory
```

**Additional context**
This `qvm-backup` is run as part of a nightly script. It has worked fine with no changes almost all other nights for a long time (months to years). I made no configuration changes between this run and the previous successful run.

I agree that it's a serious bug. It makes no sense for logrotate to interrupt backups. Backups completing successfully and reliably is infinitely more important than rotating log files.

It looks like the issue has been fixed in 4.1, but I'm still experiencing it on 4.0 as well. I've just gotten in the habit of trying not to let my backups run between ~1-6am. :\