[qubes-users] sys-firewall freezing on resume from suspend

Hi,

I have a really annoying issue with resume from suspend. On resume, sys-firewall is crashed, frozen, or otherwise unresponsive, so after every resume I have to kill and restart this VM if I want to use networking. Other qubes are fine, except that sys-whonix also freezes, but that is because it can't get a network connection through sys-firewall.

The VM is based on the default Debian 11 template without any special modifications. It has worked fine this way for years. Qubes is the latest version. Kernel in use: 4.10.112.

Symptoms:
- high reported RAM/CPU use, with CPU hovering around 10-20%
- VM terminal: shows a blank window, no input/output shown
- Xen console in dom0: no output
- does not pass network traffic from connected VMs
- stopped connected VMs can't start because vif (network connection) creation fails
- sometimes, after a shorter suspend, the VM still works, or it passes network traffic even though a terminal window still can't be opened

I've tried:
- checking the VM console and syslog, dom0 journal, dmesg, and Xen logs, both before and after suspend. None of them shows any relevant error as far as I can tell.
- creating a fresh sys-firewall VM. No change.
- switching the VM to a Fedora 35 template, fully upgraded. No change.
- checking possibly related issues on the Qubes GitHub. But those are all either fixed by updates, or about VMs with PCI devices attached, which this VM doesn't have.

What is this problem? Why does it only occur with the sys-firewall VM? Which logs should I double-check? Any suggestions welcome.

So, apparently, this is not a sys-firewall issue but a clocksync issue. To rule out other causes, I moved the clocksync service to a separate, brand new qube (named sys-clock). And voilà: sys-firewall no longer 'crashes' on resume from suspend; now sys-clock does.
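
For anyone who wants to repeat this test, here is a minimal sketch of the dom0 steps. It is not taken from the post above: the template name `debian-11` and the label `red` are assumptions, and `clockvm_commands` is a hypothetical helper. `qvm-create` and `qubes-prefs` are the standard dom0 tools. The script only prints the commands unless `--run` is passed.

```python
#!/usr/bin/env python3
"""Sketch: move the clock-sync role to a dedicated qube (run in dom0)."""
import subprocess
import sys


def clockvm_commands(name="sys-clock", template="debian-11"):
    """Return the dom0 commands as argument lists, without running them."""
    return [
        # Create a fresh AppVM; template name and label are assumptions.
        ["qvm-create", "--class", "AppVM", "--template", template,
         "--label", "red", name],
        # Point the global clockvm property at the new qube.
        ["qubes-prefs", "clockvm", name],
    ]


if __name__ == "__main__":
    for cmd in clockvm_commands():
        print(" ".join(cmd))
        # Only execute when explicitly asked to.
        if "--run" in sys.argv[1:]:
            subprocess.run(cmd, check=True)
```

Printing first and running only on request makes it easy to review the commands before they touch dom0. Moving the role back would be `qubes-prefs clockvm sys-firewall`, or whichever qube held it before.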

The cause is probably somewhere in some logfile, but with so many moving parts, Qubes really needs a better debugging howto. With relatively many minor bugs like this, bug fixing takes too much time. I don't mind spending some time fixing bugs, but lately it is really becoming too much, to the extent that I am considering switching back to an easier regular Linux distro. I have been a paid Linux sysadmin, not a total expert, but being one is not a requirement for using Qubes either. I should be able to diagnose bugs on my own laptop (and contribute to the project by properly reporting them).

This should probably be filed as an issue:

Someone else filed an issue where this was solved for me: "Error when shutting down whonix templates and Qube Manager becomes unresponsive (Failed to shutdown domain '18' with libxenlight)" (Issue #7510, QubesOS/qubes-issues on GitHub). Briefly put:

Manually applying the patch from "Properly suspend all VMs, not only those with PCI devices" by marmarek (Pull Request #473, QubesOS/qubes-core-admin on GitHub) to dom0:/usr/lib/python3.8/site-packages/qubes/vm/qubesvm.py and then restarting seems to have solved the issue. The clock syncing trouble after suspend also seems to have improved. So this was a suspend issue, not a clock or firewall issue.

This should come to dom0 as an update soon, I guess.

Indeed, you should be able to. The fact that you cannot is itself a
bug. Please report it.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

"Properly suspend all VMs, not only those with PCI devices" by marmarek (Pull Request #473, QubesOS/qubes-core-admin on GitHub) will (hopefully)
fix this.

Yes, that is it! "Error when shutting down whonix templates and Qube Manager becomes unresponsive (Failed to shutdown domain '18' with libxenlight)" (Issue #7510, QubesOS/qubes-issues on GitHub)

To avoid cluttering the issues list, and to make this a little more actionable, let's first discuss it here.

What I need is a little more help with fixing or adequately diagnosing bugs, as a sysadmin-level person who is not a programmer or a Xen or Qubes expert. As said, the goal is to be able to fix, diagnose, and report bugs and other issues better. For instance, a list of the logfiles that Qubes/Xen add on top of standard Fedora would be helpful. Just a list, with no further explanation of how to use logfiles. I don't have more ideas currently, but there probably are more.
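
As a starting point for such a list, here is a rough sketch that prints the dom0 log locations I would check first for a given qube and whether each one exists. The paths are how I remember a Qubes 4.x dom0 layout and may differ between releases, so treat them as assumptions. `candidate_logs` is a hypothetical helper, not part of any Qubes tool.

```python
#!/usr/bin/env python3
"""Sketch: list the usual Qubes/Xen log files in dom0 for one qube."""
import os
import sys


def candidate_logs(qube):
    """Return dom0 log paths that commonly hold per-qube or Xen output.

    The paths are assumptions based on a Qubes 4.x layout.
    """
    return [
        f"/var/log/xen/console/guest-{qube}.log",   # qube console output
        "/var/log/xen/console/hypervisor.log",      # Xen hypervisor messages
        f"/var/log/libvirt/libxl/{qube}.log",       # libxl log for this qube
        "/var/log/libvirt/libxl/libxl-driver.log",  # libxl driver, all qubes
        f"/var/log/qubes/qrexec.{qube}.log",        # qrexec daemon
        f"/var/log/qubes/guid.{qube}.log",          # GUI daemon
        f"/var/log/qubes/qubesdb.{qube}.log",       # QubesDB daemon
    ]


if __name__ == "__main__":
    # Default to the qube discussed in this thread.
    qube = sys.argv[1] if len(sys.argv) > 1 else "sys-firewall"
    for path in candidate_logs(qube):
        status = "found" if os.path.exists(path) else "missing"
        print(f"{status:8} {path}")
```

The dom0 journal (`journalctl -b`) and `dmesg` are not files in this sense, so they are left out; they are still worth checking around the time of the suspend.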

What worries me a little bit is that documentation like this might encourage less skilled people to start doing things above their level of ability (although it is also a good start towards becoming more skilled). In the case of logfiles, for example, that could mean cluttering communication channels with irrelevant information. So it should come with a clear warning.

Suggestions (or critique) welcome.