Qubes lack connectivity based on how they are started

Sven · January 9, 2023, 8:27pm

Some of my qubes suddenly have no connectivity. DNS works, accessing local admin panel via HTTP works but no external web / signal.

Previous version of this post

This is a mystery to me. I have several qubes all based on the same ‘web’ template:

work-web (HVM)
private-web (PVH)
banking (PVH)
medial (PVH)
online-dvm (PVH)
online (named dispvm based on online-dvm)

Yesterday they all worked. I then changed my dom0 memory from 2GB back to 4GB (the default). At first, none of them except the DispVM had external network access. DNS lookups and access to the internal web admin panel of my sys-pi-hole worked, but there was no external routing.

Then I rebooted once more and work-web works again as before. So does ‘online’ but all the others don’t.

There have been no changes in settings. Other network connected qubes (e.g. mail template based) also work.

What’s going on?

Then I rebooted one more time.

Sven · January 9, 2023, 8:49pm

More of the same and ultimately irrelevant

This is wild. I rebooted once more and ‘work-web’ stopped working again. Then I changed dom0 memory back to 2GB. No change.

Throughout all of this, the dispVM (same template, same netvm) keeps working.

Sven · January 9, 2023, 9:19pm

Ok, it’s a timing thing and has nothing to do with my dom0 memory allocation.

If I start the respective qubes rapidly after each other using Whiskermenu, they end up having no connectivity beyond DNS.

If I start them as usual with a bash script linked to a hotkey, they all work. The bash script looks like this:

qvm-run qube1 app1 &
qvm-run qube2 app2 &
[etc]

sleep 60

I added the sleep a long time ago, as otherwise not all qvm-run’s seemed to execute.
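
A minimal sketch of a more sequential variant of the script above, assuming the problem is triggered by several qubes starting at the same moment (nothing in this thread confirms that). It starts each qube with `qvm-start --skip-if-running`, waits for that to return, and only then launches the app with `qvm-run`. The qube and app names are the placeholders from the original script; `start_in_order`, `QVM_START` and `QVM_RUN` are hypothetical names introduced here for illustration.

```shell
#!/bin/bash
# Sketch only: start qubes one after another instead of all at once.
# QVM_START / QVM_RUN are hypothetical overrides (handy for a dry run).
QVM_START="${QVM_START:-qvm-start}"
QVM_RUN="${QVM_RUN:-qvm-run}"

# Usage: start_in_order qube1:app1 qube2:app2 ...
start_in_order() {
    local pair qube app
    for pair in "$@"; do
        qube="${pair%%:*}"   # text before the first colon
        app="${pair#*:}"     # text after the first colon
        # Wait for the qube to be started (no error if it already runs).
        if ! "$QVM_START" --skip-if-running "$qube"; then
            echo "failed to start $qube" >&2
            continue
        fi
        # Launch the app in the background; the qube is already up.
        "$QVM_RUN" "$qube" "$app" &
    done
    # Keep the script alive until the backgrounded qvm-run calls return.
    wait
}

# Example (in dom0): start_in_order qube1:app1 qube2:app2
```

The final `wait` is meant to play the role of the `sleep 60`: it keeps the script from exiting while the background `qvm-run` calls are still in flight, without guessing a fixed delay.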

I clearly don’t know what I am doing. From my subjective user perspective, Qubes OS R4.0.2 was rock stable, and I’ve never reached that level of confidence with R4.1.

If anyone has an idea what I am doing wrong, I’d love to be educated.

Note: edited the title of this thread to “Qubes lack connectivity based on how they are started” and edited previous posts for clarity.

enmus · January 10, 2023, 12:59am

Probably irrelevant: people here mostly have issues when using non-fedoras.
I never succeeded in making my system run smoothly with Debian. I even gave up on cacher (just too many issues) and never could make Qubes get along with Kali, Gentoo, or CentOS, so I completely transitioned back to Fedora. No freezes, no crashes, and to me it looks like it has something to do with Debian when recalling the notorious topic.
Besides Fedora, my second most stable is Windows (standalone and templates+dispvms).
Not only that, but I’m almost all on fedora-37-minimal. More than satisfied.

balko · January 10, 2023, 6:09pm

Maybe you should also create an issue on GitHub to increase the chances of this issue being diagnosed and fixed.

renehoj · January 10, 2023, 8:38pm

Does the vif get created in the netvm?
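
One way to check this, as a rough sketch: inside the netvm, list the interfaces whose names start with `vif` and look at their state. This assumes the netvm has iproute2 with brief output (`ip -br link`); `list_vifs` is a hypothetical helper name, not a Qubes tool.

```shell
# Sketch: list the vif* backend interfaces seen in a netvm.
# Reads `ip -br link` output on stdin and prints "name state" per vif.
list_vifs() {
    awk '$1 ~ /^vif/ { print $1, $2 }'
}

# Example, run inside the netvm (e.g. sys-firewall):
#   ip -br link | list_vifs
```

If a qube that has no connectivity does not show up in that list, the problem is on the interface side; if it does show up, routing or firewall rules are the next thing to look at.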

cayce · January 10, 2023, 10:50pm

This behavior seems to be a running joke (punny, right?) for many users. I can’t imagine what’s going wrong with so many users’ systems, but I cannot recommend strongly enough that you rid yourself of all the bloated sys-<xyz> qubes and rebuild them using minimal templates. You don’t have to thank me later, but you will thank yourself.

FWIW, 4.1 has been an even better ride here than 4.0, using debian templates.

Sven · January 10, 2023, 11:52pm

@enmus I don’t believe this has anything to do with fedora/debian. I’ve been using debian-minimal based templates/qubes for years now and have never seen anything like this. Also, as I pointed out, this issue manifested overnight. No updates installed, no software installed, and the only config change I made within weeks of this happening was changing the memory assigned to dom0 the evening before it started.

@balko if I have a better understanding of what’s happening, I certainly will. I doubt the devs need another issue from a user saying “it stopped working and I didn’t do anything”.

@renehoj yes, the vif gets created, and the issue does not affect all the qubes connected to the netvm, only the ones based on my debian-11-web template (minimal based). I am right now rebuilding my templates via the bash scripts I talked about in other threads, and once that’s done I will create new qubes based on them and manually move app data over. If that fixes it, maybe I’ll create a GitHub issue … but even then I don’t know what steps I would list to reproduce the issue.

@cayce are you saying you have seen other reports of this issue happening? Can you point me to some please? All my qubes/templates are debian-minimal based.

My statement about R4.1/R4.0 needs context. I am running a T430 that is identical to what is listed as ‘certified’ by the project. It has been a dream of performance, stability and functionality under R4.0. With R4.1, more and more issues seem to pop up that didn’t exist for me before. Maybe the Xen project doesn’t pay much attention to Ivy Bridge based systems anymore? X230s and T430s have been the go-to machines for a long time, but maybe that’s not the case anymore.

enmus · January 11, 2023, 12:02am

There’s a first time for everything, well yes. I’m on Ivy Bridge too, but on fedora-37-minimal with all hardware (2 to go to full: sys-audio and split-browser). No crashes, freezes, weird things… As stable as dreamed.

So, try your setup on fedora minimal and check if reproducible…

cayce · January 11, 2023, 1:45am

All as in your AppVM guest qubes? Or, all as in ALL (sys-net, sys-usb, sys-firewall, sys-vpn, sys-dns, sys-ips, sys-audio) your qubes?

To be clear, I’m suggesting that it would be a good idea for anyone with annoying timing issues to rebuild service qubes based on minimal templates.

renehoj · January 11, 2023, 7:23am

I can’t reproduce it on my system

Installed a fresh minimal template and added the networking-agent and firefox to the template, and then I made 5 test qubes with the template.

I tried using the whiskermenu and starting firefox as fast as possible, all 5 qubes worked. I tried doing the same from the terminal with 5 terminals open and the command ready to execute, same result all 5 qubes worked. Tested both methods a second time with the same results.

I am using a reasonably fast desktop CPU; if this is a race condition, it might not trigger on my system.

Sven · January 11, 2023, 4:48pm

@cayce all as in all … there are only debian minimal based templates/qubes. I also documented how I create them in this forum and on my website (unfinished draft).

@enmus very interesting. I’ll give this a try soon.

BEBF738VD · January 11, 2023, 5:08pm

What kernel are you running on dom0 and vms?

I’ve also been running an all-debian minimal setup (kernel-latest for dom0 and 5.10 for vms) for months now without a single issue, so the situation you’ve described is indeed weird.

Sven · January 11, 2023, 5:23pm

@bebf738vd dom0 and all qubes run 5.15.81-1.fc32.qubes.x86_64

New templates did not solve the issue; I am now recreating qubes and moving app data manually. If that doesn’t fix it, I will take @enmus’s advice and (temporarily?) switch my stuff over to fedora-minimal based templates.

cayce · January 11, 2023, 5:38pm

I’m in the same boat as @BEBF738VD … Indeed sounds very strange; especially with the t430.

Is it possible some update/policy was pushed to your upstream router? Have you the same experience when leveraging an alternate uplink? Might be worth heading to a local cafe to see if the problem persists.

Outside of this, have you grepped your logs for any sign of the hardware issues/failure?

Reaching, I know ..

Have any animals in the household? Maybe open the case up and hit it with some compressed air?

enmus · January 11, 2023, 9:00pm

When you decide this, you can PM me and I can send you my notes on creating different templates and use cases. They look for example like:

fedora-37-min-sys-usb-xHCI-template
--------------------------------
mlocate qubes-input-proxy-sender qubes-usb-proxy usbutils 
Maybe it is needed
[user@dom0 ~]$ qvm-pci attach --persistent --option permissive=true sys-usb dom0:00_14.0

fedora-37-min-sys-firewall-template
-------------------------------
iproute iptables-legacy iptables-legacy-libs iptables-libs nftables qubes-core-agent-dom0-updates qubes-core-agent-networking tinyproxy

etc…

Sven · January 11, 2023, 11:29pm

Ok, here are all the things I did:

recreated all templates based on debian-11-minimal
recreated all qubes from scratch based on the new templates and then manually imported the respective settings and data
recreated system and web template based on fedora-36-minimal

In all cases I get the same behavior:

the dispvm always has connectivity
the mail qubes always have connectivity
the other web qubes sometimes have, sometimes don’t … in two cases they lost connectivity while running (other qubes remained online)

I see no hints in dom0 logs.

I cannot overstate how stressful this is. This machine has been my daily driver for a long time. My setup is stable-stable. No tweaking, not even installing new apps. I’ve been using it the way it was for months.

Karma?

kysstfafm · January 11, 2023, 11:52pm

I’m likely off-base but:

Does it make sense to investigate any iptables-related logging (I have not yet looked up what keeps logs; is iptables not among them?) to check on the web VMs that lose and then regain connectivity? You say you looked at the dom0 logs, but maybe some of the key VMs (sys-net or equivalent, etc.) run networking-related software that keeps its own logs? If you modified things enough to run Wireshark in a few VMs as internal data collection points (within a web VM, or in a VM further along the path), then perhaps you could collect usable information about what actually happens when the loss occurs.
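
Short of a full packet capture, a small probe inside an affected qube could at least record which layer fails at the moment connectivity is lost. This is only a sketch: `probe` is a hypothetical helper, the targets are placeholders, and it assumes `getent`, `ping` and `curl` are installed in the qube (a minimal template may lack some of them).

```shell
# Sketch: run one check and report whether it succeeded.
probe() {
    local label="$1"
    shift
    # Run the remaining arguments as a command, discarding its output.
    if "$@" >/dev/null 2>&1; then
        echo "$label: ok"
    else
        echo "$label: FAIL"
    fi
}

# Example checks inside an affected qube (targets are placeholders):
#   probe dns   getent hosts example.org
#   probe icmp  ping -c 1 -W 2 9.9.9.9
#   probe https curl -sS -m 5 -o /dev/null https://example.org
```

Running the same three checks in a working qube and a broken one, both behind the same netvm, would show whether only name resolution survives or whether some external traffic still gets through.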

cayce · January 12, 2023, 1:48am

Maybe Layer 7?

CVE-2022-23529

Sven · January 13, 2023, 8:11pm

Thanks for all the input. I will do a complete reinstall and start off with standard Fedora templates and create my qubes from scratch. I won’t make any changes in dom0 at all. Not even whisker menu or redshift.

Then I will test this on both of my identical T430.

If I still see issues then, I guess I’ll file a bug report.