I would normally not post at all, but it’s likely this issue affects others, so I will make an exception.
Immediately after I upgraded to Qubes 4.3, I noticed that my NetVMs drop traffic severely; I can’t resolve DNS, TCP connections never progress past half-open state, the works. However, if I run tcpdump in the NetVM, then suddenly networking works perfectly and no drops take place.
Now, I have a fairly complex setup that uses Qubes as a better Proxmox: a star topology in which a bunch of NetVMs talk to each other through a “nexus” VM. The nexus has a Xen network link to each NetVM (the nexus holds the Xen backend VIF, while each NetVM holds the Xen frontend VIF), so the NetVMs can route traffic between VLANs with the nexus in the middle.
TLDR: The fix was to forcibly set the kernel preference of each NetVM to an old Qubes 4.2 kernel, 6.12.37-1.fc37 (I was lucky that the upgrade had left old VM kernels around in dom0). With this kernel, the NetVMs run flawlessly and no traffic is dropped.
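For anyone who wants to try the same workaround, here is a minimal sketch of the pinning step, meant to be run in dom0. It assumes the old kernel still sits under /var/lib/qubes/vm-kernels and that setting the `kernel` property with `qvm-prefs` is all that is needed. The qube names are hypothetical placeholders, and the script only prints the commands unless `dry_run` is turned off.

```python
#!/usr/bin/env python3
"""Sketch: pin NetVMs to an older VM kernel. Intended to run in dom0."""
import os
import subprocess

KERNEL_DIR = "/var/lib/qubes/vm-kernels"  # assumed location of VM kernels in dom0
OLD_KERNEL = "6.12.37-1.fc37"             # Qubes 4.2 kernel named in the post
NETVMS = ["netvm-a", "netvm-b"]           # hypothetical qube names


def available_kernels(kernel_dir=KERNEL_DIR):
    """Return the kernel versions found in kernel_dir, or [] if it is missing."""
    if not os.path.isdir(kernel_dir):
        return []
    return sorted(os.listdir(kernel_dir))


def pin_command(vm, kernel=OLD_KERNEL):
    """Build the qvm-prefs call that sets one qube's kernel property."""
    return ["qvm-prefs", vm, "kernel", kernel]


def pin_all(vms, kernel=OLD_KERNEL, dry_run=True):
    """Return the commands for all qubes; execute them unless dry_run is set."""
    commands = [pin_command(vm, kernel) for vm in vms]
    if not dry_run:
        if kernel not in available_kernels():
            raise SystemExit(f"kernel {kernel} not found in {KERNEL_DIR}")
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run by default: only print what would be executed.
    for cmd in pin_all(NETVMS):
        print(" ".join(cmd))
```

The kernel property is only read when a qube starts, so each NetVM has to be restarted before the old kernel is actually in use.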
In further investigating the issue, I tried to tweak the networking settings of one NetVM once again. This NetVM routes traffic to and from the Home Assistant VM through the nexus, which eventually hits the DNS resolver and the voice assistants.
I tried the latest shipping kernel for Qubes 4.3 (kernel-qubes-vm-6.18.19-1.qubes.fc41.x86_64). As feared, immediately after booting the NetVM, the voice assistant couldn’t communicate with Home Assistant, and DNS was broken too.
Then I disabled scatter-gather with ethtool on the Xen VIF facing the nexus VM, and voilà, DNS started working again. However, even with scatter-gather off, the voice assistant took about 20 seconds to go from question to answer (normally, a question like “what time is it?” elicits a response in less than a quarter of a second). In other words: it now works most of the time, but very slowly. My Grafana graphs also show a steady rate of dropped packets per second (but with scatter-gather off at least I can see the graphs!).
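For reference, a minimal sketch of the scatter-gather step, meant to be run as root inside the NetVM. The interface name `eth0` is an assumption (use whatever the Xen frontend VIF facing the nexus is called on your system). The sketch relies on `ethtool -K <iface> sg off` to disable the feature and `ethtool -k <iface>` to list the current state.

```python
#!/usr/bin/env python3
"""Sketch: turn scatter-gather off on one interface and check the result."""
import subprocess

IFACE = "eth0"  # assumption: the Xen frontend VIF facing the nexus


def sg_off_command(iface=IFACE):
    """Build the ethtool call that disables scatter-gather."""
    return ["ethtool", "-K", iface, "sg", "off"]


def show_features_command(iface=IFACE):
    """Build the ethtool call that lists offload features."""
    return ["ethtool", "-k", iface]


def parse_feature(output, feature="scatter-gather"):
    """Return 'on' or 'off' for a feature in `ethtool -k` output, else None."""
    for line in output.splitlines():
        name, _, value = line.strip().partition(":")
        if name == feature:
            words = value.split()
            return words[0] if words else None
    return None


def disable_sg(iface=IFACE, dry_run=True):
    """Disable scatter-gather and return the resulting state (None in dry run)."""
    if dry_run:
        print(" ".join(sg_off_command(iface)))
        return None
    subprocess.run(sg_off_command(iface), check=True)
    result = subprocess.run(show_features_command(iface), check=True,
                            capture_output=True, text=True)
    return parse_feature(result.stdout)
```

Calling `disable_sg(dry_run=False)` applies the change. It is a runtime setting, so it has to be reapplied after every reboot of the NetVM.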
So disabling scatter-gather or other offload options on the network VIF did not solve the problem; it merely mitigated it somewhat. The only mitigation that works 100% is to keep all NetVMs pinned to kernel 6.12.37 (interestingly, the nexus, which uses the latest VM kernel, has no issue).
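A rough way to compare kernels or offload settings without a full monitoring stack is to sample the per-interface drop counters that Linux exposes under /sys/class/net/<iface>/statistics. The sketch below assumes the drops show up in the standard `rx_dropped` and `tx_dropped` counters, and the interface name `eth0` is again a placeholder.

```python
#!/usr/bin/env python3
"""Sketch: measure dropped packets per second on one interface."""
import os
import time

COUNTERS = ("rx_dropped", "tx_dropped")


def read_drops(iface, sysfs_root="/sys/class/net"):
    """Read the current drop counters for iface from sysfs."""
    stats_dir = os.path.join(sysfs_root, iface, "statistics")
    counters = {}
    for name in COUNTERS:
        with open(os.path.join(stats_dir, name)) as handle:
            counters[name] = int(handle.read().strip())
    return counters


def drop_rate(before, after, seconds):
    """Return drops per second between two counter snapshots."""
    return {name: (after[name] - before[name]) / seconds for name in before}


def sample(iface, seconds=10.0, sysfs_root="/sys/class/net"):
    """Take two snapshots `seconds` apart and return the drop rate."""
    before = read_drops(iface, sysfs_root)
    time.sleep(seconds)
    after = read_drops(iface, sysfs_root)
    return drop_rate(before, after, seconds)


if __name__ == "__main__":
    iface = "eth0"  # assumption: the interface facing the nexus
    if os.path.isdir(os.path.join("/sys/class/net", iface)):
        print(sample(iface))
```

Running it once under each kernel, with the same traffic flowing, gives a number to compare instead of a general impression of slowness.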
There’s nothing suspicious in the kernel logs of the NetVM.
Anyway, if you face extremely slow connectivity in your Qubes OS setup, and booting your NetVM with a Qubes 4.2 kernel fixes it, you are affected by the same problem. I'm not sure what the cause is, but I hope a future VM kernel release fixes it.
I have another problem in sys-usb, which fails to boot with the latest Qubes 4.3 VM kernel. That also forces me to stay on 6.12 from Qubes 4.2. No details on that one yet.