Large/Consistent Volume of Intranet Traffic Causes Networking to Fail

I’m having an issue with networking in Qubes OS v4.1 on a Librem 14v1 laptop. Whenever any of my qubes facilitates a large or consistent volume of intranet* traffic**, all networking fails eventually (100% of the time)***.

*I only say intranet here because I don’t do large, consistent downloads or uploads over the internet, so I’m not sure if this issue is encountered by internet traffic as well.

**I started noticing this when backing up a NAS over the intranet via rsync. Every time a backup involves consistent file transfers that span ~45 minutes or so, networking fails. This shouldn’t be that stressful for Qubes because the NAS can only output data at a rate of ~5 MB/s. When I review the rsync log, I notice that the transfer rate stays steady at ~4-6 MB/s for several minutes, then starts dropping to 0, at which point it will throw an error stating the host is down.

***This isn’t just an issue with any particular qube. When I encounter this, ALL networking on the laptop stops working - no internet traffic, no intranet traffic, on any qube. Not even restarting sys-net and sys-firewall fixes this, leading me to believe this issue may go down to dom0. The following does get networking working again:

  1. Rebooting the whole system
  2. Most of the time, waiting ~5 minutes allows networking to start working again. In cases where waiting “doesn’t work”, it’s not clear to me whether only a reboot will fix it or whether I would just need to wait longer than usual, or longer than I’m willing to. When I wait in lieu of rebooting, networking does seem to remain slower and more prone to seizing up again until I do a reboot.

I’m going to pre-answer some obligatory questions here:

  1. I don’t believe this is an issue with my network because all of my other devices are working fine when I run into this issue
  2. I don’t believe this is an issue with my NAS because I can run the same backup on my Windows systems without issue (I’ve had this run for 10+ days on Windows without issues)
  3. I haven’t made any modifications to my kernel or dom0 (other than installing updates)
  4. This shouldn’t be an issue of amount of RAM (I have 64 GB) or disk space (I have 2 TB)
  5. I don’t believe this is particular to rsync because I encountered this issue while manually downloading many files from another system on my network

It seems as though Qubes networking software has either a memory leak or something that doesn’t scale well (perhaps logging) that causes networking to grind to a halt under sustained traffic. That might explain why it usually starts working again after a certain number of minutes - perhaps something like logging getting overwhelmed and needing time to flush.

Does anyone have any suggestions on how to fix this? What used to be letting a simple backup script run in the background in Windows has turned into a very time-consuming process in Qubes because networking fails several times per backup.

I do not know what wifi interface a Librem 14v1 has.

But there is general guidance for service qubes that deal directly with hardware (sys-net, sys-usb): the current Qubes OS default memory allocation for those qubes - sys-usb, and in this case sys-net - is known to be tight, and this has been reported repeatedly in GitHub issues.

I would advise reading this:

Inspecting logs from sys-net in a terminal with `sudo journalctl` will tell you what is happening at the moment you lose networking, but the solution is most probably to fine-tune the memory allocated to sys-net. Start by giving it 100 MB more and see if the problem disappears. Then divide the increment by two, and repeat until the problem reappears.
You could also try basing your sys-net on the debian template instead of fedora.
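As a minimal dom0 sketch of that memory bump (the 500 MB figure is just an assumption - the default plus 100 MB - and you should adjust it as you bisect):

```shell
# dom0: qubes with PCI devices (like sys-net) use a fixed memory
# allocation, so bump the 'memory' property and restart the qube
qvm-shutdown --wait sys-net
qvm-prefs sys-net memory 500   # assumed starting point: default + 100 MB
qvm-start sys-net
```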

Fedora’s memory footprint grows larger with each release.

Your story doesn’t say whether you upgraded in place from Q4.0, but Q4.1 uses 400 MB as the default for sys-net.
If it needs more than 450 MB, or you continue to lose networking because your network drivers have memory leaks, please search the issues on GitHub and open a new one there.

Thanks for the reply. Here are some updates:

  1. I installed Qubes OS 4.1 directly (no upgrade)
  2. I did try reviewing journalctl but ran into a separate issue: there were so many messages in it that it was hard to get anything out of it. I don’t remember what the offending messages were, but I did see that others were having this problem.
  3. Unfortunately, a dom0 update just bricked the private storage of all of my qubes (Dom0 Update Bricked All Qubes), so I can’t even start sys-net. If I can fix this, I will try increasing the memory. I didn’t earlier because, with the “Max memory” field disabled, I assumed there was no maximum memory (i.e. it’s unbounded), so I didn’t think it would be necessary. If this is the issue, I would also find it curious that the default is set so low that you can’t do basic things without the qube running into stability issues.

Before my system got bricked, I was considering switching my sys qubes from fedora to debian. Any indication as to which one should generally be more reliable?

Hardware-related service qubes require a static amount of memory assigned at boot. The fields are greyed out because you cannot assign different minimum and maximum memory values. There is no memory ballooning on HVM virtual machines, which is what Qubes depends on to provide PCI device passthrough, required for sys-net and sys-usb.

Other PVH-based qubes (normal qubes) support memory ballooning, so memory can be “stolen” from a qube where it is available and given to another where it is needed.
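For illustration, the difference is visible from dom0 (the qube names and expected values here are assumptions - on a stock 4.1 setup, a PCI-attached qube’s maxmem should be 0, meaning ballooning is disabled):

```shell
# dom0: compare memory settings of a PCI-attached qube vs a normal one
qvm-prefs sys-net memory    # static allocation used at boot
qvm-prefs sys-net maxmem    # 0 = no ballooning for PCI passthrough qubes
qvm-prefs personal maxmem   # 'personal' is an example PVH qube; nonzero
```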

The easy way would be to open a terminal in sys-net and run `sudo journalctl -f` to follow the logs as they are produced, until the failure shows a corresponding kernel error message on screen.

Otherwise, the logs need to be searched. A good starting point is to search for errors, with something like `sudo journalctl --boot 0 | grep -i error` to search the logs from the current sys-net boot session for “error”, case-insensitively.
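To illustrate what that pipeline does, here is the same filter run over a few fabricated journal lines (the messages are made up, not real sys-net output):

```shell
# stand-in for `sudo journalctl --boot 0`: three fake log lines,
# piped through the same case-insensitive error filter
printf '%s\n' \
  'kernel: iwlwifi 0000:00:14.3: fw error' \
  'NetworkManager: device (wls6) state change' \
  'kernel: ERROR: transmit queue timed out' |
  grep -i error
# prints the first and third lines only
```

journalctl’s `--priority err` option is another way to get at much the same thing without grep.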

You will have to fix the vm-pool error mentioned in your other thread first, of course.

As stated earlier, fedora is more memory-hungry than debian. It is not more “reliable” in that sense, just more lightweight.