Sys-net: sudden network loss under intense load, after some time [Q4.2]

A couple of months ago, sys-net started cutting off the network connection after some time under intense load.

I’m trying to figure out how to get to the root cause of this, so this is more of a brainstorming question.

What Is Happening?

I’m running a borg backup to my Synology NAS, which copies ~100 GB of data over a couple of hours.

This specific backup worked fine for years, until it stopped working around April, when the network cut-offs started. Suddenly, anywhere from about 30 minutes to a couple of hours into the backup, the network connection is gone. It stays gone until I restart sys-net, which fixes it.

It’s 100% reproducible, although the time until it happens varies. At some point it always cuts off.

Changes to the System

There were no relevant hardware-related changes. Maybe an additional hard drive. But not related in my view.

The biggest change was the upgrade from Qubes 4.1 to 4.2. The issues started after the upgrade, although I’m not sure how the two are connected. It might just as well be a kernel update around that time, or a Fedora/Debian upgrade to a more recent version.

What Did I Try?

So far, I played with sys-net:

  • tried different (Qubes-provided) kernel versions, to no avail
  • tried Debian template instead of Fedora template
  • tried increasing memory

No change.

Diagnostic Information

I ran dmesg --follow and journalctl --follow during the backup, and here’s what they showed around the cut-off time:

dmesg

[    5.248230] r8169 0000:00:06.0: Direct firmware load for rtl_nic/rtl8168f-1.fw failed with error -2
[    5.248263] r8169 0000:00:06.0: Unable to load firmware rtl_nic/rtl8168f-1.fw (-2)
[    5.249923] RTL8211E Gigabit Ethernet r8169-0-30:00: attached PHY driver (mii_bus:phy_addr=r8169-0-30:00, irq=MAC)
[    5.298235] r8169 0000:00:06.0 ens6: No native access to PCI extended config space, falling back to CSI
[    5.302369] r8169 0000:00:06.0 ens6: Link is Down
[    8.106798] r8169 0000:00:06.0 ens6: Link is Up - 1Gbps/Full - flow control off
[   28.424771] vif vif-14-0 vif14.0: Guest Rx ready
[   35.645562] vif vif-15-0 vif15.0: Guest Rx ready
[ 4343.345700] r8169 0000:00:06.0 ens6: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5678 ms

journalctl

Aug 31 23:00:05 minint-7m0rjgf.core.localnet kernel: r8169 0000:00:06.0 ens6: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5678 ms
Aug 31 23:30:57 minint-7m0rjgf.core.localnet systemd-timesyncd[447]: Timed out waiting for reply from 188.174.253.188:123 (2.debian.pool.ntp.org).
Aug 31 23:31:07 minint-7m0rjgf.core.localnet systemd-timesyncd[447]: Timed out waiting for reply from 85.215.93.134:123 (2.debian.pool.ntp.org).
Aug 31 23:31:17 minint-7m0rjgf.core.localnet systemd-timesyncd[447]: Timed out waiting for reply from 129.70.132.32:123 (2.debian.pool.ntp.org).
Aug 31 23:31:27 minint-7m0rjgf.core.localnet systemd-timesyncd[447]: Timed out waiting for reply from 185.232.69.65:123 (2.debian.pool.ntp.org).
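
To catch the moment of the cut-off without watching the terminal, the watchdog line can be filtered out of the kernel log. A minimal sketch, assuming the message format shown in the logs above (other kernel versions word it differently); watchdog_events is a hypothetical helper name, not an existing tool:

```shell
#!/bin/sh
# Read kernel log lines on stdin and print "<interface> <timeout in ms>"
# for every NETDEV WATCHDOG transmit timeout in the format shown above.
watchdog_events() {
    sed -n 's/.* \([^ ]*\): NETDEV WATCHDOG: .*timed out \([0-9][0-9]*\) ms.*/\1 \2/p'
}

# Example (inside sys-net):
#   dmesg | watchdog_events
#   journalctl -k --no-pager | watchdog_events
```

Lines that are not watchdog timeouts (link up/down, firmware messages) produce no output, so an empty result means the timeout has not been logged yet.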

lspci -v

00:06.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 09)
	Subsystem: ASUSTeK Computer Inc. P8 series motherboard
	Physical Slot: 6
	Flags: bus master, fast devsel, latency 0, IRQ 40
	I/O ports at c200 [size=256]
	Memory at f2018000 (64-bit, prefetchable) [size=4K]
	Memory at f2010000 (64-bit, prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [70] Express Endpoint, MSI 01
	Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
	Capabilities: [d0] Vital Product Data
	Kernel driver in use: r8169
	Kernel modules: r8169

The device is attached to sys-net.

Related Issues

It’s hard to find reports of this problem in a Qubes context. It might be a generic hardware/driver issue.

I came across posts about various NAS systems reporting NETDEV WATCHDOG timeouts.

What’s Next?

That’s where I need help. Where to go from here?

Options I consider:

  1. changing network-related settings?
  2. using the r8168 driver instead of r8169 (but how??)
  3. trying a different network device (last resort)

Reading the internet, option 1 doesn’t seem to work very often, option 2 sounds promising, and option 3 is for the desperate (I am…).
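
For option 1, the settings that usually come up for r8169 transmit timeouts are the NIC offloads. A minimal sketch that only prints the ethtool commands so they can be reviewed before anything changes; offload_off_cmds is a hypothetical helper, ens6 is the interface name from the logs above, and whether turning these off helps here is untested:

```shell
#!/bin/sh
# Print (do not run) the ethtool commands that turn off the usual offloads.
# Defaults to "ens6", the interface name from the logs; pass another if needed.
offload_off_cmds() {
    iface=${1:-ens6}
    for feature in tso gso gro sg; do
        printf 'ethtool -K %s %s off\n' "$iface" "$feature"
    done
}

# Review the list first, then apply it as root inside sys-net:
#   offload_off_cmds ens6 | sh
```

These are runtime settings, so they are gone after sys-net restarts; ethtool -k ens6 shows the current state of each offload.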

How would you proceed?

On the Proxmox forum there is a thread saying that using the r8169 driver with r8168 hardware causes this problem. It was a bug with newer kernels: the r8169 module was used for r8168 hardware.
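
Before swapping drivers, it's worth confirming which module is actually bound to the card. A minimal sketch; driver_in_use is a hypothetical helper that just filters lspci output:

```shell
#!/bin/sh
# Read `lspci -v` or `lspci -k` output on stdin and print the kernel
# driver bound to each device that reports one.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}

# Example (inside sys-net, using the PCI address from the lspci output):
#   lspci -k -s 00:06.0 | driver_in_use
```

For the lspci -v output posted above this prints r8169. Debian packages the vendor driver as r8168-dkms (non-free); I have not tested it in sys-net, and as far as I understand an out-of-tree module in Qubes means sys-net has to boot a kernel installed in its template rather than the dom0-provided one.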

PS: there might also be a problem with Realtek power saving. Try disabling ASPM in the kernel boot options:

pcie_aspm=off
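
To check whether the option actually took effect inside sys-net, the kernel command line and the ASPM policy can be inspected. A minimal sketch with two hypothetical helpers; the sysfs path in the comments is the standard Linux one, but I'm assuming it is exposed inside the qube:

```shell
#!/bin/sh
# Succeed if the given kernel command line contains pcie_aspm=off.
cmdline_has_aspm_off() {
    case " $1 " in
        *" pcie_aspm=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the active ASPM policy from a policy string such as
# "default performance [powersave] powersupersave" (the bracketed entry).
active_aspm_policy() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Example (inside sys-net):
#   cmdline_has_aspm_off "$(cat /proc/cmdline)" && echo "pcie_aspm=off is set"
#   active_aspm_policy "$(cat /sys/module/pcie_aspm/parameters/policy)"
```

In Qubes, kernel options are per qube: as far as I know they are read and set from dom0 with qvm-prefs sys-net kernelopts, and setting a value replaces the existing one, so note the current value first. I'm not sure whether a passed-through NIC needs the option there, on the dom0 boot line, or both.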