Errors when guest performs block device I/O

I’m persistently encountering errors in my AppVMs centered on access to their virtual block devices. Here’s a typical guest kernel log:

[2025-03-16 00:08:16] [87388.333033] audit: type=1334 audit(1742098096.665:178): prog-id=65 op=LOAD
[2025-03-16 00:08:16] [87388.376073] audit: type=1130 audit(1742098096.708:179): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=unbound-anchor comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2025-03-16 00:08:16] [87388.376107] audit: type=1131 audit(1742098096.708:180): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=unbound-anchor comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2025-03-16 00:08:16] [87388.390928] audit: type=1130 audit(1742098096.723:181): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=logrotate comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2025-03-16 00:08:16] [87388.390968] audit: type=1131 audit(1742098096.723:182): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=logrotate comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2025-03-16 00:08:16] [87388.394939] audit: type=1130 audit(1742098096.727:183): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2025-03-16 00:08:16] [87388.394976] audit: type=1131 audit(1742098096.727:184): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2025-03-16 00:08:16] [87388.411757] audit: type=1334 audit(1742098096.744:185): prog-id=65 op=UNLOAD
[2025-03-16 00:27:41] [88553.111577] audit: type=1334 audit(1742099261.444:186): prog-id=66 op=LOAD
[2025-03-16 00:27:41] [88553.111836] audit: type=1334 audit(1742099261.445:187): prog-id=67 op=LOAD
[2025-03-16 00:27:41] [88553.111884] audit: type=1334 audit(1742099261.445:188): prog-id=68 op=LOAD
[2025-03-16 00:27:41] [88553.111906] audit: type=1334 audit(1742099261.445:189): prog-id=57 op=UNLOAD
[2025-03-16 00:27:41] [88553.111932] audit: type=1334 audit(1742099261.445:190): prog-id=58 op=UNLOAD
[2025-03-16 00:27:41] [88553.155078] audit: type=1130 audit(1742099261.488:191): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2025-03-16 00:27:41] [88553.613202] audit: type=1334 audit(1742099261.946:192): prog-id=69 op=LOAD
[2025-03-16 00:27:41] [88553.613279] audit: type=1334 audit(1742099261.946:193): prog-id=70 op=LOAD
[2025-03-16 00:27:41] [88553.613309] audit: type=1334 audit(1742099261.946:194): prog-id=71 op=LOAD
[2025-03-16 00:27:41] [88553.614426] FAT-fs (xvda1): bogus number of reserved sectors
[2025-03-16 00:27:41] [88553.614443] FAT-fs (xvda1): Can't find a valid FAT filesystem
[2025-03-16 00:27:41] [88553.618552] audit: type=1130 audit(1742099261.951:195): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pcscd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2025-03-16 00:28:11] [88583.248271] kauditd_printk_skb: 1 callbacks suppressed
[2025-03-16 00:28:11] [88583.248277] audit: type=1131 audit(1742099291.581:197): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2025-03-16 00:28:11] [88583.283817] audit: type=1334 audit(1742099291.617:198): prog-id=66 op=UNLOAD
[2025-03-16 00:28:12] [88583.664295] audit: type=1131 audit(1742099291.997:199): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-timedated comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2025-03-16 00:28:12] [88583.676908] audit: type=1334 audit(1742099292.010:200): prog-id=71 op=UNLOAD
[2025-03-16 00:28:12] [88583.676968] audit: type=1334 audit(1742099292.010:201): prog-id=70 op=UNLOAD
[2025-03-16 00:28:12] [88583.677011] audit: type=1334 audit(1742099292.010:202): prog-id=69 op=UNLOAD
[2025-03-16 05:56:16] [108268.330704] audit: type=1130 audit(1742118976.675:203): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=raid-check comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2025-03-16 05:56:16] [108268.330809] audit: type=1131 audit(1742118976.675:204): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=raid-check comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2025-03-16 05:56:17] [108268.583076] audit: type=1130 audit(1742118977.055:205): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=qubes-sync-time comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2025-03-16 05:56:17] [108268.583171] audit: type=1131 audit(1742118977.055:206): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=qubes-sync-time comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2025-03-16 11:16:28] [127480.393831] I/O error, dev xvda, sector 18204816 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0
[2025-03-16 11:16:28] [127480.396712] I/O error, dev xvda, sector 5469048 op 0x0:(READ) flags 0x80700 phys_seg 8 prio class 0
[2025-03-16 11:16:28] [127480.400048] I/O error, dev xvda, sector 17496608 op 0x1:(WRITE) flags 0x9800 phys_seg 2 prio class 2
[2025-03-16 11:16:28] [127480.400076] Aborting journal on device xvda3-8.
[2025-03-16 11:16:28] [127480.402710] I/O error, dev xvda, sector 17455104 op 0x1:(WRITE) flags 0x9800 phys_seg 1 prio class 2
[2025-03-16 11:16:28] [127480.402755] I/O error, dev xvda, sector 17455104 op 0x1:(WRITE) flags 0x9800 phys_seg 1 prio class 2
[2025-03-16 11:16:28] [127480.402770] Buffer I/O error on dev xvda3, logical block 2129920, lost sync page write
[2025-03-16 11:16:28] [127480.402793] JBD2: I/O error when updating journal superblock for xvda3-8.
[2025-03-16 11:16:28] [127480.402830] EXT4-fs error (device xvda3): ext4_journal_check_start:84: comm systemd-journal: Detected aborted journal
[2025-03-16 11:16:28] [127480.416779] EXT4-fs (xvda3): Remounting filesystem read-only
[2025-03-16 11:16:28] [127480.416967] I/O error, dev xvda, sector 5465232 op 0x0:(READ) flags 0x80700 phys_seg 8 prio class 0
[2025-03-16 11:16:28] [127480.420057] I/O error, dev xvda, sector 5463048 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
[2025-03-16 11:16:29] [127480.558017] I/O error, dev xvdc, sector 858112 op 0x1:(WRITE) flags 0x1004000 phys_seg 32 prio class 0
[2025-03-16 11:16:29] [127480.558051] Write-error on swap-device (202:32:858112)
[2025-03-16 11:16:29] [127480.558069] Write-error on swap-device (202:32:858120)
[2025-03-16 11:16:29] [127480.558083] Write-error on swap-device (202:32:858128)
[2025-03-16 11:16:29] [127480.558096] Write-error on swap-device (202:32:858136)
[2025-03-16 11:16:29] [127480.558109] Write-error on swap-device (202:32:858144)
[2025-03-16 11:16:29] [127480.558122] Write-error on swap-device (202:32:858152)
[2025-03-16 11:16:29] [127480.558214] Write-error on swap-device (202:32:858160)
[2025-03-16 11:16:29] [127480.558230] Write-error on swap-device (202:32:858168)
[2025-03-16 11:16:29] [127480.558243] Write-error on swap-device (202:32:858176)
[2025-03-16 11:16:29] [127480.558257] Write-error on swap-device (202:32:858184)
[2025-03-16 11:16:29] [127480.560731] I/O error, dev xvdc, sector 730960 op 0x1:(WRITE) flags 0x1000000 phys_seg 22 prio class 0

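For context, the log above is dom0’s per-VM console log, which records guest kernel messages with the bracketed timestamps shown. To follow it live (the VM name “work” is just a placeholder):

sudo tail -f /var/log/xen/console/guest-work.log
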
This occurs in all of my AppVMs, yet notably there is no trace of any problem in dom0: I’ve checked the dom0 kernel logs, systemd logs, and Btrfs device stats, and I’ve scrubbed the dom0 drives containing the pools in question.
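
Concretely, the dom0 checks looked roughly like this (assuming the Btrfs pool holding the VM volumes is mounted at /; adjust the path to your pool):

sudo journalctl -k -p err       # dom0 kernel errors, if any
sudo btrfs device stats /       # per-device error counters
sudo btrfs scrub start -B /     # full scrub, waits for completion

All of these came back clean.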

A few stats on my system:

  • Qubes 4.2.3
  • Btrfs filesystems in dom0
  • Modified Fedora 40 Xfce templates (the modifications mainly add software and adjust UI scaling)

What I’ve tried:

  • Various combinations of memory ballooning, virtualization mode, and guest kernel.
  • Upgraded qubes-utils from 4.2.17 to 4.2.19 and confirmed that direct I/O went from enabled to disabled, to rule out “Disable direct IO on Btrfs based storage” (#9488); see the check sketched after this list.
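
The direct I/O check I used, for reference. This assumes a file-reflink pool whose volume images are attached via loop devices in dom0; the DIO column shows whether direct I/O is in effect:

sudo losetup -l -O NAME,DIO,BACK-FILE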

There seems to be some correlation with memory pressure, since the errors appear most often on the AppVMs with the least memory. On VMs with ample memory the errors still occur, but less frequently. Swap use doesn’t seem to be a necessary condition.
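
To correlate errors with memory, I’ve been comparing each qube’s memory settings against what Xen actually assigns. A sketch (if these column names aren’t accepted, qvm-ls --help-columns lists the available ones):

qvm-ls --running -O NAME,MEMORY,MAXMEM   # configured memory per qube
sudo xl list                             # actual allocation per domain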

At this point, I suspect there’s an issue with Xen, xen_blkback, and/or xen_blkfront. Any advice on how to debug this further would be greatly appreciated.


When did it start?

I’d try a few VM kernel versions (also switching from the normal to the -latest kernels, or vice versa) to see if that changes anything [edit: sorry, I missed that you already tried that], and if not, the same with host kernels. For example:
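
In dom0, something like this (VM name and kernel version are placeholders; pick a version from the listing):

sudo qubes-dom0-update kernel-latest-qubes-vm   # install a -latest VM kernel
ls /var/lib/qubes/vm-kernels/                   # installed VM kernels
qvm-prefs work kernel 6.6.48-1.fc41             # switch one VM to it, then restart the VM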

Do you also see errors for xvdb (the ‘private’ volume)? It would be curious if they only affected xvda (‘root’) and xvdc (‘volatile’).
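
A quick way to check, inside an affected guest:

sudo dmesg | grep -E 'xvd[abc]'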

Thanks @rustybird. The errors go back as far as I have logs. I should also have mentioned that these errors occur in regular VMs and disposables alike. All devices (xvda, xvdb, and xvdc) are affected, though xvda, and to some extent xvdc, are by far the most involved (thankfully). The problem usually starts as a read error on xvda, and errors from the other devices then quickly pile on.

Very strange. Normally I’d say this smells like a hardware problem (bad RAM, storage, PSU, etc.), but I can’t wrap my head around there being no dom0 kernel errors.

Maybe open a bug if you don’t get any better response here?

Yeah, this is strange. I run 128 GB of ECC memory. I haven’t run memtest or gotten EDAC reporting working yet, so for the moment I’ll just assume the memory is OK, though I will need to confirm. The plan looks roughly like this:
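
(edac-util comes from the edac-utils package; the sysfs counters only exist once a platform EDAC driver is loaded)

sudo qubes-dom0-update edac-utils                 # in dom0
edac-util -v                                      # corrected/uncorrected error counts
grep . /sys/devices/system/edac/mc/mc*/ce_count   # raw counters, if the driver is loaded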

I’m inclined to think it’s Xen or its drivers. I came across your GitHub discussion on xen_blk*, which suggests to me that additional bug(s) in the driver(s) are not that far-fetched.

There are a couple more things that come to mind:

  • I had previously increased dom0’s maximum memory to 32 GB to accommodate what I presume is a memory leak in Borg. This shouldn’t have a negative impact, but I doubt there are many systems out there with that much dom0 memory. Maybe this configuration elicits a race condition or exercises an infrequent code path?
  • I noticed that by default the fedora-40-xfce and debian-12-xfce templates load a bunch of backend (host-side) kernel drivers: xen_scsiback, xen_netback, and xen_blkback. These are loaded via /usr/lib/modules-load.d/xen.conf, which on Fedora is provided by the xen-runtime package. It appears that xen-runtime is installed by default, for reasons I don’t understand. I’ve unloaded the host modules on my running guests to see if that fixes things (see the sketch after this list).
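
What I ran in the guests, for reference. The modprobe line takes effect immediately; the symlink masks the shipped modules-load.d entry persistently, and would need to go in the template for AppVMs to pick it up:

sudo modprobe -r xen_scsiback xen_netback xen_blkback   # unload now
sudo ln -s /dev/null /etc/modules-load.d/xen.conf       # mask /usr/lib/modules-load.d/xen.conf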

Here’s the epilogue to my troubleshooting efforts. After trying many different things with no luck, I finally restored the lower and upper bounds of dom0 memory to their default values (1 GB and 4 GB); see below for how. That appears to have completely solved the issue, and it also seems to have alleviated another problem where guest windows would sporadically fail to be drawn and remain blank white rectangles. I assume there must be a problem with Xen. When I have the time, I’ll try to come up with a minimal example that elicits this behavior so that I can file a bug report.
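
For anyone with the same symptoms: the bounds are set on Xen’s command line in dom0. This is what the default looks like on my system (on some EFI setups the generated config path differs, e.g. /boot/efi/EFI/qubes/grub.cfg):

# /etc/default/grub in dom0 (excerpt):
GRUB_CMDLINE_XEN_DEFAULT="... dom0_mem=min:1024M dom0_mem=max:4096M ..."

sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # regenerate, then reboot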
