Qrexec-agent not starting after VM update

My “stage1 update failure” report did not get much response, but I suspect it is the root of my stage2 woes below, which seem to warrant their own topic.

What’s blocking me right now is that `qubes-dist-upgrade --template-standalone-upgrade` installs a qrexec-agent that fails to start (as described here).

strace shows the agent waiting on /dev/xen/evtchn for data that never arrives:

26022 openat(AT_FDCWD, "/dev/xen/evtchn", O_RDWR|O_CLOEXEC) = 4
26022 ioctl(4, IOCTL_EVTCHN_BIND_UNBOUND_PORT, 0x7ffd0f540504) = 24
26022 write(4, "\30\0\0\0", 4)          = 4
26022 openat(AT_FDCWD, "/dev/xen/privcmd", O_RDWR|O_CLOEXEC) = 5
26022 openat(AT_FDCWD, "/dev/xen/hypercall", O_RDWR|O_CLOEXEC) = 6
26022 openat(AT_FDCWD, "/dev/xen/privcmd", O_RDWR|O_CLOEXEC) = 7
26022 ioctl(7, _IOC(_IOC_NONE, 0x50, 0xff, 0), 0) = -1 ENOTTY (Inappropriate ioctl for device)
26022 openat(AT_FDCWD, "/dev/xen/privcmd", O_RDWR|O_CLOEXEC) = 8
26022 openat(AT_FDCWD, "/dev/xen/hypercall", O_RDWR|O_CLOEXEC) = 9
26022 openat(AT_FDCWD, "/dev/xen/privcmd", O_RDWR|O_CLOEXEC) = 10
26022 ioctl(10, _IOC(_IOC_NONE, 0x50, 0x5, 0x10), 0x7ffd0f540110) = 0
26022 read(4,
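For reference, that final `read(4, …` is the agent blocking on the event-channel fd until the peer signals it; since nothing on the dom0 side ever connects, the read never returns. A minimal Python sketch of the same pattern, checking whether an fd is readable before blocking (demonstrated on an ordinary pipe rather than /dev/xen/evtchn, so it runs anywhere):

```python
import os
import select

def fd_ready(fd, timeout=1.0):
    """Return True if fd has data to read within `timeout` seconds.

    The strace above ends in a read() that blocks forever; a select()
    like this would simply show the fd never becoming readable,
    because the peer (qrexec-daemon) never signals the event channel.
    """
    readable, _, _ = select.select([fd], [], [], timeout)
    return bool(readable)

# Demonstration on a plain pipe (no Xen needed):
r, w = os.pipe()
print(fd_ready(r, 0.1))           # False: nothing written yet
os.write(w, b"\x18\x00\x00\x00")  # like the 4-byte port write in the trace
print(fd_ready(r, 0.1))           # True: data is now pending
```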

I still wonder what went wrong in dom0 during stage1, but I guess some
critical service has gone missing, and maybe was not health-checked
afterwards?

(still learning about the architecture here, please bear with me)

I can see that the `qrexec-daemon` for the VM being updated, which was still running after the packages were updated (waiting at the final “shutdown of unnecessary VM” prompt), is gone after the qrexec-agent restart attempt, while its children linger:

  |-qrexec-policy,4644 /usr/bin/qrexec-policy -- 206 debian-10 dom0 qubes.WindowIconUpdater SOCKET8
  |   `-qrexec-client,4645 -d dom0 -c SOCKET8 debian-10 206 QUBESRPC qubes.WindowIconUpdater debian-10 name dom0
  |       `-icon-receiver,4646 /usr/lib/qubes/icon-receiver
  |           `-logger,4649 -t qubes.WindowIconUpdater-debian-10 -f /tmp/qrexec-rpc-stderr.4646

dom0 logs show:

[root@dom0 ~]# cat /var/log/qubes/qrexec.debian-10.log  
[root@dom0 ~]#

There are no timestamps in there, but I can gather that:

  • the 9 eintr events occur while the update script is running
  • the libvchan_is_eof event occurs when I `systemctl stop qubes-qrexec-agent.service`, which is also when qrexec-daemon dies

… but then, all of this also happens with a plain 4.0 template VM; apparently the real problem still evades my understanding.

Debugging what happens in the VM as it boots is less obvious; any hints on doing that?

The reason appears to be in the qrexec.$VM.log:

Incompatible agent protocol version (remote 3, local 2)
Unable to init server: Could not connect: Connection refused

(zenity:9600): Gtk-WARNING **: cannot open display: 

After reviewing what each stage does, I’m left with the impression that stage1 updates the VM with a 4.1 qrexec-agent, which fails to work with the 4.0 qrexec-daemon, and that the 4.1 qrexec-daemon only gets installed in stage4.
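If that reading is right, the compatibility rule is asymmetric: a newer daemon can fall back to an older agent’s protocol, but a daemon cannot speak a protocol newer than its own. A toy model of that rule (not the actual qrexec code, just the behaviour as observed in the log):

```python
def negotiate(daemon_version: int, agent_version: int) -> int:
    """Toy model of the observed compatibility rule: a newer daemon
    can fall back to an older agent's protocol, but a daemon cannot
    speak a protocol version newer than its own."""
    if agent_version > daemon_version:
        raise RuntimeError(
            "Incompatible agent protocol version "
            f"(remote {agent_version}, local {daemon_version})")
    return agent_version

print(negotiate(3, 2))  # R4.1 daemon + R4.0 agent: fine, speaks v2
try:
    negotiate(2, 3)     # R4.0 daemon + R4.1 agent: the stage2 state
except RuntimeError as e:
    print(e)            # the error seen in qrexec.$VM.log
```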

But OTOH @marmarek’s announcement states:

If any stage fails, it should be safe to simply retry (potentially after
fixing what it was complaining about). 

… but it really looks like that does not hold for stage2, with VMs no longer able to start.

Is there anything I missed, @fepitre? Is it safe to proceed after stage2, even though re-running stage2 itself just does not work ATM?

Indeed this is exactly what happens. The new qrexec-daemon works with the old qrexec-agent, but not the other way around. The VM update is put before the dom0 update only to save one reboot during the process; otherwise the VM can be updated after dom0. And in fact you can do that: proceed with the subsequent stages and retry stage 2 at the end.

So I went that route, as with some VMs already updated it looked the most reasonable option. But though stage3 went without an issue, there are a few more issues down the road for me.

stage 4

First, stage4 exits as if everything went fine (it ends with “Complete!” and the usual info message), but taking a close look at the ~14K lines of recorded output (not counting the “set -x” lines), I can see:

  • a first 465-packages upgrade, with no issue
  • a second one (304 installs, 929 upgrades, 33 removals, 2 downgrades (?!)), apparently run in 3 passes, and with a number of messages (transcribed here by hand, abbreviated, just to give an idea):
    • dbus-daemon: failed to preset dbus.service (already exists as a symlink)
    • selinux-policy-minimum (?, and other occurrences): “no such file” for initial-setup-*.service
    • kernel-qubes-vm: qubes-prefs error: no such property “default_kernel”
    • microcode: no such file /sys/power/resume
    • initial-setup: no such property “default_kernel”
    • then immediately a Python stack trace for “empty response from qubesd”, pointing to journalctl, see below
    • networkmanager: failed to wait for daemon to reply, broken pipe
    • ?: grub2-get-kernel-settings: command not found
  • a third one for grub2 packages, with no issue, maybe making the last one less of a problem

Now about the qubesd issue: qubesd is effectively not running, and the logs since then are full of connection attempts that always fail, leaving the system unable to launch a VM (at least by the usual Qubes means).

The logs show:

  • qmemman.daemon.xswatcher failing to do_balance with Permission Denied, apparently on xc.domain.getinfo()
  • starting qubesd, repeated several times:
    • AppVM has no attribute _qubesprop_maxmem (whose handling triggers a nested Permission Denied exception on xc.physinfo() from get_free_xen_memory())
    • libvirtd: virNetSocketReadWire I/O error / EOF while reading data

Any idea what went wrong there? Suggestions for getting out of here from just-a-dom0 with no VM running greatly appreciated :slight_smile:

stage 5

I also tried stage5, which showed errors of its own:

  • before starting stage5 proper: “xl info xen_commandline” fails with libxl errors (failed to retrieve max num of cpus; permission denied getting physinfo and current sched id)
  • “efibootmgr -c”: could not prepare boot variable: no space left on device

This last update stage at least exited with an error.

Strace says this happens when efibootmgr requests a 110-byte write to /sys/firmware/efi/efivars/Boot0000-${SOME_UUID}. Note there are existing Boot0001 through Boot0005 entries with that same UUID.

There are a number of suggestions on the net on how to deal with this:

  • some suggesting that CSM should be disabled in the BIOS (which I’m pretty sure is already the case, but I’m a bit reluctant to attempt a reboot at this point just to check)
  • some suggesting to make some room by removing some of the “dump*” entries in efivarfs, since they are just logs
  • some suggesting the `efi_no_storage_paranoia` kernel parameter, but here again a reboot is required
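As a quick sanity check before deleting anything, the space used by those “dump*” variables can be totalled. A hedged Python sketch (on a real system the directory would be /sys/firmware/efi/efivars; demonstrated here on a throwaway directory so it runs anywhere):

```python
import os
import tempfile
from pathlib import Path

def efivar_usage(path, prefix=""):
    """Sum the sizes of the entries under a directory, optionally
    filtered by name prefix. On a real system, pass
    /sys/firmware/efi/efivars with prefix="dump" to total the
    crash-log variables suggested for removal."""
    return sum(
        os.path.getsize(os.path.join(path, name))
        for name in os.listdir(path)
        if name.startswith(prefix))

# Demonstration on a temporary directory:
d = tempfile.mkdtemp()
Path(d, "dump-1").write_bytes(b"\0" * 1024)
Path(d, "Boot0001-1234").write_bytes(b"\0" * 110)
print(efivar_usage(d, "dump"))  # 1024
print(efivar_usage(d))          # 1134
```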

Any comment on those, or other suggestions ?


Stage 4 performs most of the dom0 upgrade; after that, a reboot is needed (to boot the new Xen version) before starting any VM, or even qubesd. There should indeed be a clearer message about it.

As for the other issues:

Those are rather harmless, mostly a consequence of the large version differences between packages in R4.0 and R4.1.

This is expected before rebooting.

This is worrying. It means it failed to set the boot entry.

If you have them, yes, try this path, they could use up a lot of space.

That did help, thanks!

Once fully upgraded there are a couple of smallish issues left:

  • on login I get a transient notification about some keyboard failing to get attached from sys-usb to dom0, though I can see no problem using this laptop’s keyboard
  • sys-net did not look fully started in the panel applet on first boot (starting spinner kept spinning), though AFAICT it was fully working
  • fedora-32 TemplateVM (and those based on it) lost most of its menu entries, leaving only Terminal and Software, whereas a fedora-32 standalone VM kept Firefox and Qubes Settings
  • the xfce keymap-switcher panel applet was removed from the panel (just had to add it again)
  • while in some VMs the originally-selected Spanish keymap is active, in some others’ apps I was getting a US keymap. I just had to reselect the already-selected Spanish keymap in the applet to set things straight again, but I find the discrepancy between VMs puzzling