Random delays in qrexec-client-vm

I am making a qrexec-client-vm call from one domU to another. All of the policy stuff seems to be correct (or it wouldn’t work).

Every once in a while, it takes about ten seconds (instead of returning in much less than a second) for this trivial call to execute. On adding debug output to the recipient of the request, I’ve discovered that sometimes it “hangs” for a while trying to read the two parameters I’m passing it.

So it appears that, for some reason, it's sometimes just slow passing the parameters (and I know that dom0 is in the "middle" of the process, so this could be something going on in dom0).

Calling code:

LocalParams="$VOLUME
$SLOT
$READONLY
$TARGET
"

qrexec-client-vm --prefix-data="$LocalParams" \
                 $STORE vera.store-checkout

And receiving code on the VM named whatever $STORE is set to:

echo "Start" > /tmp/debug
echo "$VOLUME" >> /tmp/debug
echo "$SLOT" >> /tmp/debug

read VOLUME
read SLOT

echo "1" >> /tmp/debug

From doing a tail -f of /tmp/debug the delay is between the echo of Start and 1, and there’s nothing amiss shown as I echo $VOLUME and $SLOT (i.e., there isn’t a gigabyte of stuff in there somehow).
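When chasing a hang like this, it can help to bound each read with a timeout so a stalled sender leaves a marker in the log instead of blocking indefinitely. A minimal sketch, assuming the service script runs under bash (read -t is a bash/ksh extension, not POSIX sh); the helper name read_param and the sample values are made up for illustration:

```shell
# Sketch: wrap each 'read' in a timeout so a stalled sender is logged
# rather than hanging the service script forever.
read_param() {
    local name=$1 value
    if read -t 5 value; then
        echo "got $name: $value"
    else
        echo "timed out waiting for $name"
    fi
}

# Simulate qrexec delivering the two prefix lines on stdin:
printf '%s\n' vol1 slot2 | { read_param VOLUME; read_param SLOT; }
# → got VOLUME: vol1
# → got SLOT: slot2
```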


I hesitate to offer wild speculation, but could it be a buffering problem? Maybe stdbuf could help, or simply adjust the buffer size with the argument to qrexec-client-vm.
Edit: Actually, scrub that… it sounds more like nonsense the longer I think about it.

I can reproduce this. Looks like a race condition related to --prefix-data. As a workaround, you can replace

qrexec-client-vm --prefix-data="$LocalParams" \
                 $STORE vera.store-checkout

with

{ printf %s "$LocalParams"; cat; } | qrexec-client-vm \
                 $STORE vera.store-checkout

I’ll file a bug later today.


Thanks. BTW, there is still an issue with doing a read-only loop setup on a block device: dom0 never sees it.


I can’t reproduce this anymore with fedora-41-xfce/debian-12-minimal based VMs using a dom0-provided kernel-latest-qubes-vm-6.14.4-1.


OK, I'll give read-only a try. However, I am using a Debian kernel with my debian-12-minimal VMs, because the Fedora kernels do not get along with ZFS on debian-12, so it's possible that even with your experience I will still have a problem.

Read-only is still broken for me. (I'm able to impose read-only at later steps in the process, so it's not a show-stopper.)

Your workaround does lead to an instantaneous call without fail.

The problem I’m now having is the script with the call in it never returns to the command line.

Never mind; I was using an older Debian kernel. It seems to work OK with the latest Debian kernel.

Still, I have an issue with the workaround's qrexec call not returning (though it does instantly do what it's supposed to do on the server side).

It's the cat: it is waiting to read from its standard input, which it will then forward to its stdout. If you want the group inside the {} to exit immediately, take out the cat, or just pipe the printf straight to qrexec-client-vm. (I think… unless you are already piping another stream into the {}; then the cat should exit when that stream is closed.)
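The difference is easy to demonstrate with plain shell, using command substitution as a stand-in for qrexec-client-vm (a sketch of the shell behavior, nothing Qubes-specific):

```shell
# With stdin closed, cat sees immediate EOF and the group exits at once:
out1=$({ printf %s "prefix"; cat; } </dev/null)
echo "$out1"    # → prefix

# With data on stdin, cat forwards it after the prefix -- this is what
# makes the workaround equivalent to --prefix-data:
out2=$(printf %s "-body" | { printf %s "prefix"; cat; })
echo "$out2"    # → prefix-body

# Without any stdin redirection, the cat would sit blocked on the
# terminal; if no stdin data is expected, drop the cat entirely:
out3=$(printf %s "prefix-only")
echo "$out3"    # → prefix-only
```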


I’m a bit remiss. I tried doing that about two days ago and things worked. I should have said something back then.

I then spent some time trying to figure out what it was for (under what circumstances I'd want to use it), but without success.

I have cases where the service is supposed to return a string, but it does so even without the cat.

In any case, I still have a couple of slow calls (out of the couple of dozen I was making), but they are consistently slow (3-4 seconds each, every time, rather than 10 seconds some of the time), so it is likely either a totally different issue, or it's actually supposed to be that slow.

I am afraid that even after reading your explanation I'm not sure what I'd use the cat for, but it seems to be superfluous in all my cases.

I wrongly assumed you were feeding stdin data to your calling script that the remote side should see after the prefix data, e.g. some-command | your-script or your-script <some-file. If not, then you can leave out the cat, and indeed you have to, because it will hang if it's not reading or writing any data and the remote side finishes first. My workaround syntax isn't equivalent to --prefix-data in that case.

Yeah that’s probably something different. When I tried 2000 qrexec calls in a row I also had a few of them take 2-3 seconds.


If it is helpful…

On a fairly standard 4.2.4, when everything is idle, with debian-12-xfce based AppVMs, I see 0.1-0.2 seconds.
The only way I see close to 2 seconds (on this idle machine) is when launching a new AppVM or DispVM. An AppVM launch gives one slow qrexec call; a DispVM launch gives several slow qrexec calls, maybe because of the creation of new volumes?

For Fedora-41-xfce, it looks a little slower, but my tests were not perfectly comparable.

Once the channel is created, then there is very little jitter or delay, although that is only useful if you have multiple messages to send, or if you can do the setup before you need to send the message.

I tested with a named pipe…

# make a named pipe
mkfifo myfifo
# use tail to repeatedly read the pipe and send its output to a single
# qrexec channel; the trailing '&' runs it in the background
tail -F myfifo | qrexec-client-vm targetvm servicename &
# throw lots of messages at the fifo:
while true; do
  echo "$(date -Ins) a_message" >> myfifo
done

Xen delivers the messages immediately. I'm not sure whether this simple setup will handle multiple simultaneous writers correctly, though.
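The fifo round trip itself can be sketched self-contained, with cat standing in for qrexec-client-vm. Note that plain cat exits as soon as the first writer closes the pipe (EOF), which is exactly why tail -F is used above to keep one channel open across many writes:

```shell
# Create a fifo in a scratch directory
tmp=$(mktemp -d)
mkfifo "$tmp/myfifo"

# Background reader (cat is a stand-in for qrexec-client-vm here)
cat "$tmp/myfifo" > "$tmp/out" &

# One writer: opening the fifo unblocks the reader; closing it delivers
# EOF, so the plain-cat reader exits after this single message
echo "hello over the fifo" > "$tmp/myfifo"

wait                      # reap the background cat
cat "$tmp/out"            # → hello over the fifo
rm -r "$tmp"
```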

One thing that will slay performance of absolutely everything done via dom0 is cloning a qube.

While it's doing that clone, nothing else matters: qrexec calls, Qui manager menus, and device menus will not function until it's done. Sometimes, if I'm really lucky, one thing will happen halfway through the clone; otherwise things just wait for it to finish.


I'm pretty confident my remaining timing issues have nothing to do with qrexec per se; I'm just asking for something that takes time at the other end. The fact that the two calls consistently take a total of 8-9 seconds suggests as much.

I'm trying to debug that, but the calls are being processed by a daemon I wrote, and apparently I'm not allowed to open (much less write to) a debug file from within the daemon, so I'll have to copy that code into something else to test it and try to find the inefficiency that's taking up so much time.

Still, the string of calls I am making as a whole is notably less obnoxious without the random 10 second delays in it.
