Qubes lack connectivity based on how they are started

I can’t reproduce it on my system.

I installed a fresh minimal template, added the networking agent and firefox to it, and then made 5 test qubes based on that template.

I tried using the whisker menu and starting firefox as fast as possible: all 5 qubes worked. I tried doing the same from the terminal, with 5 terminals open and the command ready to execute: same result, all 5 qubes worked. I tested both methods a second time with the same results.

I am using a reasonably fast desktop CPU, so if this is a race condition it might not trigger on my system.
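
The terminal variant of this test can also be scripted so that all launches fire at nearly the same moment. A minimal sketch in Python for dom0 follows. It relies only on `qvm-run <qube> <command>`; the qube names in the usage note are hypothetical, and the `runner` parameter is an addition for illustration so the logic can be tried without touching real qubes.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def launch_command(qube, app="firefox"):
    """Build the dom0 command that starts `app` in `qube`."""
    return ["qvm-run", qube, app]


def launch_all(qubes, app="firefox", runner=subprocess.run):
    """Start `app` in every listed qube at (nearly) the same time.

    Returns a dict mapping each qube name to the exit code of its qvm-run call.
    """
    def start(qube):
        result = runner(launch_command(qube, app))
        return qube, result.returncode

    # One worker thread per qube, so no launch waits for another one.
    with ThreadPoolExecutor(max_workers=max(1, len(qubes))) as pool:
        return dict(pool.map(start, qubes))
```

Calling `launch_all(["test-1", "test-2", "test-3", "test-4", "test-5"])` in dom0 would mimic the five prepared terminals. The exit codes say nothing about connectivity; that still has to be checked in each qube.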

@cayce all as in all … there are only debian minimal based templates/qubes. I also documented how I create them in this forum and on my website (unfinished draft).

@enmus very interesting. I’ll give this a try soon.

What kernel are you running on dom0 and vms?

I’ve also been running an all-debian minimal setup (kernel-latest for dom0 and 5.10 for vms) for months now without a single issue, so the situation you’ve described is indeed weird.

@bebf738vd dom0 and all qubes run 5.15.81-1.fc32.qubes.x86_64

New templates did not solve the issue, so I am now recreating the qubes and moving app data manually. If that doesn’t fix it I will take @enmus’s advice and (temporarily?) switch my stuff over to fedora-minimal based templates.

I’m in the same boat as @BEBF738VD … Indeed sounds very strange; especially with the t430. :thinking:

Is it possible some update/policy was pushed to your upstream router? Have you the same experience when leveraging an alternate uplink? Might be worth heading to a local cafe to see if the problem persists.

Outside of this, have you grepped your logs for any sign of hardware issues/failure?

Reaching, I know ..

Have any animals in the household? Maybe open the case up and hit it with some compressed air?

When you decide on this, you can PM me and I can send you my notes on creating different templates and use cases. They look, for example, like this:

fedora-37-min-sys-usb-xHCI-template
--------------------------------
mlocate qubes-input-proxy-sender qubes-usb-proxy usbutils

Maybe it is needed:
[user@dom0 ~]$ qvm-pci attach --persistent --option permissive=true sys-usb dom0:00_14.0

fedora-37-min-sys-firewall-template
-------------------------------
iproute iptables-legacy iptables-legacy-libs iptables-libs nftables qubes-core-agent-dom0-updates qubes-core-agent-networking tinyproxy

etc…
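
Notes in this shape map naturally onto a small lookup table. The Python sketch below stores the two package lists quoted above and builds the dom0 command that would install them in a template through `qvm-run --pass-io --user=root`. Treating the note headings as template names and using `dnf install -y` are assumptions for illustration; the function only builds the command and executes nothing.

```python
# Package lists copied from the notes above.
# Assumption: each note heading doubles as the template's name.
TEMPLATE_PACKAGES = {
    "fedora-37-min-sys-usb-xHCI-template": [
        "mlocate", "qubes-input-proxy-sender", "qubes-usb-proxy", "usbutils",
    ],
    "fedora-37-min-sys-firewall-template": [
        "iproute", "iptables-legacy", "iptables-legacy-libs", "iptables-libs",
        "nftables", "qubes-core-agent-dom0-updates",
        "qubes-core-agent-networking", "tinyproxy",
    ],
}


def install_command(template):
    """Build the dom0 command that installs the template's packages as root."""
    packages = " ".join(TEMPLATE_PACKAGES[template])
    return ["qvm-run", "--pass-io", "--user=root", template,
            f"dnf install -y {packages}"]
```

Keeping the lists in one place like this makes it easy to rebuild a template from scratch or to compare what each role actually needs.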

Ok, here are all the things I did:

  1. recreated all templates based on debian-11-minimal
  2. recreated all qubes from scratch based on the new templates and then manually imported the respective settings and data
  3. recreated system and web template based on fedora-36-minimal

In all cases I get the same behavior:

  • the dispvm always has connectivity
  • the mail qubes always have connectivity
  • the other web qubes sometimes have connectivity and sometimes don’t … in two cases they lost it while running (other qubes remained online)

I see no hints in dom0 logs.

I cannot overstate how stressful this is. This machine has been my daily driver for a long time. My setup is stable-stable. No tweaking, not even installing new apps. I’ve been using it the way it was for months.

Karma?

I’m likely off-base but:

Does it make sense to investigate any iptables-related logging (I have not yet looked up what keeps logs; is iptables not among them?) in order to check on the web VMs that have, then lose, then regain connectivity? You say that you looked at the dom0 logs, but maybe also check selected key VMs (sys-net or equivalent, et al.) for any highly specific networking-related software that keeps logs? If you modified things enough to be able to run wireshark on a few VMs as internal data collection points (within a web VM, or outside of it in a VM further along the path), then perhaps you could collect usable info about what actually happens when the loss occurs.
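
A low-tech complement to packet captures is to probe each qube from dom0 at a fixed interval and keep a timestamped log, so that a loss can later be lined up against CPU or I/O load. The Python sketch below assumes `ping` is installed in the qubes (it may be missing in a minimal template) and that `host` is an address expected to answer. The function names and the log format are made up for illustration.

```python
import subprocess
import time
from datetime import datetime


def probe_command(qube, host):
    """dom0 command that sends a single ping from inside `qube` to `host`."""
    return ["qvm-run", "--pass-io", qube, f"ping -c 1 -W 2 {host}"]


def is_online(qube, host, runner=subprocess.run):
    """True if the probe command exits with status 0."""
    result = runner(probe_command(qube, host), capture_output=True)
    return result.returncode == 0


def log_line(qube, online, now=None):
    """Format one log entry: ISO timestamp, qube name, state."""
    now = now or datetime.now()
    state = "online" if online else "OFFLINE"
    return f"{now.isoformat(timespec='seconds')} {qube} {state}"


def monitor(qubes, host, rounds, interval=30,
            runner=subprocess.run, sleep=time.sleep):
    """Probe every qube `rounds` times, sleeping `interval` seconds in between."""
    lines = []
    for index in range(rounds):
        for qube in qubes:
            lines.append(log_line(qube, is_online(qube, host, runner)))
        if index < rounds - 1:
            sleep(interval)
    return lines
```

Writing the returned lines to a file while deliberately loading another qube would show whether the OFFLINE entries cluster around the busy periods.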

Maybe Layer 7?

CVE-2022-23529

Thanks for all the input. I will do a complete reinstall and start off with standard Fedora templates and create my qubes from scratch. I won’t make any changes in dom0 at all. Not even whisker menu or redshift.

Then I will test this on both of my identical T430.

If I still see issues then, I guess I’ll file a bug report.

@Sven

Is it possible that the root cause is related to a VPN kill switch feature?

I sometimes have a similar issue where I can or cannot connect, because I forget that the VPN kill switch feature is on, so I have no connection whenever the VPN cannot connect.

@newbie I don’t use VPN (anymore).

Things I have tried:

  • complete reinstall with defaults (no dom0 changes, no BTRFS)
  • switched all templates to fedora-minimal base
  • all new qubes from scratch (manually imported data)
  • experimented to find ways to restore connectivity (kill/restart the qube, sys-firewall, sys-net, all qubes)
  • experimented with ways to make the issue happen (most likely: CPU load or disk I/O)

Results:

  • the problem still happens and is caused by CPU or I/O load. In a few cases the connectivity was lost when other qubes got real busy (e.g. extracting a large tarball)
  • while I have multiple reliable ways to make the issue happen (start multiple qubes rapidly or make some qubes consume a lot of CPU or I/O), I could not identify a reliable way to recover. Shutting all qubes down, letting the system become idle and then slowly starting them up one after the other was the only way that worked most of the time.
  • (@unman, @enmus) to my great surprise, none of the freezes happened during any of the update/install steps when using fedora-minimal based qubes for everything
  • performance on my T430 when not using BTRFS is horrible, the fan blares basically non-stop, everything is real slow to the point of being unusable (increasing memory to the qubes doesn’t have any measurable impact)

Next steps:

  • I need to work and have already lost 2 days on this, need to pull the plug.
  • the performance issues along with the freezes when using debian-minimal leave me no choice at the moment but to revert my main system / daily driver to R4.0.4 (I understand the implications; this will still be more secure than running another OS on bare metal)
  • when there is free time on weekends or holidays I will use my second T430 to occasionally check back on R4.1 to see if issues got addressed.

I am a bit rattled at how quickly my beloved stable setup went to utterly useless in a few days. Need to chew on that. Happy side note is that if you have your setup well documented (like I do), switching from debian to fedora or to another version is trivial and not nearly as involved as you’d think.

Have these words stuck right above your display.

Then you’d know if it’s about the hardware if outcomes would differ?

@Sven

I’m not sure whether this is a good alternative, but maybe you want to try creating 3 sys-net and 3 sys-firewall qubes:

  • 1 is connected via wifi card,
  • 1 is connected via USB dongle to wifi,
  • 1 is connected via ethernet

Short update: I’ve downgraded to R4.0.4 to make sure it’s not hardware related and that I haven’t just imagined “everything” being so much smoother and reliable in the past. After a month of daily usage I can report…

Qubes OS: R4.0.4
Xen: 4.8.5-42.fc25
Kernel: 5.4.190-1
Templates: all debian-minimal
Filesystem: BTRFS

  • super stable
  • very fluid
  • minimal resources needed

Memory by qube type:

  • sys-*: 250 MB
  • web app: 400 MB
  • mail: 500 MB
  • web: 1000 MB
  • windows: 4000 MB

startup time of web qube (buster): 5.05 - 5.25
startup time of web qube (bullseye): 5.36 - 5.59

I am sure now that I didn’t just imagine this T430 being the most stable, fluid and secure system I’ve ever had. What has happened (subjective and specific to my behavior, hardware and environment) is a steady degradation of the experience. It wasn’t as quick as I initially thought. First, things needed more memory and CPU (R4.1), then the machine started freezing during updates, and finally I landed in the hell described in this thread. After working from home since January 2020, this last issue started happening on the first day of my first business trip in 3 years. It was horrible timing.

Anyway. I am still not able to put my finger on what’s causing all of this. I have restored a sane environment for me to work, but it’s EOL. Luckily I have two identical machines and after making sure they are both working the same, I will then advance on one of them back to R4.1 and report back here what I am seeing.

If the problems don’t happen anymore we can close the case as “Sven doing something and a reinstall fixed it”. If that’s not the case, I’ll start creating qubes-issues and provide debug information as requested. Finally I (we) must face the possibility that these old machines are just no longer supported well (by Xen and maybe by extension Qubes OS). If that’s the case and other T430 users see the same thing, it might be time to remove them from the recommended list.

Thanks for an interesting summary @Sven. I look forward to seeing the
results of your test.
I share your feeling that 4.1 is not a good fit with the older machines:
this is disastrous given that they remain the only certified hardware
that Qubes has.

The only way that I have been able to get anything like decent
performance on 4.1 on a range of x220 and x230 is by allocating more
memory to each qube, scaling back on the number of concurrent qube
sessions, and reaping qubes almost as soon as I am finished with them.
On 4.0.4 I was able to keep unused qubes hanging around in case they
were needed, which made for a much better experience.

I notice significant performance hits when new qubes are started and
when qubes are shut down. My feeling is that memory management is less
performant on 4.1, and this is particularly noticeable at these
transitions. I think this accounts for the major risk of a system
crash during updates when a large number of qubes are cycling state in
succession. I have been able to reduce the risk of this by extending the
update process with pauses between each cycle.
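
The pauses described here can be scripted. The Python sketch below, meant for dom0, updates templates one at a time and sleeps between them. The `qubesctl` invocation is an assumption based on the salt-based update command in the Qubes documentation and should be verified against your release; the 60-second pause is a placeholder, not a value from this post.

```python
import subprocess
import time


def update_command(template):
    """Salt-based update of a single template (assumed qubesctl invocation)."""
    return ["sudo", "qubesctl", "--skip-dom0", f"--targets={template}",
            "--show-output", "state.sls", "update.qubes-vm"]


def update_with_pauses(templates, pause_seconds=60,
                       runner=subprocess.run, sleep=time.sleep):
    """Update templates strictly one after another, pausing between them.

    Returns a dict mapping each template name to the update's exit code.
    """
    results = {}
    for index, template in enumerate(templates):
        results[template] = runner(update_command(template)).returncode
        # Let the system settle before the next qube starts and stops.
        if index < len(templates) - 1:
            sleep(pause_seconds)
    return results
```

Running the updates serially keeps only one qube cycling state at any time, which is the point of the workaround; the right pause length is something to find by experiment on the machine in question.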

I never presume to speak for the Qubes team. When I comment in the Forum or in the mailing lists I speak for myself.

There are a few main differences between R4.0.4 and R4.1.

  • Xen version
  • Dom0 kernel version
  • VM kernel version
  • Qubes infrastructure (qrexec, core-admin-agent, etc.)
  • Debian template version

Any of them might result in the instability. Some of these are easier to bisect than others. For example, it’s easier to test different dom0 kernels than Xen versions. If, unfortunately, R4.1 is confirmed to be unstable on the X230 with debian-minimal templates, then I would suggest starting to build the individual components and installing them to test. Maybe the workflow of openQA can be used for reference.

This is well known.
It has been reported (and this was my experience) that a clean install
of 4.1 did not evidence these issues, and that they developed after a
series of updates.

I’m still rather suspicious of the changes in/to the newer xen memory grants infrastructure leading to much instability…perhaps through memory fragmentation…or something else.

B

I’m back on R4.1 for a bit over a week and things are stable.

I can’t observe the original issue of this thread nor any crash/freeze. My conclusion is that in my specific case the root cause more likely than not was in the version of coreboot/heads I built myself (v0.2.0-1150) from the osresearch/heads repository. Since I switched to v1.4 from the Nitrokey/heads repository, no more issues can be observed:

  1. The R4.1 installer was always crashing on me and it took 2-3 attempts to install Qubes OS. In hindsight that was an obvious sign of trouble and I can’t explain why I simply ignored it. Now it just works the first time as expected (exact same version as tested before).
  2. Although running for over a week for 24h/day with 10+ hours of active use … no freeze or crash observed.
  3. Did multiple salt-based updates … no issues.

Qubes OS: R4.1.2
Xen: 4.14.5
Kernel: 5.15.103-1
Templates: all debian-minimal
Filesystem: BTRFS

Memory by qube type:

  • sys-*: 250 MB
  • web app: 400 MB
  • mail: 500 MB
  • web: 1000 MB
  • windows: 2048 MB

startup time of web qube (bullseye): 6.70 - 7.09

  • The size of my windows qube has to be limited to 2048 MB, due to a regression in the stubdom as far as I gather. With any larger size the qube won’t boot with the USB controller attached.
  • The general startup times show a pretty hefty performance hit if you compare them to the numbers I get on the identical hardware with R4.0.4. This is very annoying, but I am not holding my breath for that to get fixed anymore. It’s likely an upstream Xen issue and I don’t think anyone can make them care about 10+ year old CPUs anymore :frowning:
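
To put a number on that performance hit, the startup ranges reported in this thread for the bullseye web qube can be compared at their midpoints. The sketch below is plain arithmetic in Python. The unit is not stated in the posts (presumably seconds), and comparing midpoints rather than best or worst cases is an assumption.

```python
def midpoint(low, high):
    """Middle of a reported startup-time range."""
    return (low + high) / 2


def slowdown_percent(old_range, new_range):
    """Relative increase of the new midpoint over the old one, in percent."""
    old = midpoint(*old_range)
    new = midpoint(*new_range)
    return (new - old) / old * 100


# Startup times of the bullseye web qube as reported in this thread.
R404_BULLSEYE = (5.36, 5.59)
R412_BULLSEYE = (6.70, 7.09)

print(f"R4.1.2 vs R4.0.4: {slowdown_percent(R404_BULLSEYE, R412_BULLSEYE):.1f}% slower")
```

With these figures the midpoints are 5.475 and 6.895, so startup on R4.1.2 comes out roughly 26% slower on the same hardware.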