Qubes lack connectivity based on how they are started

@Sven

Is it possible that the root cause is related to a VPN kill switch feature?

Sometimes I have a similar issue where I can or cannot connect,
because I forget that the VPN kill switch is on,
so I have no connection whenever the VPN fails to connect.

@newbie I don’t use VPN (anymore).

Things I have tried:

  • complete reinstall with defaults (no dom0 changes, no BTRFS)
  • switched all templates to fedora-minimal base
  • all new qubes from scratch (manually imported data)
  • experimented to find ways to restore connectivity (kill/restart the qube, sys-firewall, sys-net, all qubes)
  • experimented with ways to make the issue happen (most likely: CPU load or disk I/O)

Results:

  • the problem still happens, caused by CPU or I/O load. In a few cases the connectivity was lost when other qubes got really busy (e.g. extracting a large tarball)
  • while I have multiple reliable ways to make the issue happen (start multiple qubes rapidly or make some qubes consume a lot of CPU, I/O) I could not identify a reliable way to recover. Shutting all qubes down, letting the system become idle and then slowly starting them up one after the other was the only way that worked most of the time.
  • (@unman, @enmus) to my great surprise none of the freezes happened during any of the update/install steps when using fedora-minimal based qubes for everything
  • performance on my T430 when not using BTRFS is horrible, the fan blares basically non-stop, everything is real slow to the point of being unusable (increasing memory to the qubes doesn’t have any measurable impact)
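
The recovery procedure from the second result (shut everything down, let the system go idle, then start qubes one after the other) can be sketched as a small dom0 script. This is only a sketch under assumptions: the qube names, the load threshold and the pause are placeholders, and the dom0 load average is a rough proxy for "idle" that does not see load inside other qubes. `qvm-shutdown --all --wait` and `qvm-start` are the standard Qubes tools.

```python
import os
import subprocess
import time

# Hypothetical values; adjust to your own setup.
QUBES = ["sys-net", "sys-firewall", "mail", "web"]
MAX_LOAD = 0.5      # 1-minute dom0 load average treated as "idle"
PAUSE_SECONDS = 30  # minimum gap between two starts


def wait_until_idle(max_load, get_load=os.getloadavg, sleep=time.sleep, poll=5):
    """Block until the 1-minute load average drops to max_load or below."""
    while get_load()[0] > max_load:
        sleep(poll)


def staggered_start(qubes, run=subprocess.run, sleep=time.sleep,
                    get_load=os.getloadavg):
    """Stop every qube, then start the given ones one after the other."""
    run(["qvm-shutdown", "--all", "--wait"], check=True)
    for name in qubes:
        # Only start the next qube once dom0 has calmed down again.
        wait_until_idle(MAX_LOAD, get_load=get_load, sleep=sleep)
        run(["qvm-start", name], check=True)
        sleep(PAUSE_SECONDS)
```

Calling `staggered_start(QUBES)` in dom0 would run the whole sequence. Given that this approach only worked "most of the time", it is a workaround to experiment with, not a fix.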

Next steps:

  • I need to work and have already lost 2 days on this, need to pull the plug.
  • the performance issues along with the freezes when using debian-minimal leave me no choice at the moment but to revert my main system / daily driver to R4.0.4 (I understand the implications; this will still be more secure than running another OS on bare metal)
  • when there is free time on weekends or holidays I will use my second T430 to occasionally check back on R4.1 to see if issues got addressed.

I am a bit rattled at how quickly my beloved stable setup became utterly useless within a few days. Need to chew on that. Happy side note is that if you have your setup well documented (like I do), switching from debian to fedora or to another version is trivial and not nearly as involved as you’d think.

Have these words stuck right above your display.

Then you’d know whether it’s about the hardware, if the outcomes differ?

@Sven

I’m not sure whether this is a good alternative solution,
but maybe you want to try
creating 3 sys-net and 3 sys-firewall qubes:

  • one connected via the wifi card,
  • one connected via a USB wifi dongle,
  • one connected via ethernet

Short update: I’ve downgraded to R4.0.4 to make sure it’s not hardware related and that I haven’t just imagined “everything” being so much smoother and more reliable in the past. After a month of daily usage I can report…

Qubes OS     R4.0.4
Xen          4.8.5-42.fc25
Kernel       5.4.190-1
Templates    all debian-minimal
Filesystem   BTRFS

  • super stable
  • very fluid
  • minimal resources needed

qube type    memory
sys-*        250 MB
web app      400 MB
mail         500 MB
web          1000 MB
windows      4000 MB

startup time of web qube (buster): 5.05 - 5.25
startup time of web qube (bullseye): 5.36 - 5.59
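
The thread does not say how these startup times were measured. One way to get comparable numbers, assuming the measurement is simply the wall-clock duration of `qvm-start` in dom0 with a full shutdown between runs, is sketched below. The qube name and the number of runs are placeholders.

```python
import subprocess
import time


def time_command(argv, run=subprocess.run, clock=time.perf_counter):
    """Return the wall-clock seconds taken by one command."""
    begin = clock()
    run(argv, check=True)
    return clock() - begin


def startup_range(qube, runs=5, run=subprocess.run, clock=time.perf_counter):
    """Start and stop a qube several times; return (fastest, slowest) start."""
    samples = []
    for _ in range(runs):
        samples.append(time_command(["qvm-start", qube], run=run, clock=clock))
        # Shut the qube down again so every sample is a cold start.
        run(["qvm-shutdown", "--wait", qube], check=True)
    return min(samples), max(samples)
```

For example, `low, high = startup_range("web")` followed by `print(f"{low:.2f} - {high:.2f}")` would print a range in the same form as the figures above.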

I am sure now that I didn’t just imagine this T430 being the most stable, fluid and secure system I’ve ever had. What has happened (subjective and specific to my behavior, hardware and environment) is a steady degradation of the experience. It wasn’t as “quick” as my initial impression suggested. First things needed more memory and CPU (R4.1), then the machine started freezing during updates, and finally I landed in the hell described in this thread. After working from home since January 2020, this last issue started happening on the first day of my first business trip in 3 years. It was horrible timing.

Anyway. I am still not able to put my finger on what’s causing all of this. I have restored a sane environment for me to work in, but it’s EOL. Luckily I have two identical machines, and after making sure they are both working the same, I will move one of them back to R4.1 and report back here what I am seeing.

If the problems don’t happen anymore we can close the case as “Sven doing something and a reinstall fixed it”. If that’s not the case, I’ll start creating qubes-issues and provide debug information as requested. Finally I (we) must face the possibility that these old machines are just no longer supported well (by Xen and maybe by extension Qubes OS). If that’s the case and other T430 users see the same thing, it might be time to remove them from the recommended list.

Thanks for an interesting summary @Sven. I look forward to seeing the
results of your test.
I share your feeling that 4.1 is not a good fit with the older machines:
this is disastrous given that they remain the only certified hardware
that Qubes has.

The only way that I have been able to get anything like decent
performance on 4.1 on a range of x220 and x230 is by allocating more
memory to each qube, scaling back on the number of concurrent qube
sessions, and reaping qubes almost as soon as I am finished with them.
On 4.0.4 I was able to keep unused qubes hanging around in case they
were needed, which made for a much better experience.

I notice significant performance hits when new qubes are started and
when qubes are shut down. My feeling is that memory management is less
performant on 4.1, and this is particularly noticeable at these
transitions. I think this accounts for the major risk of a system
crash during updates when a large number of qubes are cycling state in
succession. I have been able to reduce the risk of this by extending the
update process with pauses between each cycle.
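
The idea of extending the update process with pauses can be sketched as follows. This is not the actual script used above (the thread does not show it). It assumes the salt-based updater is invoked through `qubesctl` in dom0, one template at a time, with a fixed pause in between. The template names and the pause length are placeholders.

```python
import subprocess
import time

# Hypothetical template names and pause length.
TEMPLATES = ["debian-11-minimal", "fedora-36-minimal"]
PAUSE_SECONDS = 120


def update_command(template):
    """Build the qubesctl call that updates a single template."""
    return ["sudo", "qubesctl", "--skip-dom0", f"--targets={template}",
            "state.sls", "update.qubes-vm"]


def update_with_pauses(templates, pause=PAUSE_SECONDS,
                       run=subprocess.run, sleep=time.sleep):
    """Update templates strictly one at a time, pausing between them."""
    failed = []
    for index, template in enumerate(templates):
        if index > 0:
            sleep(pause)  # let the previous qube shut down and memory settle
        result = run(update_command(template))
        if result.returncode != 0:
            failed.append(template)
    return failed
```

`update_with_pauses(TEMPLATES)` returns the names of the templates whose update did not exit cleanly, so they can be retried later instead of restarting the whole cycle.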

I never presume to speak for the Qubes team. When I comment in the Forum or in the mailing lists I speak for myself.

There are a few main differences between R4.0.4 and R4.1.

  • Xen version
  • Dom0 kernel version
  • VM kernel version
  • Qubes infrastructure (qrexec, core-admin-agent, etc.)
  • Debian template version

Any of them might result in the instability. Some of these are easier to bisect than others. For example, it’s easier to test different dom0 kernels than Xen versions. If, unfortunately, R4.1 is confirmed to be unstable on the X230 with debian-minimal templates, then I would suggest building the individual components and installing them to test. Maybe the workflow of openQA can be used for reference.

This is well known.
It has been reported (and this was my experience) that a clean install
of 4.1 did not evidence these issues, and that they developed after a
series of updates.

I’m still rather suspicious of the changes to the newer Xen memory grants infrastructure leading to much instability… perhaps through memory fragmentation… or something else.

B

I’m back on R4.1 for a bit over a week and things are stable.

I can’t observe the original issue of this thread nor any crash/freeze. My conclusion is that in my specific case the root cause more likely than not was in the version of coreboot/heads I built myself (v0.2.0-1150) from the osresearch/heads repository. Since I switched to v1.4 from the Nitrokey/heads repository, no more issues can be observed:

  1. The R4.1 installer was always crashing on me and it took 2-3 attempts to install Qubes OS. In hindsight that was an obvious sign of trouble and I can’t explain why I simply ignored it. Now it just works the first time as expected (exact same version as tested before).
  2. Although running for over a week for 24h/day with 10+ hours of active use … no freeze or crash observed.
  3. Did multiple salt-based updates … no issues.

Qubes OS     R4.1.2
Xen          4.14.5
Kernel       5.15.103-1
Templates    all debian-minimal
Filesystem   BTRFS

qube type    memory
sys-*        250 MB
web app      400 MB
mail         500 MB
web          1000 MB
windows      2048 MB

startup time of web qube (bullseye): 6.70 - 7.09

  • The memory of my Windows qube has to be limited to 2048 MB, due to a regression in the stubdom as far as I gather. With any larger size the qube won’t boot with a USB controller attached.
  • The general startup times show a pretty hefty performance hit compared to the numbers I get on the identical hardware with R4.0.4. This is very annoying, but I am not holding my breath for that to get fixed anymore. It’s likely an upstream Xen issue and I don’t think anyone can make them care about 10+ year old CPUs anymore :frowning:

Is that 6 MINUTES? Just to start a single qube?

It’s seconds. :slight_smile:

If it were minutes I wouldn’t be so casual about it. For comparison, these are the times I see on R4.0 using the exact same hardware:

startup time of web qube (buster): 5.05 - 5.25
startup time of web qube (bullseye): 5.36 - 5.59
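
To put a number on the performance hit, the bullseye startup times reported in this thread (5.36 - 5.59 seconds on R4.0.4 versus 6.70 - 7.09 seconds on R4.1.2) can be compared directly. The short calculation below uses only those four figures and works out to roughly 25 to 27 percent slower.

```python
# Startup times of the bullseye web qube as reported in the thread (seconds).
r40 = (5.36, 5.59)   # R4.0.4: fastest, slowest
r41 = (6.70, 7.09)   # R4.1.2: fastest, slowest


def slowdown_percent(old, new):
    """Relative increase of new over old, in percent."""
    return (new - old) / old * 100


best = slowdown_percent(r40[0], r41[0])
worst = slowdown_percent(r40[1], r41[1])
print(f"fastest runs: +{best:.1f}%  slowest runs: +{worst:.1f}%")
```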

Phew

hello sven

Famous last words.

Freeze happened twice this week. Journal says the last thing that happened before the last freeze is systemd-tmpfiles-clean.

:frowning:
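
For anyone wanting to check the same thing after a freeze: once the machine has been rebooted, the end of the previous boot's journal can be read in dom0 with `journalctl --boot=-1`. A minimal sketch, assuming the journal is stored persistently (otherwise the previous boot is not kept) and that the last 50 lines are enough. Depending on permissions it may need to be run as root.

```python
import subprocess


def previous_boot_tail_command(lines=50):
    """Build the journalctl call for the end of the previous boot's log."""
    return ["journalctl", "--boot=-1", f"--lines={lines}", "--no-pager"]


def last_entries_before_freeze(lines=50, run=subprocess.run):
    """Return the final journal lines of the previous boot as a list."""
    result = run(previous_boot_tail_command(lines),
                 capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

The last entry only shows what was logged before the freeze, not necessarily what caused it, so it is worth comparing the tail across several freezes.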
