Installed a fresh minimal template and added the networking-agent and firefox to the template, and then I made 5 test qubes with the template.
I tried using the whiskermenu and starting firefox as fast as possible; all 5 qubes worked. I tried doing the same from the terminal with 5 terminals open and the command ready to execute; same result, all 5 qubes worked. Tested both methods a second time with the same results.
I am using a reasonably fast desktop CPU, so if this is a race condition it might not trigger on my system.
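For anyone who wants to repeat the "start firefox in all test qubes at once" experiment without juggling five terminals, here is a minimal sketch for dom0. It assumes the plain `qvm-run <qube> <command>` form; the qube names `test-1` to `test-5` and the helper names `build_commands` and `launch_together` are made up for illustration and are not from the original test.

```python
import subprocess
import threading


def build_commands(qubes, command="firefox"):
    """Return one `qvm-run <qube> <command>` argument list per qube."""
    return [["qvm-run", qube, command] for qube in qubes]


def launch_together(commands, run=subprocess.run):
    """Run every command from its own thread, released by a shared barrier.

    `run` is injectable so the logic can be tried outside dom0.
    Results are returned in the same order as `commands`.
    """
    if not commands:
        return []
    barrier = threading.Barrier(len(commands))
    results = [None] * len(commands)

    def worker(index, argv):
        barrier.wait()  # every thread leaves this line at roughly the same time
        results[index] = run(argv)

    threads = [
        threading.Thread(target=worker, args=(i, argv))
        for i, argv in enumerate(commands)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Example call in dom0 (hypothetical qube names):
# launch_together(build_commands([f"test-{n}" for n in range(1, 6)]))
```

The barrier only lines up the moment the commands are issued; whether that is enough to trigger a race on slower hardware is exactly what would need testing.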
@bebf738vd dom0 and all qubes run 5.15.81-1.fc32.qubes.x86_64
New templates did not solve the issue, now recreating qubes and moving app data manually. If that doesn’t fix it I will take @enums advice and (temporarily?) switch my stuff over to fedora-minimal based.
I’m in the same boat as @BEBF738VD … Indeed sounds very strange; especially with the t430.
Is it possible some update/policy was pushed to your upstream router? Have you the same experience when leveraging an alternate uplink? Might be worth heading to a local cafe to see if the problem persists.
Outside of this, have you grepped your logs for any sign of the hardware issues/failure?
Reaching, I know ..
Have any animals in the household? Maybe open the case up and hit it with some compressed air?
recreated all templates based on debian-11-minimal
recreated all qubes from scratch based on the new templates and then manually imported the respective settings and data
recreated system and web template based on fedora-36-minimal
In all cases I get the same behavior:
the dispvm always has connectivity
the mail qubes always have connectivity
the other web qubes sometimes have, sometimes don’t … in two cases they lost connectivity while running (other qubes remained online)
I see no hints in dom0 logs.
I cannot overstate how stressful this is. This machine has been my daily driver for a long time. My setup is stable-stable. No tweaking not even installing new apps. I’ve been using it the way it was for months.
Does it make sense to investigate any iptables-related logging (I have not yet looked up what keeps logs; is iptables not among them?) to check on the web VMs that have connectivity, then lose it, then regain it? You say that you looked at the dom0 logs, but maybe also look selectively in key VMs (sys-net or equivalent, et al.) at any networking-related software that keeps logs? If you modded enough to be able to run wireshark on a few VMs as internal data collection points (within a web VM, or outside of it in a VM further along the path), then perhaps you could collect usable info about what actually happens when the loss occurs.
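As a lighter-weight complement to packet captures, a small probe running inside an affected web VM could at least record when connectivity drops and returns, so the timestamps can be lined up against whatever logs sys-net or sys-firewall keep. This is only a sketch: the probe target (`1.1.1.1:443`), the 5 second interval and the function names are placeholder assumptions, and it only tests whether a TCP connection can be opened.

```python
import socket
import time
from datetime import datetime


def can_connect(host="1.1.1.1", port=443, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(probe=can_connect, interval=5.0, rounds=None,
          sleep=time.sleep, log=print):
    """Probe repeatedly and log only the UP/DOWN transitions.

    With rounds=None the loop runs until interrupted. The list of
    observed states (one entry per transition) is returned.
    """
    transitions = []
    previous = None
    count = 0
    while rounds is None or count < rounds:
        state = probe()
        if state != previous:
            stamp = datetime.now().isoformat(timespec="seconds")
            log(f"{stamp} connectivity {'UP' if state else 'DOWN'}")
            transitions.append(state)
            previous = state
        count += 1
        sleep(interval)
    return transitions


# Example: watch()  # runs until Ctrl-C, printing one line per change
```

Logging only the transitions keeps the output short enough to compare by eye with the times at which other qubes were started or put under load.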
Thanks for all the input. I will do a complete reinstall and start off with standard Fedora templates and create my qubes from scratch. I won’t make any changes in dom0 at all. Not even whisker menu or redshift.
Then I will test this on both of my identical T430.
If I still see issues then, I guess I’ll file a bug report.
complete reinstall with defaults (no dom0 changes, no BTRFS)
switched all templates to fedora-minimal base
all new qubes from scratch (manually imported data)
experimented to find ways to restore connectivity (kill/restart the qube, sys-firewall, sys-net, all qubes)
experimented with ways to make the issue happen (most likely: CPU load or disk I/O)
the problem still happens, caused by CPU or I/O load. In a few cases the connectivity was lost when other qubes got real busy (e.g. extracting a large tarball)
while I have multiple reliable ways to make the issue happen (start multiple qubes rapidly or make some qubes consume a lot of CPU, I/O) I could not identify a reliable way to recover. Shutting all qubes down, letting the system become idle and then slowly starting them up one after the other was the only way that worked most of the time.
(@unman, @enmus) to my great surprise none of the freezes happened during any of the update/install steps when using fedora-minimal based qubes for everything
performance on my T430 when not using BTRFS is horrible, the fan blares basically non-stop, everything is real slow to the point of being unusable (increasing memory to the qubes doesn’t have any measurable impact)
I need to work and have already lost 2 days on this, need to pull the plug.
the performance issues along with the freezes when using debian-minimal leave me no choice at the moment but to revert my main system / daily driver to R4.0.4 (I understand the implications; this will still be more secure than running another OS on bare metal)
when there is free time on weekends or holidays I will use my second T430 to occasionally check back on R4.1 to see if issues got addressed.
I am a bit rattled at how quickly my beloved stable setup went to utterly useless in a few days. Need to chew on that. Happy side note is that if you have your setup well documented (like I do) switching from debian to fedora or to another version is trivial and not nearly as involved as you’d think.
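The one recovery procedure that mostly worked above (shut everything down, let the system go idle, then bring qubes up slowly one after the other) could be scripted in dom0 roughly like this. It is a sketch, not a tested fix: it assumes the plain `qvm-start <qube>` form, and the 60 second pause, the qube names in the example and the helper name `start_staggered` are arbitrary choices for illustration.

```python
import subprocess
import time


def start_staggered(qubes, pause=60.0, run=subprocess.run, sleep=time.sleep):
    """Start qubes one at a time with `qvm-start`, pausing between starts.

    Stops at the first qube whose start command reports a non-zero
    return code and returns the list of qubes started so far.
    """
    started = []
    for index, qube in enumerate(qubes):
        if index > 0:
            sleep(pause)  # give the system time to settle before the next start
        result = run(["qvm-start", qube])
        if getattr(result, "returncode", 0) != 0:
            break
        started.append(qube)
    return started


# Example call in dom0 (hypothetical qube names):
# start_staggered(["sys-net", "sys-firewall", "web", "mail"], pause=60)
```

Stopping at the first failure is deliberate: if a qube does not come up, piling more starts on top of it would add the very load that seems to trigger the problem.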
Short update: I’ve downgraded to R4.0.4 to make sure it’s not hardware related and that I haven’t just imagined “everything” being so much smoother and reliable in the past. After a month of daily usage I can report…
minimal resources needed
startup time of web qube (buster): 5.05 - 5.25
startup time of web qube (bullseye): 5.36 - 5.59
I am sure now that I didn’t just imagine this T430 being the most stable, fluid and secure system I’ve ever had. What has happened (subjective and specific to my behavior, hardware and environment) is a steady degradation of the experience. It wasn’t as “quick” as my initial impression suggested. First things needed more memory and CPU (R4.1), then the machine started freezing during updates and finally I landed in the hell described in this thread. After working from home since January 2020 this last issue started happening on the first day of my first business trip in 3 years. It was horrible timing.
Anyway. I am still not able to put my finger on what’s causing all of this. I have restored a sane environment for me to work, but it’s EOL. Luckily I have two identical machines and after making sure they are both working the same, I will then advance on one of them back to R4.1 and report back here what I am seeing.
If the problems don’t happen anymore we can close the case as “Sven doing something and a reinstall fixed it”. If that’s not the case, I’ll start creating qubes-issues and provide debug information as requested. Finally I (we) must face the possibility that these old machines are just no longer supported well (by Xen and maybe by extension Qubes OS). If that’s the case and other T430 users see the same thing it might be time to remove them from the recommended list.
Thanks for an interesting summary @Sven. I look forward to seeing the
results of your test.
I share your feeling that 4.1 is not a good fit with the older machines:
this is disastrous given that they remain the only certified hardware
that Qubes has.
The only way that I have been able to get anything like decent
performance on 4.1 on a range of x220 and x230 is by allocating more
memory to each qube, scaling back on the number of concurrent qube
sessions, and reaping qubes almost as soon as I am finished with them.
On 4.0.4 I was able to keep unused qubes hanging around in case they
were needed, which made for a much better experience.
I notice significant performance hits when new qubes are started and
when qubes are shut down. My feeling is that memory management is less
performant on 4.1, and this is particularly noticeable at these
transitions. I think this accounts for the major risk of a system
crash during updates when a large number of qubes are cycling state in
succession. I have been able to reduce the risk of this by extending the
update process with pauses between each cycle.
I never presume to speak for the Qubes team.
When I comment in the Forum or in the mailing lists I speak for myself.
Any of them might result in the instability. Some of these are easier to bisect than others. For example, it’s easier to test different dom0 kernels than Xen versions. If, unfortunately, R4.1 were confirmed to be unstable on X230 with debian-minimal templates, then I would suggest starting to build things and install them to test. Maybe the openQA workflow can be used for reference.