@cayce all as in all … there are only debian minimal based templates/qubes. I also documented how I create them in this forum and on my website (unfinished draft).
@enmus very interesting. I’ll give this a try soon.
What kernel are you running on dom0 and vms?
I’ve also been running an all-debian minimal setup (kernel-latest for dom0 and 5.10 for vms) for months now without a single issue, so the situation you’ve described is indeed weird.
@bebf738vd dom0 and all qubes run 5.15.81-1.fc32.qubes.x86_64
New templates did not solve the issue, so I am now recreating qubes and moving app data manually. If that doesn’t fix it, I will take @enmus’s advice and (temporarily?) switch my stuff over to fedora-minimal based templates.
I’m in the same boat as @BEBF738VD … Indeed sounds very strange; especially with the t430.
Is it possible some update/policy was pushed to your upstream router? Have you the same experience when leveraging an alternate uplink? Might be worth heading to a local cafe to see if the problem persists.
Outside of this, have you grepped your logs for any sign of hardware issues or failure?
Have any animals in the household? Maybe open the case up and hit it with some compressed air?
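For the log check suggested above, something like the following could be a starting point in dom0. It is only a sketch: `filter_hw_errors` is a hypothetical helper name, and the keyword list is an illustrative assumption rather than a complete set of hardware failure signatures.

```shell
#!/bin/sh
# Hypothetical helper: keep only log lines that often accompany hardware trouble.
# The patterns below are illustrative assumptions, not an exhaustive list.
filter_hw_errors() {
    grep -Ei 'hardware error|mce:|i/o error|ata[0-9]+.*(error|failed)|thermal|pcie bus error'
}

# Usage in dom0 (reads from stdin, so any log source can be piped in):
#   sudo journalctl -k -b | filter_hw_errors   # kernel messages of the current boot
#   sudo xl dmesg | filter_hw_errors           # Xen hypervisor messages
```

An empty result does not rule out a hardware problem; it only means none of these particular keywords showed up.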
When you decide this, you can PM me and I can send you my notes on creating different templates and use cases. They look for example like:
fedora-37-min-sys-usb-xHCI-template
--------------------------------
mlocate qubes-input-proxy-sender qubes-usb-proxy usbutils
This may also be needed:
[user@dom0 ~]$ qvm-pci attach --persistent --option permissive=true sys-usb dom0:00_14.0
fedora-37-min-sys-firewall-template
-------------------------------
iproute iptables-legacy iptables-legacy-libs iptables-libs nftables qubes-core-agent-dom0-updates qubes-core-agent-networking tinyproxy
etc…
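Notes like these could be turned into a small dom0 helper. The sketch below is an assumption about how one might do it, not the method used for the notes above: `install_in_template` is a hypothetical name, it assumes a Fedora-based template (hence `dnf`), and it defaults to a dry run that only prints the command.

```shell
#!/bin/sh
# Hypothetical helper: install a package list into a template from dom0.
# With DRY_RUN=1 (the default) it only prints the command it would run.
install_in_template() {
    template=$1; shift
    cmd="dnf install -y $*"
    if [ "${DRY_RUN:-1}" -eq 1 ]; then
        printf 'qvm-run --pass-io -u root %s "%s"\n' "$template" "$cmd"
    else
        # Minimal templates have no passwordless sudo by default, hence -u root.
        qvm-run --pass-io -u root "$template" "$cmd"
    fi
}

# Example with the package list from the notes above (prints the command only):
#   install_in_template fedora-37-min-sys-usb-xHCI-template \
#       mlocate qubes-input-proxy-sender qubes-usb-proxy usbutils
```

Set `DRY_RUN=0` to actually run the installation once the printed command looks right.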
Ok, here are all the things I did:
In all cases I get the same behavior:
I see no hints in dom0 logs.
I cannot overstate how stressful this is. This machine has been my daily driver for a long time. My setup is stable-stable. No tweaking, not even installing new apps. I’ve been using it the way it was for months.
I’m likely off-base but:
Does it make sense to look into iptables-related logging (I have not yet checked what keeps logs, and iptables may not be among them) to check on the web VMs that lose and then regain connectivity? You say that you looked at the dom0 logs, but maybe some networking-related software in key VMs (sys-net or equivalent) keeps logs of its own? If you modified a few VMs enough to run Wireshark as internal data collection points (within a web VM, or in a VM further along the path), then perhaps you could collect usable info about what actually happens when the loss occurs.
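A lighter-weight variant of the data collection idea above would be to log timestamps whenever connectivity changes inside an affected qube. The following is a minimal sketch: the function names are hypothetical, the probe uses `ping` with the Linux iputils `-W` timeout option, and the target address is whatever host or gateway you choose to test against.

```shell
#!/bin/sh
# Hypothetical helper: print a timestamped line only when the state changes.
log_transition() {
    prev=$1; cur=$2
    if [ "$prev" != "$cur" ]; then
        printf '%s connectivity %s -> %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$prev" "$cur"
    fi
}

# Probe once: prints "up" if a single ping succeeds within 2 seconds, else "down".
# Assumes Linux iputils ping (-W is the reply timeout in seconds).
probe() {
    if ping -c 1 -W 2 "$1" >/dev/null 2>&1; then echo up; else echo down; fi
}

# Poll the target forever and record every up/down transition.
monitor() {
    target=$1; interval=${2:-5}; state=unknown
    while :; do
        new=$(probe "$target")
        log_transition "$state" "$new"
        state=$new
        sleep "$interval"
    done
}

# Usage inside a qube (the target address is a placeholder):
#   monitor <gateway-or-host> 5 >> ~/connectivity.log
```

Running the same loop in the web qube, sys-firewall and sys-net at the same time and comparing the timestamps could show at which hop the loss starts.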
Thanks for all the input. I will do a complete reinstall and start off with standard Fedora templates and create my qubes from scratch. I won’t make any changes in dom0 at all. Not even whisker menu or redshift.
Then I will test this on both of my identical T430.
If I still see issues then, I guess I’ll file a bug report.
Is it possible that the root cause is related to a VPN kill switch feature?
Sometimes I have a similar issue where I can or cannot connect,
because I forget that the VPN kill switch is on,
so I have no connection whenever the VPN cannot connect.
@newbie I don’t use VPN (anymore).
Things I have tried:
Results:
Next steps:
I am a bit rattled at how quickly my beloved stable setup went to utterly useless in a few days. Need to chew on that. The happy side note is that if you have your setup well documented (like I do), switching from debian to fedora or to another version is trivial and not nearly as involved as you’d think.
Have these words stuck right above your display.
Then you’d know if it’s about the hardware if outcomes would differ?
Not sure whether this is a good alternative solution,
but maybe you want to try
creating 3 sys-net & 3 sys-firewall,
Short update: I’ve downgraded to R4.0.4 to make sure it’s not hardware related and that I haven’t just imagined “everything” being so much smoother and reliable in the past. After a month of daily usage I can report…
Qubes OS | R4.0.4 |
Xen | 4.8.5-42.fc25 |
Kernel | 5.4.190-1 |
Templates | all debian-minimal |
Filesystem | BTRFS |
qube type | memory |
---|---|
sys-* | 250 MB |
web app | 400 MB |
500 MB | |
web | 1000 MB |
windows | 4000 MB |
startup time of web qube (buster): 5.05 - 5.25
startup time of web qube (bullseye): 5.36 - 5.59
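For anyone who wants to compare numbers, one way to take such a measurement in dom0 is sketched below. This is an assumption about a possible method, not necessarily how the figures above were obtained: `time_cmd` is a hypothetical helper, it relies on GNU `date` for `%N`, and it measures only until the command returns.

```shell
#!/bin/sh
# Hypothetical helper: print the wall-clock seconds a command takes.
# Assumes GNU date, which supports %N (nanoseconds).
time_cmd() {
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.2f\n", (e - s) / 1e9 }'
}

# Usage in dom0, with the qube shut down beforehand ("web" is a placeholder name):
#   time_cmd qvm-start web
```

Note that `qvm-start` returning is not the same as an application window appearing, so results measured this way are only comparable with each other.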
I am sure now that I didn’t just imagine this T430 being the most stable, fluid and secure system I’ve ever had. What has happened (subjective and specific to my behavior, hardware and environment) is a steady degradation of the experience. It wasn’t as quick as my initial impression suggested. First, things needed more memory and CPU (R4.1), then the machine started freezing during updates, and finally I landed in the hell described in this thread. After working from home since January 2020, this last issue started happening on the first day of my first business trip in 3 years. It was horrible timing.
Anyway. I am still not able to put my finger on what’s causing all of this. I have restored a sane environment for me to work in, but it’s EOL. Luckily I have two identical machines, and after making sure they both behave the same, I will move one of them back to R4.1 and report back here what I am seeing.
If the problems don’t happen anymore, we can close the case as “Sven doing something and a reinstall fixed it”. If that’s not the case, I’ll start creating qubes-issues and provide debug information as requested. Finally, I (we) must face the possibility that these old machines are just no longer supported well (by Xen and maybe by extension Qubes OS). If that’s the case and other T430 users see the same thing, it might be time to remove them from the recommended list.
Thanks for an interesting summary @Sven. I look forward to seeing the
results of your test.
I share your feeling that 4.1 is not a good fit with the older machines:
this is disastrous given that they remain the only certified hardware
that Qubes has.
The only way that I have been able to get anything like decent
performance on 4.1 on a range of x220 and x230 is by allocating more
memory to each qube, scaling back on the number of concurrent qube
sessions, and reaping qubes almost as soon as I am finished with them.
On 4.0.4 I was able to keep unused qubes hanging around in case they
were needed, which made for a much better experience.
I notice significant performance hits when new qubes are started and
when qubes are shut down. My feeling is that memory management is less
performant on 4.1, and this is particularly noticeable at these
transitions. I think this accounts for the major risk of a system
crash during updates when a large number of qubes are cycling state in
succession. I have been able to reduce the risk of this by extending the
update process with pauses between each cycle.
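The idea of pausing between update cycles could be scripted roughly like this in dom0. It is only a sketch under stated assumptions: `run_with_pauses` and `update_template` are hypothetical helper names, the template names in the usage line are placeholders, and the `apt-get` line assumes Debian-based templates.

```shell
#!/bin/sh
# Hypothetical helper: run a command once per item, sleeping between items
# so that qube start-ups and shutdowns do not pile up.
run_with_pauses() {
    pause=$1; shift
    cmd=$1; shift
    first=1
    for item in "$@"; do
        if [ "$first" -eq 0 ]; then sleep "$pause"; fi
        first=0
        "$cmd" "$item" || echo "warning: $cmd failed for $item" >&2
    done
}

# Hypothetical per-template update step (assumes a Debian-based template).
update_template() {
    qvm-run --pass-io -u root "$1" 'apt-get update && apt-get -y dist-upgrade' \
        && qvm-shutdown --wait "$1"
}

# Usage in dom0 (template names are placeholders, pause is in seconds):
#   run_with_pauses 120 update_template debian-11-minimal my-other-template
```

The pause length is a guess to tune per machine; the point is simply that only one template cycles state at a time.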
There are a few main differences between R4.0.4 and R4.1.
Any of them might result in the instability. Some of these are easier to bisect than others. For example, it’s easier to test different dom0 kernels than Xen versions. If, unfortunately, R4.1 is confirmed to be unstable on the X230 with debian-minimal templates, then I would suggest starting to build things and install them to test. Maybe the openQA workflow can be used for reference.
This is well known.
It has been reported (and this was my experience) that a clean install
of 4.1 did not evidence these issues, and that they developed after a
series of updates.
I’m still rather suspicious of the changes in/to the newer xen memory grants infrastructure leading to much instability…perhaps through memory fragmentation…or something else.
B
I’ve been back on R4.1 for a bit over a week and things are stable.
I can’t observe the original issue of this thread, nor any crash/freeze. My conclusion is that in my specific case the root cause more likely than not was in the version of coreboot/heads I built myself (v0.2.0-1150) from the osresearch/heads repository. Once I switched to v1.4 from the Nitrokey/heads repository, no more issues could be observed:
Qubes OS | R4.1.2 |
Xen | 4.14.5 |
Kernel | 5.15.103-1 |
Templates | all debian-minimal |
Filesystem | BTRFS |
qube type | memory |
---|---|
sys-* | 250 MB |
web app | 400 MB |
500 MB | |
web | 1000 MB |
windows | 2048 MB |
startup time of web qube (bullseye): 6.70 - 7.09
Is that 6 MINUTES? Just to start a single qube?