As I understand it, it’s (often) the current requirements… as load changes, a disk can go from milliamps to amps of current, sometimes within very short time periods.
Snore...
If there’s any resistance or stray inductance in the supply lines, it can lead to a voltage at the disk, or elsewhere, going out of compliance. Digital electronics often have a limited tolerance for signals going outside the supply range. At worst, the parasitic structures around the input protection diodes can get latched into an unrecoverable conducting state until the next power cycle (at least, they used to; modern hardware is pretty well protected, I think, thanks to hot-swap requirements, so I don’t know if it’s still a significant problem).
Also, if there’s enough resistance to get local heating, it can make a normally-OK contact fail, temporarily or permanently, or become intermittent.
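To put rough numbers on the two effects above (voltage drop from resistance plus stray inductance, and heating at a resistive contact), here is a small sketch. Every value in it is an assumption I picked for illustration, not a measurement from any real disk, cable, or connector, and the function names are my own.

```python
def supply_drop(current_a, di_dt_a_per_s, resistance_ohm, inductance_h):
    """Voltage lost along the supply line: resistive part (I*R) plus inductive part (L*dI/dt)."""
    return current_a * resistance_ohm + inductance_h * di_dt_a_per_s


def contact_heating(current_a, resistance_ohm):
    """Power dissipated in a resistive contact: I^2 * R."""
    return current_a ** 2 * resistance_ohm


# Assumed, illustrative values only
R = 0.1              # ohm, a slightly poor connector or cable
L = 1e-6             # henry, stray inductance of the wiring
I = 2.0              # amps drawn once the disk is busy
dI_dt = 2.0 / 1e-6   # a 2 A step over 1 microsecond, in A/s

drop = supply_drop(I, dI_dt, R, L)   # roughly 0.2 V resistive + 2 V inductive
heat = contact_heating(I, R)         # roughly 0.4 W in the contact

print(f"worst-case drop during the step: {drop:.2f} V")
print(f"steady heating in the contact:   {heat:.2f} W")
```

With these made-up numbers the inductive part dominates during the fast step, which is why a supply that looks fine on a slow multimeter can still dip badly for an instant.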
Depending on the specific mode of failure, and how the devices behave, it could require power cycling, a cooling down period, or anything up to replacement due to permanent damage.
A really good infra-red camera can sometimes be a great help… although not always (I haven’t had access to one for quite a long time, so maybe even the mass-market ones are good enough.)
If we’re lucky, it’s only a problem in the cable… or a dirty connector we can clean, or “re-seat”… although in this case it’s not so clear.
Of course, it could still be a driver or hardware design bug…
This was the kind of detailed response I hoped for, thank you.
It sounds like the most likely symptom of the type of failure you describe is a system crash (which might look like a total and unrecoverable freeze). Is that something you’d agree with?
But then this isn’t consistent with a failure type described in the OP:
maybe the mouse moves with an extreme delay but thats it.
Rather it suggests to me the system maintains its coherence, and is still running, albeit imperceptibly slowly.
…I’ve learned IRL to carefully watch my audience for glazed eyes and other signs of “too much detail”.
The mouse thing is interesting - I’ve got no model for failure modes in the presence of hardware virt. On plain Linux/MSwin I would once have said it’s just an interrupt handler that’s still working, but the kernel is dead or locked in some bus-reset death-loop.
With Xen and hw virt, I guess there’s still the possibility of low-level stuff limping along…
…(wild speculation here…) maybe the CPU is fine, every VM kernel is idle, waiting on something to come from a disk, but the USB controller just manages to raise a single interrupt before it gets reset, and the GUI has plenty of time to handle it.
I don’t know enough about how Linux handles misbehaving PCI/disk devices to be able to say… But it is telling that kernel devs can use USB for debugging, which suggests it’s a subsystem that doesn’t give up easily.
That cable looks correct - I am guessing it connects from one supply to the other: one position which normally has a black wire, and the position which normally has a green wire.
I haven’t studied the details for modern PC hardware, but from (moderate) experience with general electronics, often it will work, except for when it doesn’t. Some supplies pull their output to zero if they get overloaded, and if zero isn’t the same everywhere, then the other supply might be pulling in a different direction, possibly via the bus driver chips. Similarly if a component fails…
With hot-swappable hardware, it might not cause permanent damage (I don’t know this, just guessing), but I wouldn’t risk it, even just to avoid possible data corruption.
Is it the same after changing the Xen scheduler from credit2 to credit?
To change it, add sched=credit to GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub and regenerate grub.cfg.
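If you’d rather not hand-edit the file, here is a minimal sketch of the text change itself. It assumes the variable is set on a single line with a double-quoted value; the sample contents and the helper name add_xen_option are made up for illustration, and it only works on a string, so it never touches the real /etc/default/grub.

```python
import re


def add_xen_option(grub_text, option="sched=credit"):
    """Return grub_text with `option` set inside GRUB_CMDLINE_XEN_DEFAULT.

    Assumes a single-line, double-quoted assignment (an assumption, not
    something every installation guarantees).
    """
    pattern = re.compile(r'^(GRUB_CMDLINE_XEN_DEFAULT=")([^"]*)(")', re.MULTILINE)
    key = option.split("=", 1)[0]

    def repl(match):
        # Drop any existing value for the same key, then append the new one.
        args = [a for a in match.group(2).split() if a.split("=", 1)[0] != key]
        args.append(option)
        return match.group(1) + " ".join(args) + match.group(3)

    new_text, count = pattern.subn(repl, grub_text)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_XEN_DEFAULT not found")
    return new_text


# Hypothetical file contents, for illustration only
sample = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_XEN_DEFAULT="console=none sched=credit2"\n'
updated = add_xen_option(sample)
print(updated)
```

The result would still need to be written back as root, and grub.cfg regenerated afterwards, before the scheduler change takes effect on the next boot.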
I’m not a Xen expert, but from what I understand, credit2 was designed to reduce latency and improve real-time performance. However, it is not as good at fair allocation of physical CPUs compared to the credit scheduler. This can lead to situations where some vCPUs are consistently assigned to certain physical CPU cores, causing those cores to become overloaded and potentially slowing down the system.
But, since you are using a Zen3 APU, it might be worth trying some common workarounds used in general Linux systems, such as disabling C-State C6.
Honestly it’s running incredibly smoothly right now.
I’ll mark this as solved already and leave it that way as long as it doesn’t crash again (the scrub will take a while to complete).