I’ve been trying to troubleshoot a problem for a few months now, over what must be hundreds of hours, with nothing to show for it. This is out of my depth and I’m not sure what to do. I would appreciate any help or suggestions!
Setup
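- 2x AMD Instinct MI100
- ASUS WRX80E motherboard + Threadripper Pro 3955WX
- ROCm: all versions (every one I have tried)
- Linux: all kernels (every one I have tried)
- PCI passthrough of the GPUs to an HVM guest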
The issues
- When running llama.cpp (all versions) with CPU+GPU acceleration, the GPU freezes at 100% utilization, draws 100 W, and never recovers.
- VRAM seems to clear extremely slowly.
- Some kernels (5.19) seem to work for small models but freeze when loading larger ones. Running from RAM only (no GPU offload) works perfectly fine.
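To put numbers on the "freezes at 100%" and "VRAM clears slowly" symptoms, something like the following could be run inside the HVM guest while llama.cpp loads a model. This is only a minimal sketch: it assumes the amdgpu driver exposes its usual sysfs counters (mem_info_vram_used, mem_info_vram_total, gpu_busy_percent) under /sys/class/drm, and the function names are made up for illustration.

```python
from pathlib import Path
import time


def read_int(path: Path):
    """Return the integer stored in a sysfs file, or None if missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def find_amdgpu_devices(drm_root: Path = Path("/sys/class/drm")):
    """Return device directories that expose amdgpu's VRAM counters."""
    return sorted(
        d for d in drm_root.glob("card*/device")
        if (d / "mem_info_vram_used").exists()
    )


def sample_gpu(device_dir: Path) -> dict:
    """Take one reading of VRAM usage (MiB) and GPU busy percentage."""
    mib = 1024 * 1024
    used = read_int(device_dir / "mem_info_vram_used")    # bytes
    total = read_int(device_dir / "mem_info_vram_total")  # bytes
    busy = read_int(device_dir / "gpu_busy_percent")      # 0-100
    return {
        "device": str(device_dir),
        "vram_used_mib": None if used is None else used // mib,
        "vram_total_mib": None if total is None else total // mib,
        "busy_percent": busy,
    }


if __name__ == "__main__":
    devices = find_amdgpu_devices()
    if not devices:
        print("no amdgpu VRAM counters found under /sys/class/drm")
    # Five one-second samples; raise the count to cover a full model load.
    for _ in range(5 if devices else 0):
        for dev in devices:
            print(time.strftime("%H:%M:%S"), sample_gpu(dev))
        time.sleep(1.0)
```

A log of these samples from just before the freeze until well after it would show whether VRAM is still filling when the GPU hangs, and how long it really takes to drain afterwards.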
What I think might be the cause(s)
- A Xen hypervisor issue with VRAM-to-RAM communication
- An incompatibility between AMD’s kernel module (amdgpu-dkms) and Xen
- QubesOS’ passthrough hardening?
- Incorrect Xen/Linux kernel parameters
- IOMMU limitation? (iommu=pt has no effect)
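To narrow down which of these suspects is actually involved, the guest's kernel parameters and kernel log could be checked in one pass. The sketch below is a rough starting point, not a diagnosis tool: the pattern list is my own guess at which log lines are worth flagging, and the log is read from a file saved beforehand (for example the output of dmesg redirected to a file).

```python
import re
import sys
from pathlib import Path

# Illustrative patterns only: an assumption about which log lines are worth a look.
PATTERNS = [
    r"amdgpu.*timeout",
    r"GPU reset",
    r"VM_L2_PROTECTION_FAULT",
    r"AMD-Vi",
    r"swiotlb",
]


def parse_cmdline(text: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to True."""
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def suspicious_lines(log_text: str, patterns=PATTERNS) -> list:
    """Return the log lines that match any of the given patterns."""
    regex = re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)
    return [line for line in log_text.splitlines() if regex.search(line)]


if __name__ == "__main__":
    cmdline = Path("/proc/cmdline")
    if cmdline.exists():
        params = parse_cmdline(cmdline.read_text())
        # Show only the parameters relevant to the suspects listed above.
        for key in sorted(params):
            if "iommu" in key or key.startswith("amdgpu"):
                print(f"{key}={params[key]}")
    # Optional first argument: path to a saved kernel log.
    if len(sys.argv) > 1:
        log_text = Path(sys.argv[1]).read_text(errors="replace")
        for line in suspicious_lines(log_text):
            print(line)
```

Note that /proc/cmdline inside the guest only shows the guest kernel's parameters. The Xen and dom0 options are set in dom0, so those, along with the hypervisor log from xl dmesg, would have to be collected there separately.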
This problem is way beyond me and I’m not sure where to start or where to look. I’m hoping Xen 4.19 fixes it, but I still haven’t been able to pin down a root cause.
Please ask if you need any more information. I feel lost.