Has anyone been able to run LLMs on an AMD GPU with ROCm?

There’s a problem I’ve been trying to troubleshoot for a few months now, what must be hundreds of hours, with nothing to show for it. This is out of my depth and I’m not sure what to do. I would appreciate any help or suggestions!

Setup

  • 2x AMD Instinct MI100
  • ASUS WRX80E + Threadripper Pro 3955WX
  • ROCm (all versions tried)
  • Linux (all kernels tried)
  • HVM passthrough under Xen (QubesOS)

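For context, this is roughly how I sanity-check the cards inside the VM before each test (standard ROCm/PCI utilities; the exact rocm-smi flags vary a bit between ROCm releases):

```
# Confirm the passed-through GPUs enumerate at the PCI level (1002 = AMD)
lspci -d 1002: -nn

# Confirm the ROCm runtime sees both MI100s (gfx908)
rocminfo | grep -E 'Marketing Name|gfx'

# Utilization, power draw, and VRAM usage
rocm-smi --showuse --showpower --showmeminfo vram
```
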
The issues

  • When running llama.cpp (all versions tried) with CPU+GPU acceleration, the GPU locks at 100% utilization, draws ~100 W, and never recovers (repro sketch below).
  • VRAM seems to clear extremely slowly.
  • Some kernels (e.g. 5.19) work for small models but freeze when loading larger ones. CPU-only inference in system RAM works perfectly fine.

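For anyone who wants to reproduce this, here is a minimal sketch of the kind of run that hangs, assuming a HIP build of llama.cpp (the build flag has been renamed across versions, e.g. LLAMA_HIPBLAS / GGML_HIPBLAS in older trees, so treat the exact spelling as illustrative):

```
# Build llama.cpp with ROCm/HIP offload for the MI100 (gfx908)
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx908
cmake --build build -j

# Offload layers to the GPU; the freeze hits once weights land in VRAM
./build/bin/llama-cli -m ./model.gguf -ngl 32 -p "Hello"

# In a second terminal: the GPU pins at 100% / ~100 W and never recovers
watch -n 1 rocm-smi
```

CPU-only runs of the same binary and model (-ngl 0) are fine, which is why I suspect the VRAM path rather than llama.cpp itself.
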
What I think might be the cause(s)

  • A Xen hypervisor issue with VRAM↔RAM transfers.
  • Incompatibility between AMD’s kernel module (amdgpu-dkms) and Xen.
  • QubesOS’ passthrough hardening?
  • Incorrect Xen/Linux kernel parameters.
  • An IOMMU limitation? (iommu=pt has no effect; see the checks after this list.)

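To rule the last two in or out, these are the checks I’ve been running. The dom0 commands are Xen/Qubes-specific; the VM name and PCI slot below are placeholders, and permissive/no-strict-reset are the documented QubesOS options for devices that misbehave under strict passthrough, so take this as a sketch of my setup rather than advice:

```
# dom0: confirm Xen actually has a working IOMMU
xl info | grep -i virt_caps       # should include hvm_directio
xl dmesg | grep -iE 'iommu|amd-vi'

# dom0 and guest: compare the effective kernel parameters
cat /proc/cmdline

# QubesOS: reattach a GPU with relaxed passthrough hardening
# ("my-llm-vm" and the 21_00.0 slot are placeholders for my setup)
qvm-pci attach --persistent my-llm-vm dom0:21_00.0 \
    -o permissive=True -o no-strict-reset=True

# guest: measure host<->device copy bandwidth (tests the VRAM-RAM theory)
rocm-bandwidth-test
```
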
This problem is way beyond me, and I’m not sure where to start or what to look at. I’m hoping Xen 4.19 fixes it, but I still haven’t been able to identify a root cause.

Please ask me any questions if you need anything. I feel lost.
