Unable to run LLMs with an AMD GPU with ROCm

Rel4967 · August 7, 2024, 11:22pm

I’ve been trying to troubleshoot a problem for a few months now, over what must be hundreds of hours, with nothing to show for it. This is out of my depth and I’m not sure what to do. I would appreciate any help or suggestions!

Setup

x2 MI100
Asus WRX80e + Threadripper Pro 3955WX
ROCm all versions
Linux all kernels
HVM Passthrough

The issues

When trying to run llama.cpp (all versions) with CPU+GPU acceleration, the GPU freezes at 100% utilization, draws 100W, and never recovers.
VRAM seems to clear extremely slowly.
Some kernels (5.19) seem to work for small models but freeze when loading larger models. Running from RAM only works perfectly fine.

What I think might be the cause(s)

Xen Hypervisor issue with VRAM - RAM communication.
AMD’s kernel module (amdgpu-dkms) and Xen compatibility
QubesOS’ passthrough hardening?
Incorrect Xen/Linux kernel parameters
IOMMU limitation? (iommu=pt has no effect)
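
For anyone wanting to double-check which IOMMU-related parameters a kernel actually booted with (for example, whether `iommu=pt` was applied at all), here is a minimal Python sketch. It only parses the Linux kernel command line from `/proc/cmdline` of the system it runs on; it does not read Xen's own boot options, and it cannot tell whether a parameter is the right one. The function names and the list of watched keys are assumptions chosen for illustration, not an authoritative checklist.

```python
from pathlib import Path
from typing import Dict, Optional

# Keys to report. This selection is an assumption for illustration only.
WATCHED_KEYS = ("iommu", "amd_iommu")
# Module options appear as "<module>.<option>=<value>" on the command line.
WATCHED_PREFIXES = ("amdgpu.",)


def parse_cmdline(text: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {key: value}.

    Bare flags such as "quiet" map to None. Quoted values containing
    spaces are not handled by this simple whitespace split.
    """
    params: Dict[str, Optional[str]] = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def passthrough_params(params: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Keep only the parameters that match the watched keys or prefixes."""
    return {
        key: value
        for key, value in params.items()
        if key in WATCHED_KEYS or key.startswith(WATCHED_PREFIXES)
    }


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Return the command line of the running Linux kernel."""
    return Path(path).read_text()
```

On a Linux system, `passthrough_params(parse_cmdline(read_cmdline()))` returns whatever subset of those parameters is set. An empty result means none of them reached the kernel, which is worth ruling out before concluding that a parameter "has no effect".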

This problem is way beyond me and I’m not sure where to start or where to look. I’m hoping Xen 4.19 fixes the problem but I still haven’t been able to figure out a root cause.

Please ask me any questions if you need anything. I feel lost.

4e56rdt89yuh8y09uhj · February 3, 2025, 4:48pm

Did you find a solution?

87uyredfswesdzt · February 3, 2025, 4:49pm

Did you find a solution?

wmrom2 · February 3, 2025, 4:56pm

It’s likely the program you are using doesn’t have a workaround for an integrated GPU.

I am not explaining the rest of this well, so bear with my non-expert understanding:

If you use an integrated GPU that is also handling your video output, a program may try to use 100 percent of the GPU for the model, leaving the GPU unable to handle both the regular video output and that load. (I read this a while ago; it may have been 100 percent of GPU memory.)

Have you tried LM Studio, which has more parameters to limit how much data gets sent to the GPU? Does this happen with all model programs?
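
llama.cpp also has a knob for this: `-ngl` (`--n-gpu-layers`) sets how many model layers are offloaded to the GPU. Since small models reportedly work on some kernels while larger ones freeze, one way to narrow it down is to raise the layer count step by step and note where it stops completing. Below is a rough Python sketch of that idea. The binary name `llama-cli`, the model path, the prompt, and the timeout are all assumptions, and flags can differ between llama.cpp versions, so treat it as a starting point rather than a tested recipe.

```python
import subprocess


def build_command(binary, model_path, gpu_layers, prompt="Hello", n_predict=16):
    """Build a short llama.cpp run that offloads `gpu_layers` layers."""
    return [
        binary,
        "-m", model_path,          # model file (path is an assumption)
        "-ngl", str(gpu_layers),   # number of layers offloaded to the GPU
        "-p", prompt,              # short prompt to keep the run brief
        "-n", str(n_predict),      # number of tokens to generate
    ]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome as ok, failed, timeout, or missing."""
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # avoid waiting on interactive input
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child is killed on timeout. A process stuck in the GPU driver
        # may not die cleanly, so this call itself could block in that case.
        return "timeout"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "failed"


def sweep(binary, model_path, layer_counts, timeout_s=300):
    """Try increasing layer counts and stop at the first run that is not ok."""
    results = {}
    for gpu_layers in layer_counts:
        cmd = build_command(binary, model_path, gpu_layers)
        status = run_with_timeout(cmd, timeout_s)
        results[gpu_layers] = status
        if status != "ok":
            break
    return results
```

A call such as `sweep("llama-cli", "model.gguf", [0, 4, 8, 16, 32])` (both names hypothetical) returns a dictionary like `{0: "ok", 4: "ok", 8: "timeout"}`, which at least shows whether the freeze depends on how much of the model lands in VRAM.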

solene · February 3, 2025, 5:47pm

It seems OP has two of these accelerators, https://www.amd.com/fr/products/accelerators/instinct/mi100.html, and uses passthrough to a qube. The qube does not even need to use them for video rendering because it can use the Qubes OS video driver as usual.

wmrom2 · February 4, 2025, 12:49am

You’re right. If it had been that easy, the poster probably would have solved it.

Did the original poster try different programs? Some programs just don’t know how to implement ROCm correctly and it’s harder to get them to work. I wish I knew how LM Studio worked for the poster, so that it could help others troubleshoot.