Running local LLMs with or without GPU acceleration

This guide explains how to install text-generation-webui (oobabooga) on Qubes OS 4.2.0. It can be used with NVIDIA, AMD, and Intel Arc GPUs, and/or the CPU. I will only cover NVIDIA GPUs and CPU-only setups, but the steps should be similar for the other GPU types.

The GPU used is an NVIDIA RTX 4060; the steps might not be exactly the same for NVIDIA GPUs that use the legacy driver.

Not having a GPU is going to greatly limit the size of the models you can use, and even small models are going to take a relatively long time to run.

I have tested how long it takes to answer the question 'What can you tell me about Qubes OS?' with a 7B model on three different systems running Qubes OS:

i7-8650U (old laptop CPU): ~200s
i9-13900K (desktop CPU): ~45s
NVIDIA RTX 4060 (GPU) with i9-13900K: <10s

Expect these numbers to increase drastically with the size of the model; bigger models will be practically impossible to use without a GPU.

That said, there are pretty decent 7B models, and they can run on older laptops.

Running LLMs in Qubes OS

Qubes OS isn’t the ideal platform for running LLMs, especially if you plan on running large models. The bigger models are probably going to give you memory issues unless you have a system with 64/128 GB of memory. The models also take up a lot of disk space, so you might want to use a NAS or DAS for storing the models you don’t currently use, to avoid filling up your Qubes OS storage pool.
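
If you go the DAS route, you can attach the storage’s block device to the qube from dom0 and mount it there. A minimal sketch, assuming the qube is called llm and the device shows up as sda1 in sys-usb (check qvm-block for the actual names on your system):

# In dom0: list available block devices, then attach the one holding your models
qvm-block list
qvm-block attach llm sys-usb:sda1

# Inside the qube: the attached device usually appears as /dev/xvdi
sudo mkdir -p /mnt/models
sudo mount /dev/xvdi /mnt/models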

If you don’t have a GPU, you can skip to installing text-generation-webui.

GPU passthrough

Follow this guide, which explains how to do passthrough: https://neowutran.ovh/qubes/articles/gaming_windows_hvm.html
I’ll only give a summary of how to configure GPU passthrough (example commands follow the list below); there are already multiple guides going into detail about passthrough.

You also only need CUDA support, not video output, which makes passthrough slightly easier.

  1. Find your device ID with lspci.
  2. Hide the device from dom0 by adding rd.qubes.hide_pci=ID to the grub command line.
    Regenerate the grub config and reboot: grub2-mkconfig -o /boot/grub2/grub.cfg
  3. Check that the device is hidden with sudo lspci -vvn; the kernel driver in use should be pciback.
  4. Use the patch_stubdom.sh script to patch qemu-stubdom-linux-rootfs.
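
In dom0, the summary above boils down to something like this (a sketch; 01:00.0 is just an example BDF, use whatever your own lspci output shows):

# Find the GPU's BDF (e.g. 01:00.0)
lspci | grep -i nvidia

# Hide it from dom0 by appending rd.qubes.hide_pci=01:00.0 to GRUB_CMDLINE_LINUX
# in /etc/default/grub, then regenerate the grub config and reboot
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

# After the reboot, confirm the device is bound to pciback
sudo lspci -vvn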

If you are having issues with passthrough, search the forum.

Installing the CUDA driver

wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install nvidia-kernel-open-dkms
sudo apt-get -y install cuda-drivers
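
Once the driver is installed and the qube rebooted, a quick sanity check is to run nvidia-smi inside the qube; the passed-through card should show up in its device table (if it doesn’t, see the troubleshooting replies further down):

nvidia-smi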

Installing text-generation-webui

Make a qube with 16 GB of memory (minimum 8 GB) and 25 GB of disk space. If you are using a GPU, the qube needs to be a standalone with the kernel provided by the qube itself; if you used the patch script, its name needs to start with gpu_. You also need to install the CUDA driver in it and pass the GPU through.
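
For reference, creating such a qube from dom0 looks roughly like this. A sketch only: the name gpu_llm, the sizes, and the BDF 01_00.0 are examples, and the GPU-related lines can be skipped for a CPU-only setup:

# Create a standalone qube based on the debian-12 template
qvm-create --standalone --template debian-12 --label purple gpu_llm

# Use the kernel provided by the qube and give it enough memory and disk
qvm-prefs gpu_llm virt_mode hvm
qvm-prefs gpu_llm kernel ''
qvm-prefs gpu_llm memory 16000
qvm-prefs gpu_llm maxmem 16000
qvm-volume resize gpu_llm:root 25g

# Pass the GPU through to the qube
qvm-pci attach --persistent gpu_llm dom0:01_00.0 -o permissive=true -o no-strict-reset=true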

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
./start_linux.sh

You will be asked about your hardware; either choose your GPU or select CPU.

Let the installation complete; a web server should then be running on localhost:7860.
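
If you want to confirm the server is actually up without opening a browser, a quick check from inside the qube (assuming curl is installed there) should return an HTTP 200:

curl -s -o /dev/null -w '%{http_code}\n' http://localhost:7860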

Testing a model

Mistral-7B-OpenOrca-GGUF is a good test model; it should be able to run on most hardware.

cd text-generation-webui/models
wget 'https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/resolve/main/mistral-7b-openorca.Q4_K_M.gguf?download=true' -O mistral-7b-openorca.Q4_K_M.gguf

When the file has downloaded, go back to the web interface and refresh the list in the Model tab, then select the model and load it. If you don’t have a GPU, select the CPU option before loading.
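
If you prefer to skip the UI step, text-generation-webui can also load the model at startup via command-line flags; the flag names below are from the versions I have used, so double-check them against the project’s documentation:

# Load the downloaded GGUF directly; add --cpu if you don't have a GPU
./start_linux.sh --model mistral-7b-openorca.Q4_K_M.gguf --cpu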

You should now be able to use the model in the Chat tab.

It should look something like this: [screenshot of the Chat tab]

With the Debian-12 template on Qubes 4.2, when I clone it into a standalone with passthrough, I get the error

mount: mounting /dev/mapper/dmroot on /root failed: No such device
Failed to mount /dev/mapper/dmroot as root file system

and I don’t know how to resolve this.

Debian-11 boots up fine with passthrough, but I can’t get CUDA to work or nvidia-smi to show any device. I’ve spent an ungodly amount of hours fiddling with this to make deb-11 work.

With Fedora-37/38 I never get nvidia-smi to show a device either.

Please advise. I am out of ideas. I was using LLMs flawlessly on Qubes 4.1.

Thanks for sharing, @renehoj!

Based on your screenshot, you are getting 2.72 tokens per second with your setup. I would have expected somewhat higher values for a Q4-quantized model (but maybe the 8 GB of VRAM on your 4060 is simply not that helpful in the end?).

Still, it might be worth double-checking whether it’s actually using your GPU. I believe the command is nvidia-smi.

In comparison: I don’t have a GPU, and I get similar speeds just running it on the CPU. In my experience, the most efficient way to run models is to use the new llama.cpp server directly, which also has a simple GUI now. That way you can always get the latest version of llama.cpp without waiting for Oobabooga to update it, and you have full control over all the parameters: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

Here is an example measurement from my CPU-only setup (AMD Ryzen 7 5800X, 8 cores):
Command to run the llama.cpp server with CPU only: ./server -t 4 -ngl 0 -m ~/models/mistral-7b-instruct-v0.2.Q5_K_M.gguf -c 32000

Prompt evaluation: 13.21 tokens/s
Generation: 5.38 tokens/s

Depending on your setup, you probably want to play a bit with the -t parameter, which controls how many processor threads are used.
I thought my processor would have 16 threads available, but at least with Qubes OS, -t 4 performs best for me.

I’ve set VCPUs to 16 in that Qube. Maybe that value is wrong and something could be optimized there? Thinking about it now, I guess I should set that one to 8 VCPUs and then try to use the full 16 threads in llama.cpp.
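
If you want to find the sweet spot for -t empirically, llama.cpp ships a llama-bench tool that makes this easy. A sketch, assuming you built llama-bench alongside the server binary, with the model path adjusted to yours:

# Compare prompt processing and generation speed across thread counts
for t in 2 4 8 16; do
    ./llama-bench -m ~/models/mistral-7b-instruct-v0.2.Q5_K_M.gguf -t $t
done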

The screenshot isn’t from a GPU run; it’s from a ThinkPad T480 with an i7-8650U CPU.

On the 4060 the output generation is ~44 tokens/s with the same prompt/model.

Yes, 44 tokens/s makes much more sense with a GPU.

I just did some more tests on my CPU setup, and it seems that for some reason, telling llama.cpp to use 4 threads gives the fastest speeds. It doesn’t even seem to matter whether I set VCPUs for that qube to 16 or 8.

Maybe this could be further improved?

You can try changing the core count, with and without SMT; pinning the cores could also be worth trying, as there might be some Xen overhead that can be reduced.

Trying different model formats could also be worth it; I don’t know which formats are best when running purely on the CPU.
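
For pinning, the Xen toolstack in dom0 can be used directly. A sketch, assuming the qube is called llm and you want its vCPUs kept on physical cores 4-11 (adjust to your own topology):

# In dom0: show the current vCPU-to-pCPU mapping
sudo xl vcpu-list

# Pin all vCPUs of the llm qube to pCPUs 4-11
sudo xl vcpu-pin llm all 4-11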