Running local LLMs with or without GPU acceleration

This guide explains how to install text-generation-webui (oobabooga) on Qubes OS 4.2.0. It can be used with NVIDIA, AMD, and Intel Arc GPUs, and/or the CPU. I will only cover NVIDIA GPUs and CPU-only setups, but the steps should be similar for the other GPU types.

The GPU used is an NVIDIA RTX 4060; the steps might not be exactly the same for NVIDIA GPUs that use the legacy driver.

Not having a GPU is going to greatly limit the size of the models you can use, and even small models are going to take a relatively long time to run.

I have tested how long it takes to answer the question 'What can you tell me about Qubes OS?' with a 7B model on three different systems running Qubes OS:

i7-8650U (old laptop CPU): ~200s
i9-13900K (desktop CPU): ~45s
NVIDIA RTX 4060 (GPU) with i9-13900K: <10s

Expect these numbers to increase drastically with the size of the model; bigger models will be practically impossible to use without a GPU.

That said, there are pretty decent 7B models, and they can run on older laptops.

Running LLMs in Qubes OS

Qubes OS isn’t the ideal platform for running LLMs, especially if you plan on running large models. The bigger models are probably going to give you memory issues unless you have a system with 64/128 GB of memory. The models also take up a lot of disk space, so you might want to use a NAS or DAS for storing the models you don’t currently use, to avoid filling up your Qubes OS storage pool.
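
If you go the DAS route, you can attach the storage’s block device to the qube from dom0 and mount it there. A minimal sketch, assuming the qube is called llm and the device shows up as sda1 in sys-usb (check qvm-block for the actual names on your system):

# In dom0: list available block devices, then attach the one holding your models
qvm-block list
qvm-block attach llm sys-usb:sda1

# Inside the qube: the attached device usually appears as /dev/xvdi
sudo mkdir -p /mnt/models
sudo mount /dev/xvdi /mnt/models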

If you don’t have a GPU, you can skip to installing text-generation-webui.

GPU passthrough

Follow this guide, which explains how to do passthrough: https://neowutran.ovh/qubes/articles/gaming_windows_hvm.html
I’ll only give a summary of how to configure GPU passthrough (example commands follow the list below); there are already multiple guides going into detail about passthrough.

You also only need CUDA support, not video output, which makes passthrough slightly easier.

  1. Find your device ID with lspci.
  2. Hide the device from dom0 by adding rd.qubes.hide_pci=ID to the grub command line.
    Regenerate the grub config and reboot: grub2-mkconfig -o /boot/grub2/grub.cfg
  3. Check that the device is hidden with sudo lspci -vvn; the kernel driver in use should be pciback.
  4. Use the patch_stubdom.sh script to patch qemu-stubdom-linux-rootfs.
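
In dom0, the summary above boils down to something like this (a sketch; 01:00.0 is just an example BDF, use whatever your own lspci output shows):

# Find the GPU's BDF (e.g. 01:00.0)
lspci | grep -i nvidia

# Hide it from dom0 by appending rd.qubes.hide_pci=01:00.0 to GRUB_CMDLINE_LINUX
# in /etc/default/grub, then regenerate the grub config and reboot
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

# After the reboot, confirm the device is bound to pciback
sudo lspci -vvn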

If you are having issues with passthrough, search the forum.

Installing the CUDA driver

wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install nvidia-kernel-open-dkms
sudo apt-get -y install cuda-drivers
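
Once the driver is installed and the qube rebooted, a quick sanity check is to run nvidia-smi inside the qube; the passed-through card should show up in its device table (if it doesn’t, see the troubleshooting replies further down):

nvidia-smi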

Installing text-generation-webui

Make a qube with 16 GB of memory (minimum 8 GB) and 25 GB of disk space. If you are using a GPU, the qube needs to be a standalone with the kernel provided by the qube itself; if you used the patch script, its name needs to start with gpu_. You also need to install the CUDA driver in it and pass the GPU through.
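
For reference, creating such a qube from dom0 looks roughly like this. A sketch only: the name gpu_llm, the sizes, and the BDF 01_00.0 are examples, and the GPU-related lines can be skipped for a CPU-only setup:

# Create a standalone qube based on the debian-12 template
qvm-create --standalone --template debian-12 --label purple gpu_llm

# Use the kernel provided by the qube and give it enough memory and disk
qvm-prefs gpu_llm virt_mode hvm
qvm-prefs gpu_llm kernel ''
qvm-prefs gpu_llm memory 16000
qvm-prefs gpu_llm maxmem 16000
qvm-volume resize gpu_llm:root 25g

# Pass the GPU through to the qube
qvm-pci attach --persistent gpu_llm dom0:01_00.0 -o permissive=true -o no-strict-reset=true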

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
./start_linux.sh

You will be asked about your hardware; either choose your GPU or select CPU.

Let the installation complete; a web server should then be running on localhost:7860.
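
If you want to confirm the server is actually up without opening a browser, a quick check from inside the qube (assuming curl is installed there) should return an HTTP 200:

curl -s -o /dev/null -w '%{http_code}\n' http://localhost:7860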

Testing a model

Mistral-7B-OpenOrca-GGUF is a good test model; it should be able to run on most hardware.

cd text-generation-webui/models
wget 'https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/resolve/main/mistral-7b-openorca.Q4_K_M.gguf?download=true' -O mistral-7b-openorca.Q4_K_M.gguf

When the file has downloaded, go back to the web interface and refresh the list in the Model tab, then select the model and load it. If you don’t have a GPU, select the CPU option before loading.
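
If you prefer to skip the UI step, text-generation-webui can also load the model at startup via command-line flags; the flag names below are from the versions I have used, so double-check them against the project’s documentation:

# Load the downloaded GGUF directly; add --cpu if you don't have a GPU
./start_linux.sh --model mistral-7b-openorca.Q4_K_M.gguf --cpu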

You should now be able to use the model in the Chat tab.

It should look something like this: [screenshot of the Chat tab]

With the Debian-12 template on Qubes 4.2, when I clone it into a standalone with passthrough, I get the error

mount: mounting /dev/mapper/dmroot on /root failed: No such device
Failed to mount /dev/mapper/dmroot as root file system

and I don’t know how to resolve this.

Debian-11 boots up fine with passthrough, but I can’t get CUDA to work or nvidia-smi to show any device. I’ve spent an ungodly amount of hours fiddling with this to make deb-11 work.

With Fedora-37/38 I never get nvidia-smi to show a device either.

Please advise. I am out of ideas. I was using LLMs flawlessly on Qubes 4.1.

Thanks for sharing, @renehoj!

Based on your screenshot, you are getting 2.72 tokens per second with your setup. I would have expected somewhat higher values for a Q4-quantized model (but maybe the 8 GB of VRAM on your 4060 is simply not that helpful in the end?).

Still, it might be worth double-checking whether it’s actually using your GPU. I believe the command is nvidia-smi.

In comparison: I don’t have a GPU, and I get similar speeds just running it on the CPU. In my experience, the most efficient way to run models is to use the new llama.cpp server directly, which also has a simple GUI now. That way you can always get the latest version of llama.cpp without waiting for Oobabooga to update it, and you have full control over all the parameters: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

Here is an example measurement from my CPU-only setup (AMD Ryzen 7 5800X, 8 cores):
Command to run the llama.cpp server with CPU only: ./server -t 4 -ngl 0 -m ~/models/mistral-7b-instruct-v0.2.Q5_K_M.gguf -c 32000

Prompt evaluation: 13.21 tokens/s
Generation: 5.38 tokens/s

Depending on your setup, you probably want to play a bit with the -t parameter, which controls how many processor threads are used.
I thought my processor would have 16 threads available, but at least with Qubes OS, -t 4 performs best for me.

I’ve set VCPUs to 16 in that Qube. Maybe that value is wrong and something could be optimized there? Thinking about it now, I guess I should set that one to 8 VCPUs and then try to use the full 16 threads in llama.cpp.
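
If you want to find the sweet spot for -t empirically, llama.cpp ships a llama-bench tool that makes this easy. A sketch, assuming you built llama-bench alongside the server binary, with the model path adjusted to yours:

# Compare prompt processing and generation speed across thread counts
for t in 2 4 8 16; do
    ./llama-bench -m ~/models/mistral-7b-instruct-v0.2.Q5_K_M.gguf -t $t
done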

The screenshot isn’t from a GPU run; it’s from a ThinkPad T480 with an i7-8650U CPU.

On the 4060 the output generation is ~44 tokens/s with the same prompt/model.

Yes, 44 tokens/s makes much more sense with a GPU.

I just did some more tests on my CPU setup, and it seems that for some reason, telling llama.cpp to use 4 threads gives the fastest speeds. It doesn’t even seem to matter whether I set VCPUs for that qube to 16 or 8.

Maybe this could be further improved?

You can try changing the core count, with and without SMT; pinning the cores could also be worth trying, as there might be some Xen overhead that can be reduced.

Trying different model formats could also be worth it; I don’t know which formats are best when running purely on the CPU.
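
For pinning, the Xen toolstack in dom0 can be used directly. A sketch, assuming the qube is called llm and you want its vCPUs kept on physical cores 4-11 (adjust to your own topology):

# In dom0: show the current vCPU-to-pCPU mapping
sudo xl vcpu-list

# Pin all vCPUs of the llm qube to pCPUs 4-11
sudo xl vcpu-pin llm all 4-11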