I’m curious about running local AI chatbot models on Qubes. Does anyone here have experience with that? Any recommendations or best practices?
I’m running Qubes on a pretty powerful desktop, so performance-wise I should be okay. However, I’m unsure how well they will run without NVIDIA’s GPU drivers.
Has anyone tested CPU-only performance? And is it worth the hassle?
Hey,
I can say one thing: CPU performance is fine.
My approach was to install the MSTY.ai application in a separate standalone qube, download the AI model I was interested in, and then disconnect the qube from the internet.
The entire configuration in Msty is easy, but I would certainly get better results if I attached a second graphics card to the VM running MSTY.ai.
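For anyone wanting to reproduce a setup like this, the dom0 side is roughly the following (a sketch, not an exact recipe: the qube name `msty`, the template `debian-12`, and the resource numbers are all examples, adjust to your install):

```shell
# In dom0: create a standalone qube from an existing template
# (names "msty" and "debian-12" are examples)
qvm-create --class StandaloneVM --label purple --template debian-12 msty

# Give it generous memory and cores for CPU-only inference
qvm-prefs msty memory 16000
qvm-prefs msty maxmem 16000
qvm-prefs msty vcpus 8

# After installing Msty and downloading the model inside the qube,
# cut network access entirely by setting no netvm
qvm-prefs msty netvm ''
```

With `netvm` set to none, the qube keeps working offline and the downloaded model can no longer phone home.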
Depends on how many cores you are willing to give it. If you are running large models, don’t have a server-class CPU core count, and have the option to pass through a GPU for acceleration, do it.
There is some rationale for running it on the CPU, though. One example is if you are limited in the number of GPUs (one for the guivm, at least one for compute, and you need another one for something else, but can’t add more): you might tolerate the lower speed, especially if your CPU is capable. On the other hand, this can also be solved by something like seamless GPU passthrough.
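If you do go the passthrough route, attaching a secondary GPU to a compute qube is a one-time dom0 step; a rough sketch (the qube name `llm` and the device address `01_00.0` are examples, find yours with `qvm-pci`):

```shell
# In dom0: list PCI devices to find the GPU's backend:BDF identifier
qvm-pci list

# Attach the GPU persistently to the compute qube;
# no-strict-reset is often needed for consumer GPUs
qvm-pci attach --persistent --option no-strict-reset=true llm dom0:01_00.0
```

Note that disabling strict reset has security implications (the device may retain state across qube restarts), so it belongs on a dedicated compute qube, not a general-purpose one.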
I have one qube with GPU pass-through running Ollama, and my other qubes can connect to the Ollama API using qrexec.
The qrexec binding works like SSH port forwarding: it binds port 11434 to localhost on the qubes that want to use the Ollama API. This makes Ollama straightforward to use from any qube; to applications running in the qube, it looks like Ollama is running on localhost.
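In case it helps others, this kind of binding can be done with the built-in `qubes.ConnectTCP` qrexec service; a sketch, assuming the GPU qube is named `ollama-gpu` and the client qube is `work` (both names, and the policy file name, are examples):

```shell
# In dom0: allow the client qube to reach port 11434 on the Ollama qube
echo 'qubes.ConnectTCP +11434 work @default allow target=ollama-gpu' \
    | sudo tee /etc/qubes/policy.d/30-ollama.policy

# In the client qube: bind the remote port 11434 to localhost:11434
qvm-connect-tcp ::11434

# Applications in the client qube now see Ollama on localhost
curl http://localhost:11434/api/generate \
    -d '{"model": "llama3", "prompt": "hello", "stream": false}'
```

The `@default` target in the policy means the client does not need to know (or be able to choose) which qube actually runs Ollama; dom0 policy decides.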
You can do the same without using a GPU, but the performance will not be great.
I have two 4060 GPUs in my desktop system, one of which is dedicated to only running LLMs. I don’t know what hardware you have, but many AMD motherboards have an extra 4x PCIe slot with CPU-connected lanes; I use that slot for the extra GPU.