Does Anyone Know How to Create a Local Speech to Text VM within Qubes?

This may not be that Qubes-specific, but if someone has already done this and knows a smart approach, it could save me some time.

I would really like a Linux-based, open-source speech-to-text tool so I can dictate and have my words converted to text. I want this done locally on a GPU, not with everything I say sent to some profit-seeking corporation that will monetize it. I expect this is possible.

I would prefer something that can use the GPU if that makes it faster.

I would also like an open-source, mostly local AI that can access the Internet, similar in concept to some of the features Microsoft offers, but open source and local so my data isn't being sold. It's probably not possible, but it's something I'd like.


Not what you want to hear, but I recently tried to set up a speech-to-text tool to speed up blogging. This is the best thing I came up with:

/home/user/.virtualenvs/nerd-dictation/bin/python3 /home/user/code/blu/ begin \
    --vosk-model-dir=/home/user/code/blu/ \
    --punctuate-from-previous-timeout=1 \
    --full-sentence \
    --numbers-as-digits

It mostly works… but not awesomely. I tried this Google Docs solution and it's nonsense too… All in all, I went back to typing on the keyboard for now.
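If it helps anyone reproduce this, here is roughly how I'd expect the setup to look. The install path, the clone location, and the choice of Vosk model are my own assumptions, not what was used in the command above:

```shell
# Sketch of a nerd-dictation setup inside an AppVM (assumes a template
# with python3 and git available; paths and model are examples only)
python3 -m venv ~/.virtualenvs/nerd-dictation
~/.virtualenvs/nerd-dictation/bin/pip install vosk
git clone https://github.com/ideasman42/nerd-dictation.git ~/nerd-dictation

# Grab a small English Vosk model; larger models transcribe better
wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
```

Dictation is then started with `nerd-dictation begin --vosk-model-dir=...` and stopped with `nerd-dictation end`.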

Qubes-wise, you just have to attach your microphone to the respective qube; there's a button for that in the window manager's bar.
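For the command-line equivalent, run this in dom0 (the qube name `work` is just an example):

```shell
# List available microphone devices
qvm-device mic list

# Attach the microphone to the qube running the speech-to-text tool
qvm-device mic attach work dom0:mic

# Detach it again when you're done dictating
qvm-device mic detach work dom0:mic
```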


I am also looking at GitHub - inboxpraveen/LLM-Minutes-of-Meeting, a tool that transforms audio or video files into text transcripts and generates concise meeting minutes.
It looks like some or all of these need to be compiled. I’m not sure of the best way to do this in Qubes. I’m going to make a thread asking about it.

I am not sure if I am going to try that one or use one that’s based on LLM technology. I really want to reduce the likelihood of something being transcribed incorrectly.

One more thing for this: you can take the rough text from FOSS speech-to-text and pipe it to ollama. It will fix the sentences, punctuation, and so on for you. That way you get text that is usable and publishable. But you have to re-read it to make sure ollama doesn't invent too much.
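As a sketch, assuming ollama is installed and a model such as llama3 has already been pulled (both assumptions on my part; `raw.txt` is a placeholder for whatever your STT tool produced):

```shell
# Clean up raw dictation output with a local LLM.
# The prompt asks for minimal changes to reduce the risk of
# the model inventing content that wasn't in the dictation.
ollama run llama3 "Fix the grammar and punctuation of the following text, \
changing as little as possible: $(cat raw.txt)" > cleaned.txt
```

Since everything runs in the qube, nothing leaves the machine, which fits the original goal of keeping dictation local.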

This is a wonderful idea!

Do you still think it’s better to use nerd-dictation than the other options listed?

If you’ve already run tests and nerd-dictation came out best, it could save me some time exploring different models.