
🎤 Speech-to-Text + LLMs

Jan 12, 2024

There is no better way to share information on the web than with plain text. The pinnacle of form: no video, no audio, no infographics or photos. Plain text is all you really need, and there is no better way to approach it than Markdown.

And how do you work with plain text? Mainly you write it, which, as you may already know, is not the most efficient way. In fact, it’s really slow for almost everyone, whether on paper or on a computer.

And apparently I’ve got a lot of writing to do, and I’ve been doing it for quite a while now. Not that I have any trouble writing; in fact, I would describe myself as a pretty good typist. But if there is a better way, why not use it?

Speaking is, in my case, ~4x faster than writing, but there’s a big problem: I’m not as good at speaking and thinking at the same time as I am at writing. Mostly it comes down to the speed difference. While talking on the fly, you really have no time to construct sentences the way you would like.

Reading a text isn’t the same as having a conversation with someone; conveying information is much more complex in the second case. Bare words on a screen carry much less context than a speaker standing in front of us. It’s not just about dialect, phonetics or articulation, it’s about how we interact with another living human being in a face-to-face situation.

Text as a form carries no emotion of its own. We get only the information, the bare words, and any emotional connotation has to come from the way those words are arranged.

A raw transcript of what we say can be quite a problem for the reader. It often reads like a drunk person babbling, with no structure or thought behind it, just words spat out in a random order.

Covering complex topics on the fly isn’t easy either. It works better as a conversation or a live performance on stage than as talking to your computer.

dsnote


App for note taking, reading and translating with offline Speech to Text, Text to Speech and Machine Translation

https://github.com/mkiol/dsnote

Installation

For now it’s Linux exclusive. It comes only as a Flatpak or as a build from source; choose the Flatpak version and install the GPU acceleration add-on. After installing, make sure to go into the settings and enable GPU acceleration for both speech-to-text and text-to-speech. It makes things a lot faster.

Speech-to-Text

Models

  • Faster Whisper – Faster-Whisper-Large-v3; if the speed isn’t satisfying, try one of the smaller models.
  • Vosk – often recommended, but personally I didn’t get results as good with it.

Using

That’s pretty much it: just click the Listen button and speak into the mic. After some time you will notice that speaking clearly is not even necessary.
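dsnote does all of this through its GUI, so no code is needed. For recordings made elsewhere, here is a rough sketch of running the same family of models from Python. It assumes the separate faster-whisper package is installed (it is not part of dsnote) and that a GPU is available; the function names and the file path are illustrative only.

```python
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Join transcribed segments (objects with a .text attribute) into one string."""
    return " ".join(seg.text.strip() for seg in segments if seg.text.strip())


def transcribe(path: str, model_size: str = "large-v3") -> str:
    """Transcribe an audio file; `path` is a hypothetical recording of your own."""
    # Imported lazily so join_segments works without the package installed.
    from faster_whisper import WhisperModel

    # Assumption: a working CUDA setup. Use device="cpu" and
    # compute_type="int8" if there is no GPU, or a smaller model_size.
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(path)
    return join_segments(segments)
```

Calling `transcribe("note.wav")` would return the raw wall of text, which is the same kind of output you get from the Listen button.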

Text-to-Speech

Models

  • Coqui TTS – the only one I’ve been able to run at satisfying speeds. It requires creating a speech model first, so you need to provide a speech sample for it to copy.

Garbage script


We are using speech to generate the text, so obviously there is no script written beforehand for the recording, but we can make a bare skeleton of the things we need to talk about.

Interpretation


If you have tried speaking into the mic, you may have discovered that you are not as good a speaker as you think you are, and that the transcription of your speech is awful, like a drunk person babbling. It’s just the way we really speak, and it does work during a conversation, but presenting it in the form of text creates some problems.

I usually hate using LLMs for anything that needs to be informative, but this time it’s just too good an option. Since we have already injected all the logical information we need, we can just use ChatGPT to structure and organize this wall of text. It has to be guided enough to spit out something usable, but creating a prompt for it is getting harder and harder with every update, so I hope you will be able to use this.
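As a rough sketch of the idea (not the exact prompt used for this post), the skeleton and the raw transcript can be glued into a single prompt like this. The function name and the wording of the instructions are assumptions; adjust them to whatever your model responds to best.

```python
def build_prompt(outline: list[str], transcript: str) -> str:
    """Assemble a restructuring prompt from a topic skeleton and a raw transcript."""
    topics = "\n".join(f"- {item}" for item in outline)
    return (
        "Rewrite the raw speech transcript below as a structured Markdown article.\n"
        "Keep every fact and opinion from the transcript; do not add new information.\n"
        "Remove filler words, repetitions and false starts.\n"
        "Use the outline as the section order.\n\n"
        f"Outline:\n{topics}\n\n"
        f"Transcript:\n{transcript.strip()}\n"
    )
```

The "do not add new information" line is the important part: the model is only supposed to organize what was said, not to fill the text with its own knowledge.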

Of course, some manual correction will be necessary, depending on how good you are at prompt engineering and how good the input text was, but at the end of the day you might even get something usable. If you have ever done some high-quality writing, you know how long it takes. This technique can simplify some of the steps, and if you can speak on the fly, it takes away the whole hassle of typing. And it’s not the simple AI SEO rubbish that fills the internet, the kind you can generate with zero input, based only on the knowledge of the language model.

Why not fill it with good text from the internet? Well, why write something that already exists? This technique is for creating something new yourself, based on information developed in your own brain.


