My personal speech-to-text workflow
Using GPT and Whisper APIs to quickly save and retrieve ideas
This is a description of the personal speech-to-text workflow I've been using for a few weeks now, and it is working remarkably well for my use cases. It lets me quickly dictate ideas or responses, transcribe them, and format the result appropriately.
Overall, there are four steps. The first is to generate context: an initial prompt that helps the transcription come out correct. This prompt is generated using GPT 3.5; it's a very simple use case, so GPT 3.5 works well while also being inexpensive. The second step is the actual transcription. The third step is post-processing the transcription by giving it additional context about what was transcribed. This step exists mainly because, in the first step, the Whisper model accepts only the first 244 characters as the initial prompt; if Whisper starts accepting a larger context window in the future, I should be able to eliminate the post-processing step. The final step is formatting. It could be combined with the third step, but to keep things clear I keep it separate: format the text as an email (which might include a salutation and signature), as a Slack message (usually pretty informal), or file it into my notes.
Step 1: Pre-process (Generate Context)
Anytime I want to transcribe something, I either select some text and copy it to my clipboard or type it out when I invoke the transcription script, and that becomes the initial context. So if I'm writing an email, I copy the previous message I need to reply to, and that becomes the input for this step. I then transform this initial text with the following prompt to create the final input for the Whisper model.
Given the input transcript, identify and summarize all crucial points, ensuring the inclusion of relevant short forms and acronyms. Present the summary concisely, limiting the output to 244 characters. Focus on capturing the essence and context accurately. The output does NOT have to be grammatically correct. If there isn't enough information then produce empty output. Here is the input:
<user_input>
The <user_input> here is replaced by the actual input that I give to my script. The reason I'm using a templating variable is so that, in the future, I can also add things after the user input. One of the hacks that has emerged when prompting LLMs is that they seem to give more weight to the most recent text in the prompt.
One of the main benefits of this step is that the prompt can spell out acronyms and short names for things. When it does, the Whisper model is able to identify them accurately instead of outputting gibberish.
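The script isn't reproduced here, but a minimal sketch of this step, using the openai Python package (v1-style client), might look like the following; generate_context and the exact call structure are illustrative rather than the exact code in my script.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CONTEXT_PROMPT = (
    "Given the input transcript, identify and summarize all crucial points, "
    "ensuring the inclusion of relevant short forms and acronyms. Present the "
    "summary concisely, limiting the output to 244 characters. Focus on "
    "capturing the essence and context accurately. The output does NOT have to "
    "be grammatically correct. If there isn't enough information then produce "
    "empty output. Here is the input:\n<user_input>"
)

def generate_context(user_input: str) -> str:
    # Substitute the templating variable with the copied/typed text.
    prompt = CONTEXT_PROMPT.replace("<user_input>", user_input)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()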
Step 2: Transcribe
Now, this is the step where the actual transcription happens. I record audio, save it as a WAV file, remove the silences in between using pydub, and then send it to the Whisper model along with the output prompt from the previous step, which gives me back the transcription. Most of the code for this is the following:
import numpy as np
from datetime import datetime
from scipy.io.wavfile import write
from pydub import AudioSegment

# `recording` is the list of audio chunks captured earlier; `fs` is the sample rate.
myrecording = np.concatenate(recording, axis=0)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

print("Recording stopped. Saving recording as wav")
wav_filename = f"./out/{timestamp}.wav"
write(wav_filename, fs, myrecording)

# Remove silence from the audio
no_silence_filename = f"./out/{timestamp}_no_silence.wav"
remove_silence(wav_filename, no_silence_filename)

# Convert the WAV file to MP3
print("Converting to mp3")
mp3_filename = f"./out/{timestamp}.mp3"
audio = AudioSegment.from_wav(no_silence_filename)
audio.export(mp3_filename, format="mp3")
The remove_silence function is simple and is shown below. The main things I had to tweak quite a bit are the parameters of the split_on_silence function, particularly the silence_thresh value. Initially it was set to around -40, and I found it was truncating audio even while I was talking, just because I wasn't speaking loudly enough. Ideally, I should normalize the audio so that even if I'm not speaking loudly, my voice is amplified before the silence is removed. But it is working for now, so I'm keeping it as is. The keep_silence parameter is also interesting. I noticed that the Whisper model wasn't correctly adding commas or periods to the transcription, and I realized it might be because keep_silence was zero, which means that as soon as silence is detected the audio is cut right there instead of leaving a little bit of a pause. With a keep_silence value of 100 milliseconds, the Whisper model now seems to predict the commas and periods in the final transcription, and it works much better than before.
from pydub import AudioSegment, silence

def remove_silence(in_filename, out_filename):
    """
    Removes silence from the audio file.

    Args:
        in_filename (str): The path to the input audio file.
        out_filename (str): The path to save the output audio file.

    Returns:
        None
    """
    print("Removing silence")
    audio = AudioSegment.from_wav(in_filename)
    # silence_thresh: a lower (more negative) value means less audio is treated
    # as silence, so less is removed and the output file is larger.
    chunks = silence.split_on_silence(
        audio, min_silence_len=500, silence_thresh=-60, keep_silence=100)
    audio_without_silence = sum(chunks)
    audio_without_silence.export(out_filename, format="wav")
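The Whisper call itself isn't shown in the snippet above; a minimal sketch, using the same v1-style client as the earlier sketch, might look roughly like this (variable names are illustrative), passing the context from step 1 as the initial prompt:

with open(mp3_filename, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt=initial_context,  # output of step 1
    )
print(transcription.text)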
Step 3: Post-process (Fix grammar, etc)
Now that we have the transcription from the Whisper model, there are a bunch of things we can do. Some of the changes I like to make in this step are fixing grammar and language. At times I dictate variable names that refer to code, and in those cases my prompt formats those variables as markdown. A lot of times, when I'm thinking and dictating at the same time, I tend to repeat things, and that is also something I ask GPT 3.5 to fix. The main idea in this step is to make the transcription as accurate as possible so that it can later be transformed into different formats, which is handled by the next step.
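As noted in the appendix, this step is still manual today. As a sketch, an automated version could reuse the same GPT 3.5 call with a post-processing prompt; the prompt wording and helper name below are illustrative:

POSTPROCESS_PROMPT = (
    "Fix the grammar and language of the following transcription, remove "
    "repeated phrases, and format any code or variable names as markdown. "
    "Do not change the meaning. Here is the transcription:\n<user_input>"
)

def post_process(transcription: str) -> str:
    prompt = POSTPROCESS_PROMPT.replace("<user_input>", transcription)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content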
Step 4: Format
This is the step I am just starting to dive into, and it's pretty nascent at this point. The idea is to give the transformation function context about the medium of communication and/or storage. A lot of the time it's just fixing tiny nuances and adding extra data appropriate for different formats. If it's an email, there are additional salutation and signature blocks with a formal style of writing. If it's a Slack message, it's pretty informal and almost verbatim. If it's a note, I like to prefix it with the date and time of the transcription and save it in Notion or Obsidian. For a tweet, you could add hashtags, truncate it to the character limit, or perhaps add a link. Overall, this is a step I'm still playing with, and I still need to figure out the right prompting for different formats while making little or no change to the content.
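This step is also manual today, but one possible shape for it is a small mapping from medium to instructions, again reusing the same GPT 3.5 call; everything below is illustrative rather than what's in the script:

FORMAT_INSTRUCTIONS = {
    "email": "Format as a formal email with a salutation and a signature block.",
    "slack": "Keep it informal and almost verbatim, as a Slack message.",
    "note": "Prefix with the current date and time; keep it as a plain note.",
    "tweet": "Add relevant hashtags and keep it under the tweet character limit.",
}

def format_for(medium: str, text: str) -> str:
    prompt = (
        f"{FORMAT_INSTRUCTIONS[medium]} Make little or no change to the "
        f"content itself. Here is the text:\n{text}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content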
Telemetry
I'm tracking a few metrics as part of using this workflow. I am, of course, saving the output transcriptions and the raw audio files as well as the converted ones. I'm also keeping track of the latency of requests made to OpenAI, both for the GPT model and for Whisper. These are just some basic metrics that help me keep an eye on how things are going.
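The latency tracking is nothing fancy; a minimal sketch (the helper name and log path are illustrative) just times each API call:

import json
import time

def timed_call(label, fn, *args, **kwargs):
    # Wraps an OpenAI call and appends its latency to a local log file.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    with open("./out/latency_log.jsonl", "a") as f:
        f.write(json.dumps({"call": label, "seconds": round(elapsed, 3)}) + "\n")
    return result

Something like timed_call("whisper", client.audio.transcriptions.create, model="whisper-1", file=audio_file) then works for any of the calls above.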
Summary
Overall, I'm finding it pretty addictive to be able to dictate what end up becoming large chunks of notes instead of having to write the whole thing or find the time to. I'm also enjoying the fact that I can talk through a lot of ideas very quickly, save them, and make them searchable. Previously, whenever I had these ideas, I would use my Apple Watch to create a voice memo. I think Apple's voice memo app does allow you to transcribe things, but I don't think it was ever that accurate. With this new workflow, I can just record stuff, and because it converts it pretty accurately into text and automatically saves it in my notes, all those ideas immediately become searchable. This workflow has two really great features:
I can immediately speak my mind and record whatever thoughts I have.
It's immediately searchable in textual format.
I'm finding that I often do big brain dumps using this workflow. Once I have the transcription, I go back to it over the next couple of days and add things to it. Sometimes these are screenshots, notes, photographs, or whatever, but this has allowed me to save a lot of my ideas in what some people call a second brain and retrieve them efficiently when needed.
Next steps
There are some optimizations I want to make that I've talked about earlier here, but overall the system works for me. One thing that would be helpful is being able to give the context to the transcription from whichever app I'm using. For example, if I'm in Slack, I want to be able to use this workflow from within Slack, or if I'm in Gmail, I want to use it from Gmail itself without having to copy text, open my terminal, invoke the script, dictate my response, and then copy the output and paste it back into that app.
On the cost front, I'm using this pretty regularly - several times a day - and so far it has cost me <$1/week. So pretty cost effective!
Appendix
This is a screenshot of the script in action. I added a paragraph to one of the sections above. Steps 3 and 4 are currently being done manually and I’m yet to incorporate them in my script.