Video/audio to text transcription tool

tl;dr: I wrote a small script to transcribe audio into a text file, all processed locally!

Do you spend a ton of time on YouTube? Me too! It’s an amazing resource to learn all sorts of things, buuuuuut…

YouTube videos are often full of commercials and sponsorship messages;
They are not always organized in chapters;
They are not searchable: I love long form videos but I can’t watch them to find out if a tutorial has what I’m looking for.

What would solve all that? For me, it’s text! I looked around to see if I could find a tool to transcribe videos into text locally. No luck! So I decided to try and make one with Python and a speech-to-text recognition tool by OpenAI called Whisper:

It’s got a great recognition rate for the English language (and supposedly other languages),
It runs locally on a relatively modern laptop;
It’s well documented!

OpenAI’s diagram of how Whisper works. Not sure what most of this means but it’s very easy to use!

I started out by drafting the general flow of the script. It would go something like this:

Split a source audio file into chunks using FFmpeg and Pydub, based on relative sound level dips and their duration;
Process each chunk with Whisper to transcribe it;
Finally, save the transcription in a text file.

How did that workflow work? Pretty well! Except the audio chunk files in the folder were processed in the wrong order:

Audio chunk files	The worst processing order ever
Chunk1.wav Chunk2.wav Chunk3.wav (…) Chunk10.wav Chunk11.wav Chunk12.wav	Chunk1.wav Chunk10.wav Chunk11.wav Chunk12.wav Chunk2.wav Chunk3.wav (…)

Expectations VS reality

Changing filename structures didn’t help… So I appended the name of each chunk to a list as it was created. The audio transcription function would then work its way through the list in order rather than the actual folder… and it worked!

I was also really struggling to pass the source audio file and the target folder to the script…Until I found out about argparse! It let me feed source files and target folders and files as arguments and parameters with argparse’s Parser function. To think it’s one of Python’s built-in modules and I never checked it out before 🙁

The list of Python’s built-in functions in the official documentation. Read’em!

So that’s great and all, but is the script fast enough that it’s worth using? There’s little point transcribing a 17 minutes video if it takes an hour…See for yourself, with a small selection of videos I processed on my laptop!

Title	Running time	Time to transcript
I can save you money – Raspberry Pi Alternatives	15:03	2 mins 13 secs
Modeling realistic sci-fi helmet with Rachel Frick part 1	16:44	1min 20 secs
Fundamental mentorship concepts	33:57	56 secs

I don’t know why Linus’ Raspberry Pi alternatives video took longer than everybody else. Is it because you talk a lot, Linus? Anyway, here we are!

If you want to give this script it a go, find it on the repository! You will need to install the following in addition to Python:

FFMpeg
The Pydub Python library
Your Whisper model of choice (I went with the “base” model)

Once you have all your dependencies installed, just run

python "audio to text converter.py" -h

The script will coach you through everything it needs and tell you what it’s up to. That’s it – good luck!