Video/audio to text transcription tool

tl;dr: I wrote a small script to transcribe audio into a text file, all processed locally!

Do you spend a ton of time on YouTube? Me too! It’s an amazing resource to learn all sorts of things, buuuuuut…

  • YouTube videos are often full of commercials and sponsorship messages;
  • They are not always organized in chapters;
  • They are not searchable: I love long form videos but I can’t watch them to find out if a tutorial has what I’m looking for.

What would solve all that? For me, it’s text! I looked around to see if I could find a tool to transcribe videos into text locally. No luck! So I decided to try and make one with Python and a speech-to-text recognition tool by OpenAI called Whisper:

  • It’s got a great recognition rate for the English language (and supposedly other languages),
  • It runs locally on a relatively modern laptop;
  • It’s well documented!
OpenAI’s diagram of how Whisper works. Not sure what most of this means but it’s very easy to use!

I started out by drafting the general flow of the script. It would go something like this:

  1. Split a source audio file into chunks using FFmpeg and Pydub, based on relative sound level dips and their duration;
  2. Process each chunk with Whisper to transcribe it;
  3. Finally, save the transcription in a text file.

How did that workflow work? Pretty well! Except the audio chunk files in the folder were processed in the wrong order:

Audio chunk filesThe worst processing order ever
Chunk1.wav
Chunk2.wav
Chunk3.wav
(…)
Chunk10.wav
Chunk11.wav
Chunk12.wav
Chunk1.wav
Chunk10.wav
Chunk11.wav
Chunk12.wav
Chunk2.wav
Chunk3.wav
(…)
Expectations VS reality

Changing filename structures didn’t help… So I appended the name of each chunk to a list as it was created. The audio transcription function would then work its way through the list in order rather than the actual folder… and it worked!

I was also really struggling to pass the source audio file and the target folder to the script…Until I found out about argparse! It let me feed source files and target folders and files as arguments and parameters with argparse’s Parser function. To think it’s one of Python’s built-in modules and I never checked it out before 🙁

The list of Python’s built-in functions in the official documentation. Read’em!

So that’s great and all, but is the script fast enough that it’s worth using? There’s little point transcribing a 17 minutes video if it takes an hour…See for yourself, with a small selection of videos I processed on my laptop!

TitleRunning timeTime to transcript
I can save you money – Raspberry Pi Alternatives15:032 mins 13 secs
Modeling realistic sci-fi helmet with Rachel Frick part 116:441min 20 secs
Fundamental mentorship concepts33:5756 secs

I don’t know why Linus’ Raspberry Pi alternatives video took longer than everybody else. Is it because you talk a lot, Linus? Anyway, here we are!

If you want to give this script it a go, find it on the repository! You will need to install the following in addition to Python:

Once you have all your dependencies installed, just run

python "audio to text converter.py" -h

The script will coach you through everything it needs and tell you what it’s up to. That’s it – good luck!