What is Whisper?
Whisper is an open-source automatic speech recognition (ASR) system developed by OpenAI and trained on 680,000 hours of multilingual and multitask supervised data. It can transcribe audio in multiple languages.
It has many real-world applications, such as:
- Video Subtitling: Generating subtitles with the ability to translate into multiple languages.
- Personal Assistants: Transcribing meetings, interviews, or voice notes.
The best part is that it is not complicated to get started. Here is a step-by-step guide to help you take your first steps.
How to use Whisper?
Depending on your technical resources and the level of privacy you need, you can choose one of these three paths:
Google Colab
You can use a Google Colab notebook to run the code without installing anything on your PC, taking advantage of Google's free GPUs.
Local Installation
You can install Whisper directly on your PC. For example, on a Ryzen 5 5600G with 16GB of RAM, the base model performs very well. If you have a dedicated graphics card (NVIDIA), Whisper will run much faster.
OpenAI API
If you want to integrate Whisper into an application or don't want to manage servers, the API is the solution. You pay per minute of audio, which is very cost-effective for most workloads.
A quick Python example:
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY_HERE")

audio_file_path = "file.mp3"

with open(audio_file_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text"  # "json" or "vtt" for subtitles
    )

print(transcription)
Installing on Linux and macOS
To install Whisper, you need to have Python installed. Run the following command in your terminal:
pip install -U openai-whisper
You also need to install ffmpeg, a multimedia processing tool that Whisper uses to read audio and video files.
On Ubuntu or Debian:
sudo apt update && sudo apt install ffmpeg
On macOS with Homebrew:
brew install ffmpeg
Basic Usage
Once installed, you will have access to the whisper command from the terminal. To process a file, use the following command:
whisper file.mp4 --language English --model base
Main parameters:
- --language: Sets the original language of the audio to improve accuracy.
- --model: Selects the model size based on your hardware and accuracy needs. Available models are: tiny, base, small, medium, and large.
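As a rough guide to choosing --model, the official repository lists approximate VRAM requirements for each size (about 1 GB for tiny and base, 2 GB for small, 5 GB for medium, and 10 GB for large). The helper below is only an illustrative sketch, not part of Whisper itself:

```python
# Approximate VRAM needed per model, in GB, as listed in the
# official Whisper README (actual usage varies by setup).
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

MODEL_ORDER = ["tiny", "base", "small", "medium", "large"]

def pick_model(available_gb: float) -> str:
    """Return the largest model that fits the given memory budget.

    Hypothetical helper for illustration only.
    """
    best = "tiny"
    for name in MODEL_ORDER:
        if VRAM_GB[name] <= available_gb:
            best = name
    return best

print(pick_model(6))   # a ~6 GB GPU can run up to "medium"
print(pick_model(16))  # 16 GB comfortably fits "large"
```

On a CPU-only machine like the Ryzen 5 5600G mentioned above, the same logic applies to system RAM, which is why the base model is a sensible default there.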
You can find more technical information and the source code in the official repository: https://github.com/openai/whisper
Example
Generating a transcription of a video with Whisper.
When you run the command, Whisper generates output files in the following formats:
- .txt: Plain text only. No timestamps or extras. Ideal for notes or articles.
- .srt: The universal subtitle standard. Compatible with YouTube and video players.
- .vtt: Similar to SRT, but optimized for web players (HTML5).
- .json: Contains everything (timestamps, confidence, metadata). Ideal for developers.
- .tsv: Tab-separated values. Perfect for opening in Excel or Google Sheets.
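The .json output includes a list of segments, each with "start", "end", and "text" fields. As a sketch of what a developer can do with that structure, here is a small function that turns such segments into SRT subtitles (the segment layout matches what Whisper emits, but the helper functions themselves are hypothetical):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Build an SRT document from Whisper-style segments
    (dicts with "start", "end", and "text" keys)."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> "
            f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Example segments in the shape Whisper's JSON output uses.
segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 5.0, "text": " Let's get started."},
]
print(segments_to_srt(segments))
```

In practice you rarely need this, since Whisper already writes an .srt file for you, but it shows how the JSON output can feed custom tooling.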