What is Whisper?
Whisper is an open-source automatic speech recognition (ASR) system developed by OpenAI and trained on 680,000 hours of multilingual and multitask supervised data. It can transcribe audio in multiple languages.
It has many real-world applications, such as:
- Video Subtitling: Generating subtitles with the ability to translate into multiple languages.
- Personal Assistants: Transcribing meetings, interviews, or voice notes.
The best part is that it is not complicated to get started. Here is a step-by-step guide to help you take your first steps.
How to use Whisper?
Depending on your technical resources and the level of privacy you need, you can choose one of these three paths:
Google Colab
You can use a Google Colab notebook to run the code without installing anything on your PC, taking advantage of Google's free GPUs.
Local Installation
You can install Whisper directly on your PC. For example, on a Ryzen 5 5600G with 16GB of RAM, the base model performs very well. If you have a dedicated graphics card (NVIDIA), Whisper will run much faster.
OpenAI API
If you want to integrate Whisper into an application or don't want to manage servers, the API is the solution. You pay per minute of audio, which is very cost-effective for most workloads.
A quick Python example:
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY_HERE")

audio_file_path = "file.mp3"

with open(audio_file_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text"  # "json" or "vtt" for subtitles
    )

print(transcription)
Installing on Linux and macOS
To install Whisper, you need to have Python installed. Run the following command in your terminal:
pip install -U openai-whisper
You also need to install ffmpeg, a multimedia processing tool that Whisper uses to read audio and video files.
On Ubuntu or Debian:
sudo apt update && sudo apt install ffmpeg
On macOS with Homebrew:
brew install ffmpeg
Basic Usage
Once installed, you will have access to the whisper command from the terminal. To process a file, use the following command:
whisper file.mp4 --language English --model base
Main parameters:
- --language: Sets the original language of the audio to improve accuracy.
- --model: Selects the model size based on your hardware and accuracy needs. Available models are: tiny, base, small, medium, and large.
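As a rough guide to choosing --model, the official repository lists approximate VRAM requirements for each size (about 1 GB for tiny and base, 2 GB for small, 5 GB for medium, and 10 GB for large). The helper below is only an illustrative sketch, not part of Whisper itself:

```python
# Approximate VRAM needed per model, in GB, as listed in the
# official Whisper README (actual usage varies by setup).
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

MODEL_ORDER = ["tiny", "base", "small", "medium", "large"]

def pick_model(available_gb: float) -> str:
    """Return the largest model that fits the given memory budget.

    Hypothetical helper for illustration only.
    """
    best = "tiny"
    for name in MODEL_ORDER:
        if VRAM_GB[name] <= available_gb:
            best = name
    return best

print(pick_model(6))   # a ~6 GB GPU can run up to "medium"
print(pick_model(16))  # 16 GB comfortably fits "large"
```

On a CPU-only machine like the Ryzen 5 5600G mentioned above, the same logic applies to system RAM, which is why the base model is a sensible default there.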
You can find more technical information and the source code in the official repository: https://github.com/openai/whisper
Example
Generating a transcription of a video with Whisper.
When you run the command, Whisper generates output files in the following formats:
- .txt: Plain text only. No timestamps or extras. Ideal for notes or articles.
- .srt: The universal subtitle standard. Compatible with YouTube and video players.
- .vtt: Similar to SRT, but optimized for web players (HTML5).
- .json: Contains everything (timestamps, confidence, metadata). Ideal for developers.
- .tsv: Tab-separated values. Perfect for opening in Excel or Google Sheets.
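The .json output includes a list of segments, each with "start", "end", and "text" fields. As a sketch of what a developer can do with that structure, here is a small function that turns such segments into SRT subtitles (the segment layout matches what Whisper emits, but the helper functions themselves are hypothetical):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Build an SRT document from Whisper-style segments
    (dicts with "start", "end", and "text" keys)."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> "
            f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Example segments in the shape Whisper's JSON output uses.
segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 5.0, "text": " Let's get started."},
]
print(segments_to_srt(segments))
```

In practice you rarely need this, since Whisper already writes an .srt file for you, but it shows how the JSON output can feed custom tooling.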