Introduction
OpenAI is a leading player in the field of Artificial Intelligence and has made many AI models, including GPT and CLIP, accessible to the community.
Open-sourced by OpenAI, the Whisper models are considered to have approached human-level robustness and accuracy in English speech recognition.
This article walks you through all the steps to transform long pieces of audio into textual information with OpenAI’s Whisper, using its official Python library.
By the end of this article, you will be able to transcribe both English and non-English audio into text, and even translate non-English speech into English.
OpenAI’s Whisper – What is it?
Whisper models were developed to study the capabilities of speech-processing systems for speech recognition and translation tasks. They can transcribe speech audio into text.
The models were trained on 680,000 hours of labeled audio data, which the authors report is one of the largest datasets ever created for supervised speech recognition. The authors also evaluated how performance scales with data by training a series of medium-sized models on subsampled versions of the data corresponding to 0.5%, 1%, 2%, 4%, and 8% of the full dataset size, as shown below.

Step-by-step implementation
This section covers all the steps from installing and importing the relevant modules to implementing the audio transcription and translation cases.
Installation and initializations
To begin, you need Python installed on your computer along with the Whisper library. The latest version can be installed directly from the GitHub repository using the Python package manager pip, as follows:
!pip install git+https://github.com/openai/whisper.git
Now, we need to install the ffmpeg tool, which Whisper uses for audio and video processing. The installation process differs depending on your operating system.
Since I am using a Mac, here is the corresponding command:
# Installation for MAC
brew install ffmpeg
Please refer to the code snippet matching your operating system:
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
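Whatever your platform, you can confirm that the installation succeeded by printing the ffmpeg version:
# Verify the installation
ffmpeg -version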
What if you don’t want to bother with all of these configurations?
→ Google Colab can save your life in such a situation, and it also provides a free GPU that you can access as follows:

Using the nvidia-smi command, we can get information about the GPU allocated to us; here is mine:
!nvidia-smi
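You can also check GPU availability directly from Python with PyTorch; here is a minimal sketch:
# Check GPU availability from Python
import torch
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))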

Once you have everything installed, you can import the modules and load the model. In our case, we will use the large model, which has 1,550M parameters and requires ~10 GB of VRAM. Processing will be noticeably faster on a GPU than on a CPU.
# Import the libraries
import whisper
import torch
import os
# Initialize the device
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the model
whisper_model = whisper.load_model("large", device=device)
- In the load_model() function, we use the device initialized on the previous line. By default, newly created tensors are placed on the CPU unless specified otherwise.
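If your machine does not have enough VRAM for the large model, a smaller checkpoint is an easy fallback. The sketch below lists the available checkpoints and loads a smaller one; accuracy drops with model size, but so do memory and compute requirements:
# List the available model sizes
print(whisper.available_models())
# Load a smaller checkpoint when GPU memory is limited
whisper_model = whisper.load_model("base", device=device)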
Now is the time to start extracting audio files…
Audio Transcription
This section illustrates the strengths of Whisper for transcribing audio in different languages.
The general workflow in this section is the following.

The first two steps are performed with the helper function below. But before that, we need to install the [pytube](https://pytube.io/en/latest/) library using the following pip statement so that we can download the audio from YouTube.
# Install the module
!pip install pytube
# Import the module
from pytube import YouTube
Then, we can implement the helper function as follows:
def video_to_audio(video_URL, destination, final_filename):
    # Get the video
    video = YouTube(video_URL)
    # Keep only the audio stream
    audio = video.streams.filter(only_audio=True).first()
    # Save the audio to the destination folder
    output = audio.download(output_path=destination)
    # Rename the downloaded file with the final name
    new_file = os.path.join(destination, final_filename + '.mp3')
    os.rename(output, new_file)
The function takes three parameters:
- video_URL: the full URL of the YouTube video.
- destination: the folder where the final audio is saved.
- final_filename: the name to give to the final audio file.
Finally, we can use the function to download the video and convert it into audio.
English transcription
The video used here is a 30-second motivational speech on YouTube from Motivation Quickie. Only the first 17 seconds correspond to the actual speech; the rest is noise.
# Video to Audio
video_URL = 'https://www.youtube.com/watch?v=E9lAeMz1DaM'
destination = "."
final_filename = "motivational_speech"
video_to_audio(video_URL, destination, final_filename)
# Audio to text
audio_file = "motivational_speech.mp3"
result = whisper_model.transcribe(audio_file)
# Print the final result
print(result["text"])
- video_URL is the link to the motivational speech.
- destination is my current folder, corresponding to ".".
- motivational_speech will be the final name of the audio.
- whisper_model.transcribe(audio_file) applies the model to the audio file to generate the transcription.
- The transcribe() function processes the audio with a sliding 30-second window and performs autoregressive sequence-to-sequence predictions on each window (a lower-level sketch of this pipeline is shown a bit further below).
- Finally, the print() statement generates the following result.
I don't know what that dream is that you have.
I don't care how disappointing it might have been as you've
been working toward that dream.
But that dream that you're holding in your mind that it's possible.
Below is the corresponding video you can play to check the previous output.
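If you want more control over that sliding-window pipeline, the whisper library also exposes the lower-level steps. The following minimal sketch, adapted from the library’s README, pads or trims the audio to a single 30-second window, computes a log-Mel spectrogram, detects the spoken language, and decodes that window:
# Load the audio and pad/trim it to a single 30-second window
audio = whisper.load_audio("motivational_speech.mp3")
audio = whisper.pad_or_trim(audio)
# Compute the log-Mel spectrogram on the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(whisper_model.device)
# Detect the spoken language
_, probs = whisper_model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
# Decode the window into text (use FP16 only on GPU)
options = whisper.DecodingOptions(fp16=(device == "cuda"))
result = whisper.decode(whisper_model, mel, options)
print(result.text)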
Non-English transcription
In addition to English, Whisper can also deal with non-English languages. Let’s have a look at Alassane Dramane Ouattara’s interview on YouTube.
Similarly to the previous approach, we download the video, convert it to audio, and transcribe the content.
URL = "https://www.youtube.com/watch?v=D8ztTzHHqiE"
destination = "."
final_filename = "discours_ADO"
video_to_audio(URL, destination, final_filename)
# Run the test
audio_file = "discours_ADO.mp3"
result_ADO = whisper_model.transcribe(audio_file)
# Show the result
print(result_ADO["text"])
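Note that transcribe() auto-detects the spoken language from the first 30 seconds of audio. If you already know the language, you can pass it explicitly to skip detection and avoid potential misidentification, for example:
# Pin the language instead of relying on auto-detection
result_ADO = whisper_model.transcribe(audio_file, language="fr")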
→ Video discussion:
→ Model result from the print() statement.
Below is the final result, and it is mind-blowing 🤯. The only misrecognized term is "Franc CFA", which the model transcribed as "Front CFA" 😀.
Le Front CFA, vous l'avez toujours défendu, bec et ongle, est-ce que vous
continuez à le faire ou est-ce que vous pensez qu'il faut peut-être changer
les choses sans rentrer trop dans les tailles techniques? Monsieur Perelman,
je vous dirais tout simplement qu'il y a vraiment du n'importe quoi dans ce
débat. Moi, je ne veux pas manquer de modestie, mais j'ai été directeur des
études de la Banque Centrale, j'ai été vice-gouverneur, j'ai été gouverneur
de la Banque Centrale, donc je peux vous dire que je sais de quoi je parle.
Le Front CFA, c'est notre monnaie, c'est la monnaie des pays membres et nous
l'avons acceptée et nous l'avons développée, nous l'avons modifiée. J'étais
là quand la reforme a eu lieu dans les années 1973-1974, alors tout ce débat
est un nonsense. Maintenant, c'est notre monnaie. J'ai quand même eu à
superviser la gestion monétaire et financière de plus de 120 pays dans le
monde quand j'étais au Fonds Monétaire International. Mais je suis bien placé
pour dire que si cette monnaie nous pose problème, écoutez, avec les autres
chefs d'État, nous prendrons les décisions, mais cette monnaie est solide,
elle est appropriée. Les taux de croissance sont parmi les plus élevés sur le
continent africain et même dans le monde. Le Côte d'Ivoire est parmi les dix
pays où le taux de croissance est le plus élevé. Donc c'est un nonsense,
tout simplement, de la démagogie et je ne souhaite même pas continuer ce débat
sur le Front CFA. C'est la monnaie des pays africains qui ont librement
consenti et accepté de se mettre ensemble. Bien sûr, chacun de nous aurait pu
avoir sa monnaie, mais quel serait l'intérêt? Pourquoi les Européens ont
décidé d'avoir une monnaie commune et que nous les Africains ne serons pas en
mesure de le faire? Nous sommes très fiers de cette monnaie, elle marche bien,
s'il y a des adaptations à faire, nous le ferons de manière souveraine.
Non-English transcription into English
In addition to speech recognition, spoken language identification, and voice activity detection, Whisper can also perform speech translation from any supported language into English.
In this last section, we will generate the English transcription of the following French comedy video. The process does not change much from what we have seen above; the major change is the use of the task parameter in the transcribe() function.
URL = "https://www.youtube.com/watch?v=hz5xWgjSUlk"
final_filename = "comic"
video_to_audio(URL, destination, final_filename)
# Run the test
audio_file = "comic.mp3"
french_to_english = whisper_model.transcribe(audio_file, task = 'translate')
# Show the result
print(french_to_english["text"])
task='translate' tells the model to translate the speech into English instead of transcribing it in its original language. Below is the final result.
I was asked to make a speech. I'm going to tell you right away,
ladies and gentlemen, that I'm going to speak without saying anything.
I know, you think that if he has nothing to say, he would better shut up.
It's too easy. It's too easy. Would you like me to do it like all those who
have nothing to say and who keep it for themselves? Well, no, ladies and
gentlemen, when I have nothing to say, I want people to know. I want to make
others enjoy it, and if you, ladies and gentlemen, have nothing to say, well,
we'll talk about it. We'll talk about it, I'm not an enemy of the colloquium.
But tell me, if we talk about nothing, what are we going to talk about? Well,
about nothing. Because nothing is not nothing, the proof is that we can
subtract it. Nothing minus nothing equals less than nothing. So if we can find
less than nothing, it means that nothing is already something. We can buy
something with nothing by multiplying it. Well, once nothing, it's nothing.
Twice nothing, it's not much. But three times nothing, for three times nothing,
we can already buy something. And for cheap! Now, if you multiply three times
nothing by three times nothing, nothing multiplied by nothing equals nothing,
three multiplied by three equals nine, it's nothing new. Well, let's talk
about something else, let's talk about the situation, let's talk about the
situation without specifying which one. If you allow me, I'll briefly go over
the history of the situation, whatever it is. A few months ago, remember,
the situation, not to be worse than today's, was not better either. Already,
we were heading towards the catastrophe and we knew it. We were aware of it,
because we should not believe that the person in charge of yesterday was more
ignorant of the situation than those of today. Besides, they are the same.
Yes, the catastrophe where the pension was for tomorrow, that is to say that
in fact it should be for today, by the way. If my calculations are right,
but what do we see today? That it is still for tomorrow. So I ask you the
question, ladies and gentlemen, is it by always putting the catastrophe that
we could do the day after tomorrow, that we will avoid it? I would like to
point out that if the current government is not capable of taking on the
catastrophe, it is possible that the opposition will take it.
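If you want to keep the output rather than just print it, a simple sketch is to write the transcription to a text file (the file name here is just an example):
# Save the translated transcription to a text file
with open("comic_translation.txt", "w", encoding="utf-8") as f:
    f.write(french_to_english["text"])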
Conclusion
Congratulations 🎉! You have just learned how to perform speech-to-text transcription and applied machine translation along the way! Many use cases can be built on top of this model.
If you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $5-a-month commitment, you unlock unlimited access to stories on Medium.
Feel free to follow me on Medium, Twitter, and YouTube, or say Hi on LinkedIn. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!