Speech-to-Text with OpenAI’s Whisper
Easy speech to text
OpenAI has recently released a new speech recognition model called Whisper. Unlike DALLE-2 and GPT-3, Whisper is a free and open-source model.
Whisper is an automatic speech recognition model trained on 680,000 hours of multilingual data collected from the web. As per OpenAI, this model is robust to accents, background noise and technical language. In addition, it supports 99 different languages’ transcription and translation from those languages into English.
This article explains how to convert speech into text using the Whisper model and Python. And, it won’t cover how the model works or the model architecture. You can check more about the Whisper here.
Whisper has five models (refer to the below table). Below is the table available on OpenAI’s GitHub page. According to OpenAI, four models for English-only applications, which is denoted as .en
. The model performs better for tiny.en
and base.en
, however, differences would become less significant for the small.en
and medium.en
models.
For this article, I am converting Youtube video into audio and passing the audio into a whisper…