Speech-to-Text with OpenAI’s Whisper

Easy speech to text

Published in Towards Data Science, Oct 1, 2022


OpenAI recently released a new speech recognition model called Whisper. Unlike DALL-E 2 and GPT-3, Whisper is a free and open-source model.

Whisper is an automatic speech recognition model trained on 680,000 hours of multilingual data collected from the web. According to OpenAI, the model is robust to accents, background noise and technical language. In addition, it supports transcription in 99 different languages, as well as translation from those languages into English.

This article explains how to convert speech into text using the Whisper model and Python. It does not cover how the model works or the model architecture. You can read more about Whisper here.
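As a rough idea of what this looks like in Python, here is a minimal sketch, assuming the open-source `whisper` package is installed and `ffmpeg` is available on the system. The helper name `transcribe_file` and its parameters are my own for illustration and are not part of the library; `whisper.load_model` and `model.transcribe` are the library calls it wraps.

```python
from typing import Optional


def transcribe_file(
    audio_path: str,
    model_name: str = "base",
    task: str = "transcribe",
    language: Optional[str] = None,
) -> str:
    """Return the text Whisper produces for one audio file.

    transcribe_file is a hypothetical helper, not a Whisper API.
    task="transcribe" keeps the spoken language;
    task="translate" asks the model for English output.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")

    # Imported lazily so this module can be loaded without Whisper installed.
    # Assumed setup: pip install -U openai-whisper, plus ffmpeg on the PATH.
    import whisper

    model = whisper.load_model(model_name)

    # Only pass a language when the caller knows it; otherwise Whisper
    # detects the language from the audio itself.
    options = {"task": task}
    if language is not None:
        options["language"] = language

    result = model.transcribe(audio_path, **options)
    return result["text"].strip()


# Example call (the file name is a placeholder):
# print(transcribe_file("audio.mp3"))
```

Loading the model inside the function keeps the sketch short, but it reloads the weights on every call. For more than a handful of files it would probably make sense to load the model once and reuse it.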

Whisper comes in five model sizes (refer to the table below, which is available on OpenAI’s GitHub page). According to OpenAI, four of them also have versions for English-only applications, denoted by the .en suffix. The English-only models tend to perform better, especially tiny.en and base.en; however, the difference becomes less significant for the small.en and medium.en models.
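The naming scheme described above can be sketched as a small helper that builds the string passed to `whisper.load_model`. The function `model_name` and the two constants are hypothetical names used only for illustration. The size names come from the .en variants mentioned above, plus `large` as the fifth model, which has no English-only version.

```python
# The five model sizes; only the first four have an English-only variant.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large")
ENGLISH_ONLY_SIZES = ("tiny", "base", "small", "medium")


def model_name(size: str, english_only: bool = False) -> str:
    """Build a Whisper model name such as "base" or "base.en".

    model_name is a hypothetical helper, not part of the whisper package.
    """
    if size not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    if not english_only:
        return size
    if size not in ENGLISH_ONLY_SIZES:
        raise ValueError(f"no English-only variant for size: {size!r}")
    return f"{size}.en"


print(model_name("tiny", english_only=True))
print(model_name("large"))
```

For English audio on a modest machine, `model_name("base", english_only=True)` would be a reasonable starting point, since the English-only advantage is reported to be largest for the smaller models.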

For this article, I am converting a YouTube video into audio and passing the audio into a whisper…
