Terms you need to know to start Speech Processing with Deep Learning

Published in Towards Data Science, May 10, 2021

We all love our Alexas and Siris, but did you know that you can build one yourself from scratch?

In the coming series of posts, we will discuss audio and the deep learning research around it. But first, let’s take some time to understand the basic terms related to processing audio.

In this post, I cover the following topics, which should help you delve deeper into text-to-speech and speech processing:

Audio
Fourier Transform
Short-time Fourier transform and Spectrogram
Mel-Spectrogram
Phoneme

Audio

Speech is our primary mode of communication, and we listen to audio constantly. But how does a computer understand audio? A computer can only work with numbers, so we measure the air pressure at regular intervals over a period of time, and that sequence of measurements becomes an audio clip. The rate at which we take these samples (the sampling rate) can vary; a common choice is 44.1 kHz, that is, 44,100 samples per second.

Above is the digital representation of an audio clip. Suppose I sampled this clip at 16 kHz: every second of audio is then a sequence of 16,000 amplitudes. So a 10-second clip contains 16,000 × 10 = 160,000 amplitudes, and that is a lot! How do we extract the necessary information from this giant set of amplitudes? This is where the Fourier transform helps us.
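To make the sample count concrete, here is a minimal sketch that generates a synthetic clip instead of loading a real recording. The helper name `sample_sine` and the 440 Hz tone are assumptions chosen purely for illustration.

```python
import math

def sample_sine(freq_hz, sample_rate, duration_s):
    """Return the amplitudes of a sine tone measured sample_rate times per second."""
    n_samples = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]

# A hypothetical 10-second clip sampled at 16 kHz.
clip = sample_sine(freq_hz=440.0, sample_rate=16000, duration_s=10)
print(len(clip))  # number of amplitudes in the clip
```

Every element of `clip` is one amplitude, so the length of the list is simply the sampling rate multiplied by the duration in seconds.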

Fourier Transform

An audio signal is composed of several single-frequency sound waves. In the above representation, we could only see the wave that results from adding together, at each time step, the amplitudes of all the waves at different frequencies. The Fourier transform helps us here by decomposing a signal into its individual frequencies and the amplitude corresponding to each frequency. In other words, we convert the signal from the time domain to the frequency domain. This conversion is possible because an audio signal can be expressed as a sum of sine and cosine waves.
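The decomposition can be sketched with a naive discrete Fourier transform written in plain Python. This is an illustrative toy, not how production libraries compute it, and the signal (a 4 Hz wave plus a weaker 10 Hz wave sampled at 64 Hz) is an assumption made up for the example.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform: one complex coefficient per frequency bin."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]

sample_rate = 64  # samples per second (toy value)
n = 64            # one second of audio

# A 4 Hz wave with amplitude 1.0 plus a 10 Hz wave with amplitude 0.5.
signal = [
    1.0 * math.sin(2 * math.pi * 4 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 10 * t / sample_rate)
    for t in range(n)
]

spectrum = dft(signal)

# Bin k corresponds to k * sample_rate / n Hz; keep only bins below sample_rate / 2.
amplitudes = [2 * abs(c) / n for c in spectrum[: n // 2]]

# Map frequency in Hz to amplitude for every bin that is clearly non-zero.
peaks = {k * sample_rate // n: round(a, 3) for k, a in enumerate(amplitudes) if a > 0.01}
print(peaks)
```

In the time domain we only see the summed wave, but the frequency domain should show two clear peaks, one near 4 Hz and a weaker one near 10 Hz, with amplitudes close to the 1.0 and 0.5 used to build the signal.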

The fast Fourier transform (FFT) is simply an efficient algorithm for computing the Fourier transform. Below is the FFT of the audio sample we visualized earlier.

FFT for the above audio (Image by author)

The above diagram shows that this particular audio signal has low-frequency waves with higher amplitudes and high-frequency waves with lower amplitudes.

In general, the amplitudes of the different frequency waves in an audio clip vary with time. A clip can have some patches that contain only high-frequency waves with high amplitudes and other patches that contain only low-frequency waves with high amplitudes. If we compute a single FFT over the whole clip, the resulting spectrum shows amplitudes averaged over the entire clip for both low and high frequencies, so we can no longer tell when each one occurred. Alternatively, we can split the original clip at fixed intervals of time and compute one FFT per patch. This set of FFTs is more informative about the changes happening in the original clip because each one captures the local information.
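The averaging effect can be seen on a toy clip whose first half is a pure low tone and whose second half is a pure high tone. The helper `amplitude_at` is a hypothetical name; it evaluates a single Fourier coefficient directly rather than running a full FFT, and the 4 Hz and 20 Hz tones at a 64 Hz sampling rate are assumptions for the sketch.

```python
import cmath
import math

def amplitude_at(samples, freq_hz, sample_rate):
    """Amplitude of one frequency in samples, from a single Fourier coefficient."""
    n = len(samples)
    coeff = sum(
        x * cmath.exp(-2j * cmath.pi * freq_hz * t / sample_rate)
        for t, x in enumerate(samples)
    )
    return 2 * abs(coeff) / n

sample_rate = 64
low_patch = [math.sin(2 * math.pi * 4 * t / sample_rate) for t in range(64)]    # 4 Hz only
high_patch = [math.sin(2 * math.pi * 20 * t / sample_rate) for t in range(64)]  # 20 Hz only
clip = low_patch + high_patch

# One transform over the whole clip: both tones appear, each at reduced amplitude.
whole_low = amplitude_at(clip, 4, sample_rate)
whole_high = amplitude_at(clip, 20, sample_rate)

# One transform per patch: each tone appears at full amplitude, and only in its own patch.
patch_spectra = [
    (amplitude_at(patch, 4, sample_rate), amplitude_at(patch, 20, sample_rate))
    for patch in (clip[:64], clip[64:])
]
print(round(whole_low, 3), round(whole_high, 3))
print([(round(lo, 3), round(hi, 3)) for lo, hi in patch_spectra])
```

The whole-clip numbers say that both tones are present but not when; the per-patch numbers recover the fact that the low tone comes first and the high tone second.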

Short-time Fourier Transform and Spectrogram

We saw that multiple FFTs are more informative than a single one, so we use the short-time Fourier transform (STFT). The STFT divides a longer audio signal into shorter segments of equal length, which usually overlap, and then computes the Fourier transform separately on each segment.

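Window length is the length of each fixed-size segment into which the STFT divides the signal.
Hop length is the distance between the starts of two consecutive windows, that is, the part of a window that does not overlap with the next one.
Overlap length is the part of a window that it shares with the next one, so overlap length = window length - hop length.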
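The three lengths defined above can be illustrated with a small framing sketch. The function name `frame_signal` and the tiny window of 8 samples with a hop of 4 are assumptions for readability; real speech pipelines use much longer windows and typically multiply each frame by a window function (such as a Hann window) before the Fourier transform, which is omitted here.

```python
def frame_signal(samples, window_length, hop_length):
    """Split samples into equal-length windows whose starts are hop_length apart."""
    frames = []
    # Stop once a full window no longer fits into the remaining samples.
    for start in range(0, len(samples) - window_length + 1, hop_length):
        frames.append(samples[start:start + window_length])
    return frames

samples = list(range(16))  # stand-in for 16 audio amplitudes
window_length = 8
hop_length = 4
overlap_length = window_length - hop_length

frames = frame_signal(samples, window_length, hop_length)
for frame in frames:
    print(frame)
```

Each printed frame shares its last `overlap_length` samples with the start of the next one. An STFT would then apply the Fourier transform to every frame separately.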

To represent the STFT, we use a spectrogram. In a spectrogram, the y-axis is frequency (often drawn on a log scale) and each step along the x-axis corresponds to one window of the STFT, with consecutive windows one hop length apart. The value at (x, y) is the amplitude (on the dB scale) at that window’s time and that frequency. The dB scale is a logarithmic scale for amplitude.
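As a rough sketch of what the dB scale does to amplitudes, the conversion below uses the standard 20 · log10 relationship for amplitude ratios. The reference amplitude of 1.0 and the small floor that avoids taking the log of zero are assumptions; libraries make their own choices for both.

```python
import math

def amplitude_to_db(amplitude, reference=1.0, floor=1e-10):
    """Convert a linear amplitude to decibels relative to reference."""
    # Clamp to a small floor so that silence (amplitude 0) does not hit log10(0).
    return 20.0 * math.log10(max(amplitude, floor) / reference)

for amplitude in (1.0, 0.1, 0.01):
    print(amplitude, round(amplitude_to_db(amplitude), 1))
```

Every tenfold drop in amplitude lowers the value by 20 dB, which is why quiet components that would be invisible on a linear scale remain visible in a dB-scaled spectrogram.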

Mel-Spectrogram

The Mel scale, mathematically speaking, is the result of a non-linear transformation of the frequency scale. It is constructed to approximate how sensitive the human ear is to the difference between two sounds: if a listener can clearly tell two sounds apart, they lie further from each other on the Mel scale than two sounds that a listener cannot clearly tell apart.
This is an improvement over the Hz scale, where the gap between 500 Hz and 1000 Hz is the same size as the gap between 7500 Hz and 8000 Hz. To the human ear, the difference between 500 Hz and 1000 Hz is quite noticeable, whereas the difference between 7500 Hz and 8000 Hz is barely noticeable.

So we convert the spectrogram above into a Mel-spectrogram by mapping its frequency axis to the Mel scale, which gives a more practical representation of our data.
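The 500–1000 Hz versus 7500–8000 Hz comparison can be checked numerically. The sketch below assumes one commonly used Hz-to-Mel formula, m = 2595 · log10(1 + f / 700); other variants of the Mel scale exist, so treat the exact numbers as illustrative.

```python
import math

def hz_to_mel(freq_hz):
    """Map a frequency in Hz to the Mel scale (one common formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

# Both gaps are 500 Hz wide on the Hz scale.
low_gap = hz_to_mel(1000) - hz_to_mel(500)
high_gap = hz_to_mel(8000) - hz_to_mel(7500)

print(round(low_gap, 1), round(high_gap, 1))
```

Although both intervals are equally wide in Hz, the low-frequency gap comes out several times larger in Mels than the high-frequency one, matching the intuition that we hear the first difference far more clearly than the second.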

Phoneme

A phoneme is the smallest unit of sound that distinguishes one word from another in both pronunciation and meaning. For instance, the /s/ in ‘soar’ distinguishes it from ‘roar’, which begins with /r/: swapping that single sound changes the pronunciation as well as the meaning.

Summing Up

Audio is a sequence of air-pressure amplitudes, which we can sample at different sampling rates.
The FFT converts a signal from the time domain to the frequency domain. But applying one FFT to the whole clip loses the information about how the frequencies change over time, hence we use the STFT.
The STFT applies the FFT over a sliding window of the audio.
To represent the STFT, we use a spectrogram.
A spectrogram shows frequency in the log domain; when we convert its frequency axis to the Mel scale, the result is a Mel-spectrogram.

I hope that this post gives you the necessary information about the terms frequently used in speech processing. To learn about different deep learning algorithms used in speech processing, check out this blog.
