Getting to Know the Mel Spectrogram

Published in Towards Data Science

5 min read · Aug 19, 2019

Read this short post if you want to be like Neo and know all about the Mel Spectrogram! (Oh, maybe not all, but at least a little)

For the tl;dr and full code, go here.

A Real Conversation That Happened in My Head a Few Days Ago

Me: Hi Mel Spectrogram, may I call you Mel?
Mel: Sure.

Me: Thanks. So Mel, when we first met, you were quite the enigma to me.
Mel: Really? How’s that?

Me: You are composed of two concepts whose whole purpose is to make abstract notions accessible to humans - the Mel Scale and the Spectrogram - yet you yourself were quite difficult for me, a human, to understand.
Mel: Is there a point to this one-sided speech?

Me: And do you know what bothered me even more? I heard through the grapevine that you are quite the buzz in DSP (Digital Signal Processing), yet I found very little intuitive information about you online.
Mel: Should I feel bad for you?

Me: So anyway, I didn’t want to let you be misunderstood, so I decided to write about you.
Mel: Gee. That’s actually kinda nice. Hope more people will get me now.

Me: With pleasure, my friend. I think we can talk about what your core elements are, and then show some nice tricks using the librosa package in Python.
Mel: Oooh that’s great! I love librosa! It can generate me with one line of code!

Me: Wonderful! And let’s use this beautiful whale song as our toy example throughout this post! What do you think?
Mel: You know you’re talking to yourself, right?

The Spectrogram

Visualizing sound is kind of a trippy concept. There are some mesmerizing ways to do that, and also more mathematical ones, which we will explore in this post.

Photo credit: Chelsea Davis. See more of this beautiful artwork here.

When we talk about sound, we generally talk about a sequence of vibrations in varying pressure strengths, so to visualize sound kinda means to visualize airwaves.

But this waveform plot is just a two-dimensional representation of this complex and rich whale song!
Another mathematical representation of sound is the Fourier Transform. Without going into too many details (watch this educational video for a comprehensible explanation), the Fourier Transform is a function that takes a signal in the time domain as input and outputs its decomposition into frequencies.

Let’s take for example one short time window and see what we get from applying the Fourier Transform.
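Here is a minimal sketch of that single-window step using plain NumPy. The sample rate, the window length and the two-tone test signal are stand-ins chosen for illustration (the whale recording itself is not reproduced here), so treat the numbers as assumptions rather than the values behind the plots in this post.

```python
import numpy as np

sr = 22050                  # assumed sample rate in Hz
n_fft = 2048                # assumed window length in samples
t = np.arange(n_fft) / sr   # time stamp of every sample in the window

# Stand-in for one window of audio: a 440 Hz tone plus a quieter 1760 Hz tone
window = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1760 * t)

# Taper the window edges with a Hann window, then take the FFT.
# rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins.
spectrum = np.abs(np.fft.rfft(window * np.hanning(n_fft)))
freqs = np.fft.rfftfreq(n_fft, d=1 / sr)   # frequency in Hz of each bin

peak_hz = freqs[np.argmax(spectrum)]
print(f"strongest frequency: about {peak_hz:.0f} Hz")
```

The frequency bins are spaced sr / n_fft apart (roughly 10.8 Hz with these values), so the reported peak can only be as precise as that spacing.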

Now let’s take the complete whale song, split it into time windows, and apply the Fourier Transform to each time window.
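The windowing step can be sketched by hand as below. The helper name stft_magnitude and the gliding test tone are made up for this illustration, and the sketch skips the edge padding that librosa applies by default, so its frame count will differ slightly from what librosa.stft returns.

```python
import numpy as np

def stft_magnitude(y, n_fft=2048, hop_length=512):
    """FFT magnitude of each Hann-windowed frame, shape (n_fft // 2 + 1, n_frames)."""
    n_frames = 1 + (len(y) - n_fft) // hop_length
    hann = np.hanning(n_fft)
    frames = np.stack([
        y[i * hop_length : i * hop_length + n_fft] * hann
        for i in range(n_frames)
    ])
    # One FFT per row (frame); transpose so rows are frequencies and columns are time
    return np.abs(np.fft.rfft(frames, axis=1)).T

sr = 22050                      # assumed sample rate in Hz
t = np.arange(2 * sr) / sr      # two seconds of samples
# Stand-in signal: a tone whose frequency glides upward as 200 + 800 * t Hz
y = np.sin(2 * np.pi * (200 * t + 400 * t ** 2))

S = stft_magnitude(y)
print(S.shape)                  # (frequency bins, time frames)
```

Each column of S is the spectrum of one window, so reading the columns left to right shows how the frequency content changes over time.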

Wow, can’t see much here, can we? It’s because most sounds humans hear are concentrated in very small frequency and amplitude ranges.

Let’s make another small adjustment - transform both the y-axis (frequency) to log scale, and the “color” axis (amplitude) to Decibels, which is kinda the log scale of amplitudes.
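To make the "color" axis concrete, here is a small hand-rolled amplitude-to-Decibel conversion. The floor (amin) and the 80 dB clipping range are assumptions picked for this sketch; librosa offers librosa.amplitude_to_db for the same job, and this is not a copy of its implementation.

```python
import numpy as np

def amplitude_to_db(S, amin=1e-5, top_db=80.0):
    """Convert magnitudes to decibels relative to the largest value in S."""
    S = np.maximum(np.abs(S), amin)            # floor tiny values to avoid log(0)
    db = 20.0 * np.log10(S / S.max())          # the loudest value maps to 0 dB
    return np.maximum(db, db.max() - top_db)   # clip anything quieter than -top_db

amplitudes = np.array([1.0, 0.1, 0.01, 0.0])
db = amplitude_to_db(amplitudes)
print(db)
```

Every factor of 10 in amplitude becomes a step of 20 dB, which is why quiet details that vanish on a linear color scale become visible after the conversion.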

Now this is what we call a Spectrogram!

The Mel Scale

Let’s forget for a moment about all these lovely visualizations and talk math. The Mel Scale, mathematically speaking, is the result of a non-linear transformation of the frequency scale. It is constructed so that sounds that are equally far apart on the Mel Scale also “sound” to humans as being equally far apart.
This is in contrast to the Hz scale, where the difference between 500 and 1000 Hz is obvious, whereas the difference between 7500 and 8000 Hz is barely noticeable.
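One widely used closed form for this transformation is m = 2595 * log10(1 + f / 700), often called the HTK formula. It is used here as an assumption for illustration: librosa's default conversion is a slightly different (Slaney-style) variant, with the HTK form available through htk=True. The sketch below applies it to the two frequency pairs mentioned above.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel formula: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

low_gap = hz_to_mel(1000) - hz_to_mel(500)     # same 500 Hz step, low in the spectrum
high_gap = hz_to_mel(8000) - hz_to_mel(7500)   # same 500 Hz step, high in the spectrum

print(f"500 -> 1000 Hz spans {low_gap:.0f} mels")
print(f"7500 -> 8000 Hz spans {high_gap:.0f} mels")
```

The same 500 Hz step covers several times more mels at the low end than at the high end, which matches the intuition that the low pair sounds much further apart.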

Luckily, someone computed this non-linear transformation for us, and all we need to do to apply it is use the appropriate command from librosa.

Yup. That’s it.
But what does this give us?
It partitions the Hz scale into bins, and transforms each bin into a corresponding bin in the Mel Scale, using overlapping triangular filters.

Now what does this give us?
Now we can take the amplitude of one time window, compute its dot product with mel (the filter bank we just generated) to perform the transformation, and get a visualization of the sound in this new frequency scale.
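For the curious, here is a rough sketch of what such a filter bank looks like when built by hand, followed by the dot product with a single window. The function name mel_filter_bank, the HTK-style mel formula and the 440 Hz test tone are assumptions made for this example; the matrix returned by librosa's own filter function uses a different mel variant and normalization by default, so the numbers will not match exactly.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(sr, n_fft, n_mels):
    """Overlapping triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # n_mels triangles need n_mels + 2 edges, equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - left) / (center - left)      # slope up to the peak
        falling = (right - fft_freqs) / (right - center)   # slope down from the peak
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

sr, n_fft, n_mels = 22050, 2048, 128
mel = mel_filter_bank(sr, n_fft, n_mels)

# One window of a stand-in signal (a 440 Hz tone), turned into FFT magnitudes
t = np.arange(n_fft) / sr
magnitudes = np.abs(np.fft.rfft(np.sin(2 * np.pi * 440 * t) * np.hanning(n_fft)))

mel_frame = mel.dot(magnitudes)   # (128, 1025) . (1025,) -> (128,)
print(mel.shape, mel_frame.shape)
```

Each row of mel is one triangle, so the dot product simply sums the FFT magnitudes that fall under that triangle, weighted by its height.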

Mmmm, still doesn’t mean much, huh?
OK let’s wrap it all up and see what we get.

The Mel Spectrogram

We now know what a Spectrogram is, and also what the Mel Scale is, so the Mel Spectrogram is, rather surprisingly, a Spectrogram with the Mel Scale as its y-axis.

And this is how you generate a Mel Spectrogram with one line of code, and display it nicely using just 3 more:

Recap

The Mel Spectrogram is the result of the following pipeline:

Separate to windows: Sample the input with windows of size n_fft=2048, making hops of size hop_length=512 each time to sample the next window.
Compute FFT (Fast Fourier Transform) for each window to transform from time domain to frequency domain.
Generate a Mel scale: Take the entire frequency spectrum, and separate it into n_mels=128 evenly spaced frequencies.
And what do we mean by evenly spaced? Not by distance on the frequency dimension, but by distance as it is heard by the human ear.
Generate Spectrogram: For each window, decompose the magnitude of the signal into its components, corresponding to the frequencies in the mel scale.

To Be Continued…

Now that we know the Mel Spectrogram as well as Neo would, what are we going to do with it?

Well, that’s for another post…

In the meantime, got any wild projects you are working on using the Mel Spectrogram? Please share in the comments!