Using CNNs and RNNs for Music Genre Recognition

Doing cool things with data!

Priya Dwivedi
Towards Data Science


Audio data is becoming an important part of machine learning. We use audio to interact with smart agents like Siri and Alexa, and audio will also be important for self-driving cars so they can not only “see” their surroundings but “hear” them as well. I wanted to explore deep learning techniques on audio files, and music analysis seemed like an interesting area with lots of promising research. In this blog I look at architectures that combine CNNs and RNNs to classify music clips into 8 different genres. I have also visualized filter activations in different CNN layers. If you are new to deep learning and want to learn about CNNs and deep learning for computer vision, please check out my blog here.

The code for this project is available on my Github. My hope is that the techniques covered here have broader application than just music. Hope you enjoy!

Music Classification

Data Set and Conversion to Mel-Spectrograms

There are a few different datasets with music data; GTZAN and the Million Song Dataset (MSD) are two of the most commonly used. But both of these datasets have limitations: GTZAN has only 100 songs per genre, and MSD has around a million songs but only their metadata, no audio files. I decided to use the Free Music Archive (FMA) small dataset. You can use their GitHub link to download the small dataset (8 GB), which has raw audio files plus metadata. The FMA small dataset has 8 genres and 1,000 songs per genre, evenly distributed. The eight genres are Electronic, Experimental, Folk, Hip-Hop, Instrumental, International, Pop and Rock.

Converting audio data into mel-spectrograms

Each audio file was converted into a spectrogram, a visual representation of its spectrum of frequencies over time. A regular spectrogram is the squared magnitude of the short-time Fourier transform (STFT) of the audio signal. This regular spectrogram is then squashed onto the mel scale, which maps audio frequencies onto a scale closer to how humans perceive pitch. I used the built-in function in the librosa library to convert each audio file directly into a mel-spectrogram. The most important parameters in the transformation are the window length, which sets the window of time over which each Fourier transform is performed, and the hop length, which is the number of samples between successive frames. A typical window length is 2048 samples, which at librosa's default sampling rate of 22,050 Hz corresponds to roughly 93 ms. I chose a hop length of 512. Furthermore, the mel-spectrograms produced by librosa were scaled with a log function, converting power to decibels (dB), the logarithmic scale normally used to describe perceived loudness. As a result of this transformation, each audio file is converted into a mel-spectrogram of shape (640, 128).

As seen in Figure 1, different genres show noticeable differences in their mel-spectrograms, which gives us confidence that a CNN can do the classification.

Spectrograms: Pop (top left), Instrumental (top right), Experimental (bottom left) and Folk (bottom right)

Librosa makes it easy to create spectrograms; three lines of code can convert an audio file into a mel-spectrogram!

import librosa
import numpy as np

y, sr = librosa.load(filename)  # decode the audio; default sampling rate is 22050 Hz
spect = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512)
spect = librosa.power_to_db(spect, ref=np.max)  # log-scale the power to decibels

To speed up training, I divided the data set into train, validation and test sets, converted their respective audio files into spectrograms, and pickled the results so the data could be loaded directly. You can find this data on the drive link.
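As a rough sketch of that preprocessing step (the helper names, the output file name, and the truncation to 640 frames are my assumptions, not the exact code from the repo):

import pickle
import numpy as np
import librosa

def audio_to_melspectrogram(filename, n_frames=640):
    # Compute the log-scaled mel-spectrogram and transpose it to (time, mel)
    # so it matches the (640, 128) shape mentioned above.
    y, sr = librosa.load(filename)
    spect = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512)
    spect = librosa.power_to_db(spect, ref=np.max)
    return spect.T[:n_frames]

def pickle_split(file_list, labels, out_path):
    # Convert one split (train, validation or test) and save it to disk
    # so it can be loaded directly at training time.
    X = np.array([audio_to_melspectrogram(f) for f in file_list])
    y = np.array(labels)
    with open(out_path, 'wb') as f:
        pickle.dump((X, y), f)

# e.g. pickle_split(train_files, train_labels, 'train_spectrograms.pkl')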

Building CNN-RNN Models

Why use CNNs and RNNs?

One question that arises is why we need to use both CNNs and RNNs. A spectrogram is a visual representation of audio across the frequency and time dimensions. A CNN makes sense because the spectrograms of songs are a lot like images, each with its own distinct patterns. RNNs, on the other hand, excel at understanding sequential data by making the hidden state at time t dependent on the hidden state at time t-1. Since spectrograms have a time component, RNNs can do a much better job of identifying the short-term and long-term temporal features in a song.

I tried two interesting CNN-RNN architectures: a Convolutional Recurrent Model (CRNN) and a Parallel CNN-RNN model. Both models were coded in Keras and you can find the code on my Github.

Convolutional Recurrent Model

The inspiration for this model comes from the work by DeepSound and from the paper by Keunwoo Choi et al. This model uses 1D convolution layers that perform the convolution operation across the time dimension only. Each 1D convolution layer extracts features from a small slice of the spectrogram. A ReLU activation is applied after the convolution, followed by batch normalization, and finally 1D max pooling, which reduces the spatial dimensions and helps prevent overfitting. This chain of operations (1D convolution, ReLU, batch normalization, 1D max pooling) is performed 3 times. The output of the convolutional layers is fed into an LSTM, which should capture the short-term and long-term structure of the song. The LSTM uses 96 hidden units, and its output is passed into a dense layer of 64 units. The final output layer is a dense layer with softmax activation and 8 units to assign probabilities to the 8 classes. Both dropout and L2 regularization were used between the layers to reduce overfitting. The figure below shows the overall architecture of the model.

CRNN Model
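For concreteness, here is a minimal Keras sketch of this architecture. The LSTM (96 units), dense (64 units) and softmax (8 units) sizes follow the description above; the 1D filter counts, kernel size and dropout rates are illustrative assumptions rather than the exact values from my code.

from tensorflow.keras import layers, models, regularizers

def build_crnn(input_shape=(640, 128), num_classes=8, weight_decay=0.001):
    model = models.Sequential()
    model.add(layers.InputLayer(input_shape=input_shape))
    # Three blocks of 1D convolution (across time) -> ReLU -> batch norm -> 1D max pooling
    for filters in (64, 128, 128):  # filter counts are assumptions
        model.add(layers.Conv1D(filters, kernel_size=3, activation='relu',
                                kernel_regularizer=regularizers.l2(weight_decay)))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling1D(pool_size=2))
        model.add(layers.Dropout(0.3))
    # LSTM summarizes the temporal structure of the convolutional features
    model.add(layers.LSTM(96))
    model.add(layers.Dense(64, activation='relu',
                           kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(num_classes, activation='softmax'))
    return model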

The model was trained using the Adam optimizer with a learning rate of 0.001, and the loss function was categorical cross-entropy. The model was trained for 70 epochs, and the learning rate was reduced if the validation accuracy plateaued for at least 10 epochs.
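Roughly, that training setup looks like this in Keras, assuming the build_crnn helper from the sketch above and that X_train/y_train and X_val/y_val have been loaded from the pickled spectrograms; the learning-rate reduction factor and batch size are assumptions.

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

model = build_crnn()
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# Reduce the learning rate when validation accuracy plateaus for 10 epochs
reduce_lr = ReduceLROnPlateau(monitor='val_accuracy', factor=0.5, patience=10)
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=70, batch_size=32,
                    callbacks=[reduce_lr])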

The loss and accuracy curves for the training and validation samples are shown below. As seen, the model has low bias but high variance, implying it still overfits the training data a bit even after using several regularization techniques.

CRNN — Loss and Accuracy

Overall this model got to around 53% accuracy on the validation set.

Parallel CNN-RNN Model

Inspired by the work of Lin Fen and Shenlen Liu, I also tried a Parallel CNN-RNN model. The key idea behind this network is that even though the CRNN uses an RNN as the temporal summarizer, it can only summarize temporal information from the output of the CNN; the temporal relationships in the original musical signal are not preserved by the CNN operations. This model instead passes the input spectrogram through CNN and RNN layers in parallel, concatenates their outputs, and sends the result through a dense layer with softmax activation to perform the classification, as shown below.

Parallel CNN-RNN Model

The convolutional block of the model consists of 2D convolution layers, each followed by a 2D max pooling layer, in contrast to the CRNN model, which uses 1D convolution and max pooling. There are 5 blocks of convolution plus max pooling, and the final output is flattened into a tensor of shape (None, 256).
The recurrent block starts with a 2D max pooling layer of pool size (4, 2) to reduce the size of the spectrogram before the recurrent operation; this feature reduction was done primarily to speed up processing. The reduced image is sent to a bidirectional GRU with 64 units, whose output is a tensor of shape (None, 128).
The outputs of the convolutional and recurrent blocks are then concatenated, resulting in a tensor of shape (None, 384). Finally, we have a dense layer with softmax activation.
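A sketch of this model in the Keras functional API is shown below. The bidirectional GRU (64 units), the (4, 2) pooling and the softmax output follow the description above; the 2D filter counts and pooling sizes in the convolutional branch are assumptions, so the flattened size will not necessarily come out to exactly (None, 256).

from tensorflow.keras import layers, models

def build_parallel_cnn_rnn(input_shape=(640, 128, 1), num_classes=8):
    inp = layers.Input(shape=input_shape)

    # Convolutional branch: five blocks of Conv2D + MaxPooling2D, then Flatten
    x = inp
    for filters in (16, 32, 64, 64, 64):  # filter counts are assumptions
        x = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    conv_out = layers.Flatten()(x)

    # Recurrent branch: shrink the spectrogram, then a bidirectional GRU (64 units)
    y = layers.MaxPooling2D(pool_size=(4, 2))(inp)     # (160, 64, 1)
    y = layers.Reshape((160, 64))(y)                   # drop the channel axis
    rnn_out = layers.Bidirectional(layers.GRU(64))(y)  # (None, 128)

    # Concatenate both branches and classify
    merged = layers.concatenate([conv_out, rnn_out])
    out = layers.Dense(num_classes, activation='softmax')(merged)
    return models.Model(inp, out)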
The model was trained using the RMSProp optimizer with a learning rate of 0.0005, and the loss function was categorical cross-entropy. The model was trained for 50 epochs, and the learning rate was reduced if the validation accuracy plateaued for at least 10 epochs.

The figure below shows the loss and accuracy curves for this model.

This model reached around 51% accuracy on the validation set. Both models have very similar overall accuracies, which is quite interesting, but their class-wise performance is quite different: the Parallel CNN-RNN model performs better on the Experimental, Folk, Hip-Hop and Instrumental genres. Ensembling the two models should produce even better results, as sketched below.
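A simple way to ensemble them is to average the softmax probabilities of the two trained models, for example (assuming both trained models and a test set shaped as in the sketches above are in memory):

import numpy as np

# Average the class probabilities of the two models and take the arg-max class;
# the 2D model expects an extra channel axis on its input.
probs = (crnn_model.predict(X_test) +
         parallel_model.predict(X_test[..., np.newaxis])) / 2.0
ensemble_preds = np.argmax(probs, axis=1)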

One question I asked myself was why the accuracy is only around 51%. I think there are two reasons for this:

  • The sample size of 1,000 spectrograms per genre is actually very small for building a deep learning model from scratch. As the loss curves show, both models overfit quickly. Ideally we would have a bigger training set; one thing that could be tried is breaking each song into three 10-second segments with the same label, tripling the size of the data set (see the sketch after this list).
  • The FMA data set is challenging; in particular, it has a few classes, like Experimental and International, that are easy to confuse. The top leaderboard score on the FMA Genre Recognition challenge has a test F1 score of only around 0.63.
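As an illustration of the segmentation idea in the first bullet, a small helper like the following could cut each clip into three 10-second pieces that all inherit the song's genre label (the function name and defaults are mine, not from the repo):

import librosa

def split_into_segments(filename, segment_seconds=10, n_segments=3):
    # Load the clip and slice it into consecutive fixed-length segments;
    # each segment keeps the genre label of the full song.
    y, sr = librosa.load(filename)
    seg_len = segment_seconds * sr
    return [y[i * seg_len:(i + 1) * seg_len]
            for i in range(n_segments)
            if (i + 1) * seg_len <= len(y)]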

Visualizing Activations in Different Layers Using the Keras Visualization Toolkit

I explored the features learned by the initial vs. later layers of the convolutional model. For this analysis I used the Keras Visualization package and selected the Parallel CNN-RNN model, since it uses 2D CNN layers, which are easier to visualize.

The first convolution block in this model has 16 filters and the fifth convolution block has 64 filters. To understand what a filter is focusing on, we look at what kind of input maximizes the activations in that filter. The figure below shows the filter activations of all 16 filters in the first convolution block vs. the first 24 filters of the fifth convolution block.

Filter Activations for first and fifth convolutional block

What I observe here is that the filters in the first layer are relatively straightforward. They are looking at a small kernel size of (3, 1), so they focus on different patterns of fluctuation, primarily between 50 and 200 dB. In the fifth convolution block, a filter of the same size sees a bigger slice of the input image, because the feature maps have been shrunk by the convolution and max pooling operations. It is now able to focus on larger features, like sharp increases in amplitude up to around 250 dB as well as periods of very low amplitude, coloured as the black regions.
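For reference, generating one of these activation-maximization images with the keras-vis package looks roughly like this ('conv2d_1' is a placeholder layer name; pick the actual layer name from model.summary()):

from vis.utils import utils
from vis.visualization import visualize_activation
import matplotlib.pyplot as plt

# Find the layer to visualize and synthesize an input that maximally
# activates one of its filters.
layer_idx = utils.find_layer_idx(model, 'conv2d_1')
img = visualize_activation(model, layer_idx, filter_indices=0)
plt.imshow(img[..., 0])
plt.show()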

I hope you liked this analysis and will try the code out for yourself.

I have my own deep learning consultancy and love to work on interesting problems. I have helped many startups deploy innovative AI-based solutions. Check us out at http://deeplearninganalytics.org/.

You can also see my other writings at: https://medium.com/@priya.dwivedi

If you have a project that we can collaborate on, then please contact me through my website or at info@deeplearninganalytics.org
