CNNs for Audio Classification

A primer in deep learning for audio classification using tensorflow

Papia Nandi
Towards Data Science


Image by Author

Convolutional Neural Nets

CNNs, or convolutional neural nets, are a type of deep learning algorithm that excels at learning from images.

That’s because they can learn patterns that are translation invariant and have spatial hierarchies (F. Chollet, 2018).

Image by Author

That means that if the CNN learns the dog in the left corner of the image above, it can identify the dog in the other two pictures, where she has been moved around (translation invariance).

If the CNN learns the dog from the left corner of the image above, it will recognize pieces of the original image in the other two pictures because it has learned what the edges of her eye with heterochromia look like, her wolf-like snout, and the shape of her stylish headphones (spatial hierarchies).

These properties make CNNs formidable learners for images because the real world doesn’t always look exactly like the training data.

Can I use this for audio?

Yes. You can extract features that look like images and shape them so that they can be fed into a CNN.

This article explains how to train a CNN to classify species based on audio information.

The data for this example are bird and frog recordings from the Kaggle competition Rainforest Connection Species Audio Detection.

Image by Author

To get started, load the necessary imports:

import pandas as pd
import os
import librosa
import librosa.display
import matplotlib.pyplot as plt
from sklearn.preprocessing import normalize
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pickle
import joblib
from sklearn.model_selection import train_test_split
from tensorflow.keras import models, layers
import tensorflow as tf

Then load the dataframe:

os.chdir('/kaggle/input/rfcx-species-audio-detection')
df = pd.read_csv('train_tp.csv')

This dataset comes as a csv file with the names of audio files listed under recording_id, labels under species_id, and the start/end of the audio sample under t_min and t_max:

df.head()

Use the librosa package to load and display an audio file like this:

sample_num=3 #pick a file to display
#get the filename
filename=df.recording_id[sample_num]+str('.flac')
#define the beginning time of the signal
tstart = df.t_min[sample_num]
tend = df.t_max[sample_num] #define the end time of the signal
y,sr=librosa.load('train/'+str(filename)) #load the file
librosa.display.waveplot(y,sr=sr, x_axis='time', color='cyan')

The tricky part

The CNN is expecting an image:

  • a grayscale image (1 channel)
  • a color image with three channels: red, green and blue (RGB)
Image by Author

So you have to make your audio features look like an image.

  • Choose either 1D for a grayscale image (one feature) or 3D for a color image (to represent multiple features).
  • Scale and pad the audio features so that every “channel” is the same size.
#This code was adapted from Nicolas Gervais on https://stackoverflow.com/questions/59241216/padding-numpy-arrays-to-a-specific-size on 1/10/2021
def padding(array, xx, yy):
    """
    :param array: numpy array
    :param xx: desired height
    :param yy: desired width
    :return: padded array
    """
    h = array.shape[0]
    w = array.shape[1]
    a = max((xx - h) // 2, 0)
    aa = max(0, xx - a - h)
    b = max(0, (yy - w) // 2)
    bb = max(yy - b - w, 0)
    return np.pad(array, pad_width=((a, aa), (b, bb)), mode='constant')
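
For instance, here is a quick, illustrative check that padding does what we want (the array and sizes below are made up, not real features):

#Illustrative only: pad a made-up 12x40 "chromagram" up to 128x1000
dummy_chroma = np.random.rand(12, 40)
padded = padding(dummy_chroma, 128, 1000)
print(dummy_chroma.shape, '->', padded.shape) #(12, 40) -> (128, 1000)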

Can’t I just reshape my audio features into a 3D shape by dividing them into 3 equal parts?

They’re just numbers after all.

No. It has to make visual sense. Garbage in, garbage out.

Image by Author

Features for modeling

Librosa has great tutorials on how to extract features here.

For this example, I’m going to calculate the spectral bandwidth, spectral centroid, chromagram, short-time Fourier transform (stft) and MFCCs (see generate_features below).

The 3D image input into a CNN is a 4D tensor

The first axis will be the audio file id, representing the batch in tensorflow-speak. In this example, the spectral bandwidth, centroid and chromagram are repeated and padded so that they fill the same 128x1000 shape as the stft and the MFCCs; these three blocks then become the three channels of the "image."

#The eventual shape of the features
print(X_train.shape,X_test.shape)

The first axis 1226 is the batch size, 128 is the height, 1000 is the width (set by max_size in the code below) and 3 is the number of channels in the training data. If I have 1226 audio files, then the batch size is 1226. If we only extracted features for the 5 audio files pictured in the dataframe.head() figure, the shape of the input would be 5x128x1000x3. You can make the batch size smaller if you want to use less memory when training. For this example, the batch size is set to the number of audio files.
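
As a sanity check, here is a minimal sketch of that layout (the array below is just a placeholder, not real features); note that the input_shape passed to the model later omits the batch axis:

#Placeholder tensor in the (batch, height, width, channels) layout the CNN expects
dummy_batch = np.zeros((5, 128, 1000, 3))
print(dummy_batch.shape) #(5, 128, 1000, 3); input_shape=(128, 1000, 3) omits the batch axis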

def generate_features(y_cut, sr):
    max_size = 1000 #my max audio file feature width
    n_fft = 255
    hop_length = 512
    stft = padding(np.abs(librosa.stft(y_cut, n_fft=n_fft, hop_length=hop_length)), 128, max_size)
    MFCCs = padding(librosa.feature.mfcc(y=y_cut, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mfcc=128), 128, max_size)
    spec_centroid = librosa.feature.spectral_centroid(y=y_cut, sr=sr)
    chroma_stft = librosa.feature.chroma_stft(y=y_cut, sr=sr)
    spec_bw = librosa.feature.spectral_bandwidth(y=y_cut, sr=sr)
    #Now the padding part
    image = np.array([padding(normalize(spec_bw), 1, max_size)]).reshape(1, max_size)
    image = np.append(image, padding(normalize(spec_centroid), 1, max_size), axis=0)
    #repeat the padded spec_bw, spec_centroid and chroma_stft until they are stft and MFCC-sized
    for i in range(0, 9):
        image = np.append(image, padding(normalize(spec_bw), 1, max_size), axis=0)
        image = np.append(image, padding(normalize(spec_centroid), 1, max_size), axis=0)
        image = np.append(image, padding(normalize(chroma_stft), 12, max_size), axis=0)
    #stack the repeated features, the stft and the MFCCs as three channels
    image = np.dstack((image, np.abs(stft)))
    image = np.dstack((image, MFCCs))
    return image
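
As a quick sanity check, you can run generate_features on the clip loaded and displayed earlier (y, sr, tstart and tend from the sample_num example above) and confirm the output looks like a 128x1000x3 image:

#Sanity check on the clip loaded earlier
y_cut = y[int(round(tstart*sr)):int(round(tend*sr))] #cut to the labeled part of the signal
test_image = generate_features(y_cut, sr)
print(test_image.shape) #(128, 1000, 3)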

The following three features get squished and padded and repeated…

…into the following axis:

Second axis

The last two channels are designed to be the same shape:

stft
MFCCs

Do I have to calculate these exact same features?

No. As long as you pad them to be the same shape, use what works the best in modeling.

X=df.drop('species_id',axis=1)
y=df.species_id

Extract training, test and validation sets

#Split once to get the test and training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123, stratify=y)
print(X_train.shape,X_test.shape)
#Split twice to get the validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=123)
print(X_train.shape, X_test.shape, X_val.shape, len(y_train), len(y_test), len(y_val))

Calculate these features for every audio file and store as features and labels:

def get_features(df_in):
    features = []
    labels = [] #empty array to store labels
    df_in = df_in.reset_index()
    for i in df_in.species_id.unique():
        print('species_id:', i)
        #all the file indices with the same species_id
        filelist = df_in.loc[df_in.species_id == i].index
        for j in range(0, len(filelist)):
            filename = df_in.iloc[filelist[j]].recording_id + str('.flac') #get the filename
            #define the beginning and end times of the signal
            tstart = df_in.iloc[filelist[j]].t_min
            tend = df_in.iloc[filelist[j]].t_max
            recording_id = df_in.iloc[filelist[j]].recording_id
            species_id = i
            songtype_id = df_in.iloc[filelist[j]].songtype_id
            #Load the file from the train folder
            y, sr = librosa.load('train/' + filename, sr=28000)
            #cut the file to signal start and end
            y_cut = y[int(round(tstart*sr)):int(round(tend*sr))]
            #generate features & output numpy array
            data = generate_features(y_cut, sr)
            features.append(data[np.newaxis, ...])
            labels.append(species_id)
    output = np.concatenate(features, axis=0)
    return(np.array(output), labels)
#use get_features to calculate and store the features
test_features, test_labels = get_features(pd.concat([X_test, y_test], axis=1))
train_features, train_labels = get_features(pd.concat([X_train, y_train], axis=1))

Normalize the data and cast into a numpy array

#The CNN inputs are the feature arrays returned by get_features
X_train = np.array((train_features - np.min(train_features)) / (np.max(train_features) - np.min(train_features)))
X_test = np.array((test_features - np.min(test_features)) / (np.max(test_features) - np.min(test_features)))
X_train = X_train / np.std(X_train)
X_test = X_test / np.std(X_test)
y_train = np.array(train_labels)
y_test = np.array(test_labels)

Create a CNN

In the example model below, a 2D Convolutional Layer (Conv2D) unit is the portion that learns the translation invariant spatial patterns and their spatial hierarchies.

The Max Pooling Layer halves the size of the feature maps by downsampling them to the max value inside a window. Why downsample? Because otherwise you would end up with a ginormous number of parameters, your computer would blow up, and after all that the model would massively overfit the data. This magical layer is the reason that a CNN can handle the huge amounts of data in images. Max Pooling does a model good.

The Dropout Layer guards against overfitting by randomly setting a fraction of the layer’s outputs to zero during training, and the Dense units are fully connected hidden layers that provide the degrees of freedom the model has to fit the data. The more complex the data, the more degrees of freedom the model needs. Take care not to add a bunch of these and end up overfitting the data.

The Flatten Layer squishes all the feature map information into a single column in order to feed it into a Dense layer, the last of which outputs the 24 species that the model is supposed to classify the audio recordings into.

A sample CNN model architecture

Image by Author

In tensorflow, you create the above model like this:

input_shape=(128,1000,3)
CNNmodel = models.Sequential()
CNNmodel.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape))
CNNmodel.add(layers.MaxPooling2D((2, 2)))
CNNmodel.add(layers.Dropout(0.2))
CNNmodel.add(layers.Conv2D(64, (3, 3), activation='relu'))
CNNmodel.add(layers.MaxPooling2D((2, 2)))
CNNmodel.add(layers.Dropout(0.2))
CNNmodel.add(layers.Conv2D(64, (3, 3), activation='relu'))
CNNmodel.add(layers.Flatten())
CNNmodel.add(layers.Dense(64, activation='relu'))
CNNmodel.add(layers.Dropout(0.2))
CNNmodel.add(layers.Dense(32, activation='relu'))
CNNmodel.add(layers.Dense(24, activation='softmax'))
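
To see how many parameters each layer contributes (and how much the Max Pooling layers save you), print a summary:

CNNmodel.summary()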

The activation functions add nonlinearity to the model. Here, the relu function is used, which zeros out negative values. You can read about other activation functions here, but relu is a good one to start with. The last Dense layer’s activation function is softmax, which outputs a probability for each class.

Compile the model

CNNmodel.compile(optimizer='adam',
                 loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                 metrics=['accuracy'])

The Adam optimizer manages the learning rate for you. The loss function evaluates how different the predicted and actual values are and penalizes the model for poor predictions. In this example, the loss function is SparseCategoricalCrossentropy, which is used when each sample belongs to exactly one of several classes and the labels are provided as integers rather than one-hot vectors. This is an appropriate choice because each audio sample belongs to exactly one species and there are 24 of them.
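
As a minimal, made-up illustration of the difference: SparseCategoricalCrossentropy takes integer labels, while CategoricalCrossentropy expects one-hot vectors, and both compute the same loss for equivalent labels.

#Made-up example: integer labels vs. one-hot labels for the same two samples
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]) #predicted probabilities for 3 classes
sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy()(np.array([0, 1]), probs)
onehot_loss = tf.keras.losses.CategoricalCrossentropy()(np.array([[1., 0., 0.], [0., 1., 0.]]), probs)
print(float(sparse_loss), float(onehot_loss)) #identical values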

Fit the model

history = CNNmodel.fit(X_train, y_train, epochs=20, validation_data=(X_val, y_val))

Here X_val and y_val are the validation features and labels, prepared with get_features and normalized the same way as the training set.

To avoid overfitting, start with the simplest model and work your way up

This is because if the model is overly complex, it will learn your training data exactly, and fail to generalize to unseen data.

Try this:

input_shape=(128,1000,3)
CNNmodel = models.Sequential()
CNNmodel.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape))
CNNmodel.add(layers.MaxPooling2D((2, 2)))
CNNmodel.add(layers.Flatten())
CNNmodel.add(layers.Dense(32, activation='relu'))
CNNmodel.add(layers.Dense(24, activation='softmax'))
CNNmodel.summary()

Note: This model was too simple and was not able to predict the data at all (as in single digit accuracy).

Next, add layers until your model starts to overfit the data.

Evaluate your model on the training and validation sets

  1. Watch for big differences in performance between the training and validation sets. If the training set performs markedly better, the model won’t generalize well to unseen data.
  2. If performance on the validation set begins to decline, stop iterating (a Keras EarlyStopping sketch follows the plotting code below).
#Adapted from Deep Learning with Python by Francois Chollet, 2018
history_dict=history.history
loss_values=history_dict['loss']
acc_values=history_dict['accuracy']
val_loss_values = history_dict['val_loss']
val_acc_values=history_dict['val_accuracy']
epochs=range(1,21)
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(15,5))
ax1.plot(epochs,loss_values,'bo',label='Training Loss')
ax1.plot(epochs,val_loss_values,'orange', label='Validation Loss')
ax1.set_title('Training and validation loss')
ax1.set_xlabel('Epochs')
ax1.set_ylabel('Loss')
ax1.legend()
ax2.plot(epochs,acc_values,'bo', label='Training accuracy')
ax2.plot(epochs,val_acc_values,'orange',label='Validation accuracy')
ax2.set_title('Training and validation accuracy')
ax2.set_xlabel('Epochs')
ax2.set_ylabel('Accuracy')
ax2.legend()
plt.show()
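
If you’d rather not babysit the curves, a minimal sketch of the same idea uses Keras’s EarlyStopping callback (the patience value here is just an assumption, not something tuned for this dataset):

#Sketch: stop training when the validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
history = CNNmodel.fit(X_train, y_train, epochs=20, validation_data=(X_val, y_val), callbacks=[early_stop])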

Concluding Remarks

You now know how to create a CNN for use in audio classification. Start with a simple model, then add layers until you start seeing signs that the model is performing better on the training data than on the test data. Add Dropout and Max Pooling layers to prevent overfitting. Lastly, stop iterating when you see performance on the validation data start to decline relative to the training data.

Happy modeling!

Sources

Sarkar, Dipanjan (2021) Personal communication.

Chollet, F. Deep Learning with Python (2018), v. 361, New York: Manning.

Gervais, Nicolas (2021), code adapted from https://stackoverflow.com/questions/59241216/padding-numpy-arrays-to-a-specific-size, retrieved on 1/10/2021.

frenzykryger (2021) https://datascience.stackexchange.com/questions/41921/sparse-categorical-crossentropy-vs-categorical-crossentropy-keras-accuracy#:~:text=Use%20sparse%20categorical%20crossentropy%20when,0.5%2C%200.3%2C%200.2%5D, retrieved on 2/21/2021.
