
Recurrent Neural Nets
RNNs, or recurrent neural nets, are a type of deep learning algorithm that can remember sequences.
What kind of sequences?
- Handwriting/speech recognition
- Time series
- Text for natural language processing
- Things that depend on a previous item
Does that mean audio?
Yes. Unless the audio is a random stream of garbage (not the band), audio information tends to follow a pattern.
Behold the first two measures of Beethoven’s Moonlight Sonata:

Pretty repetitive! How do you think he kept writing music after he lost his hearing? Pattern recognition and memory.
Also genius.
This article explains how to train an RNN to classify species based on audio information.
The data for this example are bird and frog recordings from the Kaggle competition Rainforest Connection Species Audio Detection. They’re adorable.

To get started, load the necessary imports:
import pandas as pd
import numpy as np
import os
import librosa
import librosa.display
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import LSTM, Dense, Dropout
import warnings
warnings.filterwarnings('ignore')
Then load the dataframe:
os.chdir('/kaggle/input/rfcx-species-audio-detection')
df = pd.read_csv('train_tp.csv')
This dataset comes as a CSV file with the names of the audio files listed under recording_id, labels under species_id, and the start/end times of each audio sample under t_min and t_max:
df.head()

Use the librosa package to load and display an audio file like this:
sample_num = 3 #pick a file to display
filename = df.recording_id[sample_num] + '.flac' #get the filename
tstart = df.t_min[sample_num] #beginning time of the signal
tend = df.t_max[sample_num] #end time of the signal
y, sr = librosa.load('train/' + filename)
librosa.display.waveplot(y, sr=sr, x_axis='time', color='purple', offset=0.0)

Features for modeling
Librosa has great tutorials on how to extract features here. For RNNs, I found that the best features were the Mel-frequency cepstral coefficients (MFCCs), a spectral feature of sound. You can calculate them like this:
hop_length = 512 #the default spacing between frames
n_fft = 255 #number of samples
#cut the sample to the relevant times
y_cut=y[int(round(tstart*sr)):int(round(tend*sr))]
MFCCs = librosa.feature.mfcc(y_cut, n_fft=n_fft,hop_length=hop_length,n_mfcc=128)
fig, ax = plt.subplots(figsize=(20,7))
librosa.display.specshow(MFCCs,sr=sr, cmap='cool',hop_length=hop_length)
ax.set_xlabel('Time', fontsize=15)
ax.set_title('MFCC', size=20)
plt.colorbar()
plt.show()

Extract features and labels for all the files and store them in a NumPy array:
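The get_features function below relies on a padding helper that brings every MFCC to the same fixed size; it isn't shown in the original code. A minimal sketch, following the approach in the Stack Overflow answer cited in the sources, might look like this:
def padding(array, xx, yy):
    """Zero-pad (or crop) a 2D array so it has shape (xx, yy), keeping it roughly centered."""
    #crop first in case the array is already larger than the target
    array = array[:xx, :yy]
    h, w = array.shape
    #split the remaining padding evenly between both sides of each axis
    a = (xx - h) // 2
    aa = xx - h - a
    b = (yy - w) // 2
    bb = yy - w - b
    return np.pad(array, pad_width=((a, aa), (b, bb)), mode='constant')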
def get_features(df_in):
    features = [] #list to save features
    labels = []   #list to save labels
    for index in range(0, len(df_in)):
        #get the filename
        filename = df_in.iloc[index]['recording_id'] + '.flac'
        #start of the signal
        tstart = df_in.iloc[index]['t_min']
        #end of the signal
        tend = df_in.iloc[index]['t_max']
        #save the label
        species_id = df_in.iloc[index]['species_id']
        #load the file
        y, sr = librosa.load('train/' + filename, sr=28000)
        #cut the file from tstart to tend
        y_cut = y[int(round(tstart*sr)):int(round(tend*sr))]
        #compute the MFCCs and pad them to the fixed 128x1000 shape the model expects
        data = np.array([padding(librosa.feature.mfcc(y_cut,
                         n_fft=n_fft, hop_length=hop_length, n_mfcc=128), 128, 1000)])
        features.append(data)
        labels.append(species_id)
    output = np.concatenate(features, axis=0)
    return np.array(output), labels
X,y=get_features(df)
Normalize the data and cast it into a NumPy array:
X = np.array((X-np.min(X))/(np.max(X)-np.min(X)))
X = X/np.std(X)
y = np.array(y)
Extract training, test and validation datasets
#Split twice to get the validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=123)
#Print the shapes
X_train.shape, X_test.shape, X_val.shape, len(y_train), len(y_test), len(y_val)

Create an RNN
In this example model, a Long Short-Term Memory (LSTM) unit is the portion that does the remembering, the Dropout layers randomly set a fraction of their inputs to zero to guard against overfitting, and the Dense layers are fully connected hidden layers that give the model the degrees of freedom it has to fit the data. The more complex the data, the more degrees of freedom the model needs, all the while taking care to avoid overfitting (more on this later). The last Dense layer outputs the 24 species that the model is supposed to classify the audio recordings into.
A sample RNN model architecture

In TensorFlow, you can create the above RNN model like this:
input_shape=(128,1000)
model = keras.Sequential()
model.add(LSTM(128,input_shape=input_shape))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(48, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(24, activation='softmax'))
model.summary()
The activation functions add nonlinearity to the model. Here, the relu function is used, which zeros out negative values. The last Dense layer's activation function is softmax, which outputs a probability for each class. TensorFlow has other activation functions that you can read about here.
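As a quick illustration of what these activations do (plain NumPy, not part of the model), relu clips negative values to zero while softmax turns raw scores into probabilities that sum to one:
z = np.array([-2.0, 0.5, 3.0])
relu_out = np.maximum(z, 0)                 #[0.0, 0.5, 3.0]: negatives become 0
softmax_out = np.exp(z) / np.exp(z).sum()   #approx. [0.006, 0.075, 0.918]: sums to 1
print(relu_out, softmax_out)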
The input shape can be confusing because even though it appears to be 2D, it's actually 3D. Because of the parameters chosen when they were created, the shape of these MFCCs happens to be 128 in height and 1000 in length, and there are as many of them as there are audio files. If we only extracted features for the 5 audio files pictured in the dataframe.head() figure, the shape of the input would be 5x128x1000. That leading dimension is the number of samples, which Keras feeds to the model in batches; the batch size is not included in the input shape.
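As a quick sanity check (a minimal sketch, assuming the model and data defined above), you can compare the model's expected input shape with the shape of the training data:
#the batch dimension shows up as None in the model,
#while the data carries the number of samples in its first axis
print(model.input_shape)   #(None, 128, 1000)
print(X_train.shape)       #(number of training samples, 128, 1000)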
Compile the model
model.compile(optimizer='adam',loss='SparseCategoricalCrossentropy',metrics=['acc'])
The Adam optimizer manages the learning rate for stochastic gradient descent and is a good one to start with. The loss function is SparseCategoricalCrossentropy, which is used when each sample belongs to exactly one label (as opposed to more than one) and it isn't binary classification. That's the case for this problem, where each audio sample belongs to one species.
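If you prefer explicit objects to string identifiers, the same compile call can be written like this (a sketch; the learning rate shown is simply Adam's default):
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss=keras.losses.SparseCategoricalCrossentropy(),
    metrics=['acc'],
)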
Fit the model
history = model.fit(X_train, y_train, epochs=50, batch_size=72,
validation_data=(X_val, y_val), shuffle=False)
Some words on overfitting

BUT you have to overfit temporarily in order to know where the boundary between overfitting and underfitting is (F. Chollet, 2018).
Start with the simplest model and work your way up
Try this:
input_shape=(128,1000)
model = keras.Sequential()
model.add(LSTM(NUM,input_shape=input_shape))
model.add(Dense(24, activation='softmax'))
model.summary()
where NUM is some number larger than the size of your output layer (24 here), and add layers until your model starts to overfit the data.
How do you know when this happens?
Evaluate your model on the training and validation sets:
- If performance on the training and test sets differs noticeably (for example, 99% training accuracy versus 89% test accuracy), you have overfit the data.
- When the validation measure of choice (accuracy in this case) begins to decrease, stop iterating. In the graphs below, this happens around 50 epochs.
#Adapted from Deep Learning with Python by Francois Chollet, 2018
history_dict=history.history
loss_values=history_dict['loss']
acc_values=history_dict['acc']
val_loss_values = history_dict['val_loss']
val_acc_values=history_dict['val_acc']
epochs=range(1,51)
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(15,5))
ax1.plot(epochs,loss_values,'co',label='Training Loss')
ax1.plot(epochs,val_loss_values,'m', label='Validation Loss')
ax1.set_title('Training and validation loss')
ax1.set_xlabel('Epochs')
ax1.set_ylabel('Loss')
ax1.legend()
ax2.plot(epochs,acc_values,'co', label='Training accuracy')
ax2.plot(epochs,val_acc_values,'m',label='Validation accuracy')
ax2.set_title('Training and validation accuracy')
ax2.set_xlabel('Epochs')
ax2.set_ylabel('Accuracy')
ax2.legend()
plt.show()
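Rather than eyeballing the curves, you can also let Keras stop training for you. A minimal sketch using the EarlyStopping callback (the patience and epoch values here are illustrative, not from the original run):
#stop when validation accuracy has not improved for 5 epochs and keep the best weights
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_acc', patience=5,
                                              restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=100, batch_size=72,
                    validation_data=(X_val, y_val), shuffle=False,
                    callbacks=[early_stop])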

Check how well the model predicts using a confusion matrix:
If all of the entries line up on the diagonal of the matrix, the model has made perfect predictions on the test set. Anything else has been misclassified.
TrainLoss, Trainacc = model.evaluate(X_train, y_train)
TestLoss, Testacc = model.evaluate(X_test, y_test)
y_pred = model.predict(X_test)
print('Confusion matrix:', tf.math.confusion_matrix(y_test, np.argmax(y_pred, axis=1)))

And there you have it
You now know how to create an RNN using audio data: start with a simple model and add layers until it is able to predict the data to the best of its ability. Modify the architecture until your model begins to overfit the data to understand where this boundary is, then go back and remove layers. Look for discrepancies in performance between the training and test data and add Dropout layers to prevent overfitting to the training data. Look for a decrease in performance on the validation data to know when to stop iterating.
Happy modeling!
Sources
Sarkar, Dipanjan (2021). Personal communication.
Chollet, F. (2018). Deep Learning with Python. New York: Manning.
Gervias, Nicolas (2021). Code adapted from https://stackoverflow.com/questions/59241216/padding-numpy-arrays-to-a-specific-size, retrieved on 1/10/2021.
frenzykryger (2021). https://datascience.stackexchange.com/questions/41921/sparse-categorical-crossentropy-vs-categorical-crossentropy-keras-accuracy#:~:text=Use%20sparse%20categorical%20crossentropy%20when,0.5%2C%200.3%2C%200.2%5D, retrieved on 2/21/2021.