Build a deep neural network for the keyword spotting (KWS) task with nnAudio GPU audio processing

Dorien Herremans
Towards Data Science
8 min read · Jul 13, 2022


Dealing with audio can complicate any machine learning task. In this tutorial, we go over how to build a neural network in PyTorch that is fed audio files directly, which are converted on the fly into finetunable spectrograms. To do this, we use nnAudio [1] together with PyTorch.

This tutorial will build a classifier on the Google Speech Commands Dataset v2 for the keyword spotting (KWS) task. KWS is a sound classification problem: our model will predict the word (text) that matches an input sound file. There are 12 output classes in this KWS task. We chose to work with 10 of the 35 available words; the remaining 25 words are grouped into the class ‘unknown’, and a class ‘silence’ is created from background noise.

The Google Speech Commands Dataset v2 contains 105,829 utterances covering 35 single words in total, including ‘Yes’, ‘No’, ‘Up’, ‘Down’, ‘Left’, ‘Right’, ‘On’, ‘Off’, ‘Stop’, and ‘Go’. Each utterance is 1 second long.

Before deciding on an architecture for this problem, it’s important to consider how we will process the audio. Will we use spectrogram images stored on our computer, or WaveNet, or something else? To answer this, let’s go back a few years to when Cheuk Kin Wai (Raven) was my PhD student at SUTD. Raven wanted to build an audio transcription model. He quickly noticed that extracting spectrograms and storing them on disk before training the model was cumbersome and slow, and it made tweaking the spectrogram settings impractical. Hence, he developed the nnAudio [1] library, a useful open source tool that loads audio directly into a PyTorch layer in which it is dynamically converted to a spectrogram representation.

nnAudio uses PyTorch 1D convolutional layers as its backend, which speeds up the waveform-to-spectrogram conversion. It also enables the basis functions of the discrete Fourier transform to be made trainable, which means they can be optimized for the task at hand. This is possible because the short-time Fourier transform (STFT) and Mel basis functions are implemented as the first layer of the neural network (the front-end layer). During training, the model can back-propagate gradients all the way to this front-end, so the most important features of the audio signal can be ‘captured’ through custom-trained spectrograms. As we will see, this gives us a nice and quick way to train an audio classification model.
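
To make this concrete, here is a minimal, self-contained sketch of the underlying idea (an illustration only, not nnAudio’s actual implementation): the DFT basis functions are stored as Conv1d-style kernels, so setting requires_grad=True lets backpropagation update them.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySTFTFrontEnd(nn.Module):
    """Toy front-end: DFT basis functions stored as conv kernels (illustrative only)."""
    def __init__(self, n_fft=480, hop_length=160, trainable=False):
        super().__init__()
        n = np.arange(n_fft)
        k = np.arange(n_fft // 2 + 1).reshape(-1, 1)
        # Cosine/sine DFT kernels with shape (n_fft//2+1, 1, n_fft), i.e. Conv1d weights
        wcos = torch.tensor(np.cos(2 * np.pi * k * n / n_fft), dtype=torch.float32).unsqueeze(1)
        wsin = torch.tensor(np.sin(2 * np.pi * k * n / n_fft), dtype=torch.float32).unsqueeze(1)
        # When trainable=True, the kernels receive gradients and are finetuned with the rest of the model
        self.wcos = nn.Parameter(wcos, requires_grad=trainable)
        self.wsin = nn.Parameter(wsin, requires_grad=trainable)
        self.hop_length = hop_length

    def forward(self, x):                 # x: (batch, samples) raw waveform
        x = x.unsqueeze(1)                # -> (batch, 1, samples)
        real = F.conv1d(x, self.wcos, stride=self.hop_length)
        imag = F.conv1d(x, self.wsin, stride=self.hop_length)
        return torch.sqrt(real ** 2 + imag ** 2)   # magnitude spectrogram: (batch, freq_bins, frames)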

The keyword spotting (KWS) tutorial below consists of four parts:

  • Part 1: Loading the dataset & simple linear model
  • Part 2: Training a Linear model with Trainable Basis Functions
  • Part 3: Evaluation of the resulting model
  • Part 4: Using more complex non-linear models

Let’s start!

Part 1: Loading the dataset & simple linear model

In this tutorial, we will work with spectrograms. In audio deep learning, spectrograms play an important role: they connect audio files to deep learning models. Front-end tools such as librosa and nnAudio convert audio waveforms (time domain) to spectrograms (time-frequency domain), which can then be processed by the model in much the same way as images.

Loading the dataset

First, we need to access the data from Google. We use AudioLoader to access the 12-class Speech Commands dataset.

AudioLoader is a one-click-ready audio loader that lets you download, unzip, split, and resample a dataset in a single step. It currently supports several popular datasets such as MusicNet, MusdbHQ, and TIMIT.

In a traditional setup, we would first extract spectrograms, save them to our computer, and then load these images into the model. This is slow, requires disk space, and makes it hard to tune spectrogram features to the task at hand. nnAudio solves these issues by calculating spectrograms on-the-fly as part of the neural network.

nnAudio can calculate different types of spectrograms such as the short-time Fourier transform (STFT), Mel-spectrogram, and constant-Q transform (CQT) by leveraging PyTorch and GPU processing. Processing audio on the GPU shortens the computation time by up to 100x, as you can see in the figure below from the original nnAudio paper [1].

Processing times to compute different types of spectrograms with nnAudio GPU (green), nnAudio CPU (orange), and librosa (blue) represented on a logarithmic scale [1].

Defining the model

We will first demonstrate our workflow with a simple single-layer network. The model is defined as Linearmodel_nnAudio and inherits from the class SpeechCommand (a LightningModule). You can refer to the section on setting up the Lightning Module for more details on the parent class.

The LightningModule is a class from PyTorch Lightning. It helps you organise your PyTorch code into six sections, including the training loop (training_step), the test loop (test_step), and the optimizers and LR schedulers (configure_optimizers).
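
As an illustrative sketch of what such a parent class could look like (the actual SpeechCommand class lives in the tutorial repository [2]; the batch key names and the optimizer settings below are assumptions):

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class SpeechCommand(pl.LightningModule):
    """Shared training/test logic for the KWS models in this tutorial (sketch)."""

    def training_step(self, batch, batch_idx):
        x, labels = batch['waveforms'], batch['labels']   # key names are assumptions
        logits = self(x)                                   # the child class defines forward()
        loss = F.cross_entropy(logits, labels)
        self.log('Train/Loss', loss)
        return loss

    def test_step(self, batch, batch_idx):
        x, labels = batch['waveforms'], batch['labels']
        logits = self(x)
        acc = (logits.argmax(dim=-1) == labels).float().mean()
        self.log('Test/Loss', F.cross_entropy(logits, labels))
        self.log('Test/acc', acc)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)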

In this project, we opted to work with Mel-spectrograms, as their frequency bins are scaled to match the human hearing range. Hence, they may form a good representation of the features we humans pick up.

This simple model takes sound files (x) as input. We then use nnAudio.features.mel.MelSpectrogram() inside Linearmodel_nnAudio as the first layer of the neural network, which converts the audio waveforms into spectrograms.

For demonstration purposes, we only define a simple model with one additional linear layer here. The output of this KWS classification task consists of 12 classes, hence the output size of the layer should be 12.

The resulting code for nnAudio is below. Note that the nnAudio.features.mel.MelSpectrogram() function has many additional parameters which allow you to have fine-grained control over the spectrograms if desired.

Linear model with nnAudio
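
A minimal sketch of what Linearmodel_nnAudio looks like (the exact code is in the tutorial repository [2]; the hyperparameters sr=16000, n_fft=480, hop_length=160 and n_mels=40 are assumptions consistent with the kernel shapes reported in Part 3):

import torch
import torch.nn as nn
from nnAudio.features.mel import MelSpectrogram

class Linearmodel_nnAudio(SpeechCommand):
    def __init__(self):
        super().__init__()
        # Layer 1: waveform -> Mel-spectrogram, computed on the fly (on the GPU if available)
        self.mel_layer = MelSpectrogram(sr=16000, n_fft=480, hop_length=160, n_mels=40,
                                        trainable_mel=False, trainable_STFT=False)
        # Layer 2: flattened spectrogram -> 12 output classes
        # 101 time frames for a 1-second clip at the (assumed) settings above
        self.linearlayer = nn.Linear(40 * 101, 12)

    def forward(self, x):                  # x: (batch, 16000) raw waveforms
        spec = self.mel_layer(x)           # (batch, 40, 101)
        spec = torch.log(spec + 1e-10)     # log-compress the magnitudes
        return self.linearlayer(spec.flatten(1))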

Training the model for 1 epoch

Below is the code to train our simple linear model for 1 epoch with PyTorch Lightning.

Training the Linearmodel with nnAudio
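
Roughly, the training call looks like this (train_set is assumed to be the Speech Commands training split prepared with AudioLoader; the batch size is an arbitrary choice, and the Trainer arguments follow the current PyTorch Lightning API):

import pytorch_lightning as pl
from torch.utils.data import DataLoader

model = Linearmodel_nnAudio()
train_loader = DataLoader(train_set, batch_size=100, shuffle=True)   # train_set from AudioLoader (assumed)

# Checkpoints and logs are written to the lightning_logs folder by default
trainer = pl.Trainer(max_epochs=1, accelerator='gpu', devices=1)     # use accelerator='cpu' if no GPU
trainer.fit(model, train_loader)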

Training the above model takes about 17 seconds per epoch, which is roughly 95x faster than if we were to use librosa.

Part 2: Training a Linear model with Trainable Basis Functions

In this section, we will demonstrate how to use nnAudio’s trainable basis functions to build a powerful classifier in which the spectrograms are actually finetuned to the task at hand during backpropagation.

Setting up the basis functions

In nnAudio, we can make the Fourier kernels trainable, which allows them to be tweaked to our specific task during backpropagation.

You can modify the Mel-spectrogram arguments in the function below: trainable_mel and trainable_STFT control whether the Mel and STFT bases are trainable.
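
For example (the other arguments are kept the same as in Part 1 and remain assumptions):

from nnAudio.features.mel import MelSpectrogram

# Both the Mel filter bank and the underlying STFT kernels now become trainable
# parameters that receive gradients during backpropagation.
mel_layer = MelSpectrogram(sr=16000, n_fft=480, hop_length=160, n_mels=40,
                           trainable_mel=True,      # finetune the Mel filter bank
                           trainable_STFT=True)     # finetune the Fourier (STFT) kernels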

Training the model

We use a fairly standard training loop, during which the trained model weights are saved in the lightning_logs folder. In the next step of the tutorial, we will take a closer look at the performance of this model.

Part 3: Evaluation of the resulting model

You have trained a linear model; now it is time to evaluate its performance and do some visualisations.

Loading pre-trained weights into the model

Every time you train a model, PyTorch Lightning saves the trained weights in the lightning_logs folder.
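
A sketch of loading those weights back into the model (the checkpoint path is a placeholder; adjust it to your own lightning_logs run):

import torch

model = Linearmodel_nnAudio()
checkpoint = torch.load('lightning_logs/version_0/checkpoints/xxxx.ckpt', map_location='cpu')
model.load_state_dict(checkpoint['state_dict'])
model.eval()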

Evaluating the model performance

The model performance on the KWS task can be evaluated using the following metrics on the test set (a sketch for computing the last two follows the list):

  • Test/Loss (cross-entropy)
  • Test/acc (accuracy)
  • F1 scores
  • Confusion matrix
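
Test/Loss and Test/acc correspond to the values logged in test_step. For the F1 scores and the confusion matrix, a simple sketch using scikit-learn on the collected test predictions could look like this (test_loader, the model, and the batch key names are assumed from the earlier parts):

import torch
from sklearn.metrics import f1_score, confusion_matrix

all_preds, all_labels = [], []
with torch.no_grad():
    for batch in test_loader:                                  # DataLoader over the test split (assumed)
        x, labels = batch['waveforms'], batch['labels']        # key names are assumptions
        preds = model(x).argmax(dim=-1)
        all_preds.append(preds)
        all_labels.append(labels)

all_preds = torch.cat(all_preds).cpu().numpy()
all_labels = torch.cat(all_labels).cpu().numpy()

print('F1 per class:', f1_score(all_labels, all_preds, average=None))
print('Confusion matrix:\n', confusion_matrix(all_labels, all_preds))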

The final accuracy on the test set is displayed below for different settings of the trainable kernels.

Test/acc on the KWS task for the linear model with different trainable basis function settings

When looking at these results, keep in mind that we have a 12-class problem, so random guessing would yield about 8.3% accuracy. From the table above, we can see that using trainable basis functions boosts the accuracy on the KWS task by 14.2 percentage points compared to the plain linear model. Further tweaking of the hyperparameters would no doubt increase performance even more.

Visualising the result (bonus)

We can visualise some of the learned kernels in the first (nnAudio) layer, since the weights are stored in our checkpoint file. The structure inside the checkpoint file looks like this:

weight = torch.load('xxxx/checkpoints/xxxx.ckpt')
├── epoch
├── global_step
├── pytorch-lightning_version
├── state_dict
│   ├── mel_layer.mel_basis
│   ├── mel_layer.stft.wsin
│   ├── mel_layer.stft.wcos
│   ├── mel_layer.stft.window_mask
│   ├── linearlayer.weight
│   └── linearlayer.bias
├── callbacks
├── optimizer_states
└── lr_schedulers

state_dict is one of the dictionary keys in the checkpoint file. It is an OrderedDict that contains the trained weights of the basis functions (e.g. Mel bins, STFT kernels) and of the layers (here, the linear layer).

Visualising the Mel bins

The shape of mel_layer.mel_basis should be [n_mels, n_fft/2 + 1], where n_mels is the number of Mel bins and n_fft is the length of the windowed signal after zero-padding. In this tutorial example, the shape of mel_layer.mel_basis is [40, 241].
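
A sketch of how to pull these Mel filters out of the checkpoint and plot them (the key names follow the state_dict tree above; the path is a placeholder):

import torch
import matplotlib.pyplot as plt

ckpt = torch.load('lightning_logs/version_0/checkpoints/xxxx.ckpt', map_location='cpu')
mel_basis = ckpt['state_dict']['mel_layer.mel_basis']   # shape [40, 241]

plt.figure(figsize=(8, 4))
plt.plot(mel_basis.numpy().T)        # one curve per Mel bin, plotted over the 241 STFT frequency bins
plt.xlabel('STFT frequency bin')
plt.ylabel('weight')
plt.title('Mel filter bank from the checkpoint')
plt.show()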

Here is the comparison of the non-trainable Mel bins and trainable Mel bins at 200 epochs. The 40 Mel bins are each displayed in a different colour below:

trainable_mel = False
trainable_mel = True

Notice how much they vary, and how they become attuned to the frequencies and patterns that matter for our specific task.

Visualizing the STFT

The shape of mel_layer.stft.wsin and mel_layer.stft.wcos should be [(n_fft/2+1), 1, n_fft]. In this tutorial example, the shape for both is [241, 1, 480].
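
Similarly, a sketch for plotting individual Fourier kernels from the checkpoint (here the sine part; the kernel indices are illustrative and the path is again a placeholder):

import torch
import matplotlib.pyplot as plt

ckpt = torch.load('lightning_logs/version_0/checkpoints/xxxx.ckpt', map_location='cpu')
wsin = ckpt['state_dict']['mel_layer.stft.wsin']        # shape [241, 1, 480]

fig, axes = plt.subplots(2, 2, figsize=(10, 5))
for ax, k in zip(axes.flat, [2, 10, 20, 50]):           # the 2nd, 10th, 20th and 50th kernels
    ax.plot(wsin[k, 0, :].numpy())
    ax.set_title('Fourier kernel #{} (sine part)'.format(k))
plt.tight_layout()
plt.show()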

Here is the comparison of the non-trainable STFT and the trainable STFT at 200 epochs. From left to right, top to bottom, they are the 2nd, 10th, 20th, and 50th Fourier kernels respectively.

trainable_STFT = False
trainable_STFT = True

Part 4: Using Trainable Basis Functions with more complex non-linear models

After following parts 1–3 of the tutorial, you now have a big-picture overview of how to use nnAudio with trainable basis functions. In this last part, you will see how easily the model can be adapted to any type of deep, complex neural network that fits your needs. The only changes required are in the model definition, as per the example below, where we use a broadcasting-residual network (BC-ResNet) after the nnAudio spectrogram layer.
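
The full BC-ResNet version is in the tutorial repository [2]; as a simplified stand-in, the sketch below shows the general pattern of swapping the single linear layer for a deeper back-end (here a small convolutional network) while keeping the nnAudio front-end:

import torch
import torch.nn as nn
from nnAudio.features.mel import MelSpectrogram

class DeepKWSModel(SpeechCommand):
    """Same nnAudio front-end as before, but with a deeper back-end (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        self.mel_layer = MelSpectrogram(sr=16000, n_fft=480, hop_length=160, n_mels=40,
                                        trainable_mel=True, trainable_STFT=True)
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, 12))

    def forward(self, x):                             # x: (batch, samples)
        spec = torch.log(self.mel_layer(x) + 1e-10)   # (batch, 40, frames)
        return self.backbone(spec.unsqueeze(1))       # add a channel dimension for Conv2d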

Check out a complete, more complex tutorial example here!

Conclusion

We have walked through a step-by-step tutorial for building a keyword spotting model on the Google Speech Commands Dataset v2 using nnAudio’s [1] trainable basis functions. This approach can easily be adapted to any audio classification task. Let us know your feedback and what you are doing with the code in the comments!

Acknowledgements
Thanks to my research assistant Heung Kwan Yee for preparing the code snippets and to Cheuk Kin Wai for creating nnAudio!

References

  1. K. W. Cheuk, H. Anderson, K. Agres and D. Herremans, “nnAudio: An on-the-Fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks,” in IEEE Access, vol. 8, pp. 161981–162003, 2020, doi: 10.1109/ACCESS.2020.3019084.
  2. For this tutorial source code, you can refer to https://github.com/heungky/nnAudio_tutorial
  3. For the nnAudio source code, you can refer to https://github.com/KinWaiCheuk/nnAudio
  4. For the nnAudio documentation, you can refer to https://kinwaicheuk.github.io/nnAudio/index.html
  5. Heung, K. Y., Cheuk, K. W., & Herremans, D. (2022). Understanding Audio Features via Trainable Basis Functions. arXiv preprint arXiv:2204.11437. (Research article on trainable basis functions)

