Detecting Sounds with Deep Learning

How to convert audio to images and analyze it with ResNeSt

ResNeSt for Audio

Created together with Dmytro Karabash, Maxim Korotkov, Tony Chen.

Have you ever woken up not knowing what you heard, but knowing for sure that some sound wasn’t right?

Sound identification is one of the instincts that has kept human beings safe. Sounds play a significant role in our lives, from recognizing a nearby predator to being moved by music, a chorus of human voices, or the cry of a bird. Naturally, building audio classifiers is a valuable task.

Classifying the source of a sound is useful and already widespread. In music, classifiers identify a track’s genre. More recently, similar systems have begun to classify birdcalls, a task historically done by ornithologists. Their goal is to identify bird species, which is challenging because recordings are made in the field, often in noisy surroundings.

Recently, deep learning (DL) has become one of the most popular technologies for solving all kinds of tasks, thanks both to its accuracy and to improvements in computing hardware such as CPUs (Central Processing Units) and GPUs (Graphics Processing Units). The chart below shows how influential the deep learning market is expected to become across software, hardware, and services.

In this post, we take on the task of reading an audio file that contains zero or more birdcalls and using deep learning to identify which birds are present. The work is based on the Cornell Birdcall Identification Kaggle Challenge, where we earned a silver medal (top 2%).

How to deal with the data?

Countless articles explain how to load sound data, how to convert it into a spectrogram, and why that representation is so important. Here’s an example of a spectrogram of an Alder Flycatcher birdcall, along with a photo of the bird, just in case you are curious.

The speed of data processing is one of the keys to training a deep learning model. Despite the growth in computing power, audio processing remains expensive on a CPU. Moving the computation to a better-suited device such as a GPU, however, can make it roughly ten to one hundred times faster. We will show how to compute spectrograms quickly using torchlibrosa, a library that lets us build spectrograms on the GPU.

Build a spectrogram processor

torchlibrosa is a Python library that implements several of librosa’s audio processing functions in PyTorch, which lets them run on a GPU. Here’s an example of building a spectrogram extractor with torchlibrosa.

from torchlibrosa.stft import Spectrogram

# STFT with a 1024-sample window and a 320-sample hop;
# .cuda() moves the module so extraction runs on the GPU.
spectrogram_extractor = Spectrogram(
    win_length=1024, 
    hop_length=320
).cuda()
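
Since the benchmark later in this post measures a log-mel spectrogram, it is worth noting that torchlibrosa also provides a LogmelFilterBank module that can be chained after the spectrogram extractor. Below is a minimal sketch; the parameter values are illustrative choices matching the 32 kHz audio we load later, not our exact competition configuration.

from torchlibrosa.stft import LogmelFilterBank

# Maps the power spectrogram onto mel bands and takes the log.
logmel_extractor = LogmelFilterBank(
    sr=32000,     # must match the sample rate of the loaded audio
    n_fft=2048,   # must match the Spectrogram's n_fft (2048 by default)
    n_mels=64     # number of mel bands; an illustrative choice
).cuda()

Applying it to the output of spectrogram_extractor yields a (batch, 1, time_steps, mel_bins) tensor.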

Load audio data

We can load audio data with librosa, one of the most popular Python audio processing libraries.

import librosa

# Load the raw waveform, resampled to 32 kHz and mixed down to mono.
example, _ = librosa.load('example.wav', sr=32000, mono=True)

Process spectrogram

import torch

# Add a batch dimension and move the waveform to the GPU,
# then extract the spectrogram there.
raw_audio = torch.Tensor(example).unsqueeze(0).cuda()
spectrogram = spectrogram_extractor(raw_audio)  # (batch, 1, time_steps, freq_bins)

Benchmark speed

By adopting torchlibrosa, we can process audio data on the GPU. You may wonder how much faster the GPU is than the CPU. To find out, we picked an audio file from the publicly available data of the Cornell Birdcall Identification Kaggle Challenge and compared how long processing takes on each device. Testing on Colab for reproducibility, computing a log-mel spectrogram of roughly 5 minutes of audio is about 15x faster on the GPU than on the CPU.
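
As a rough illustration, here is a minimal sketch of how such a timing comparison could look. The time_extractor helper is hypothetical (not from our competition code); the torch.cuda.synchronize() calls are needed because GPU kernels run asynchronously.

import time
import torch

def time_extractor(extractor, audio, device):
    # Hypothetical helper: run one extraction on the given device and time it.
    extractor = extractor.to(device)
    audio = audio.to(device)
    if device == 'cuda':
        torch.cuda.synchronize()  # make sure transfers have finished
    start = time.time()
    _ = extractor(audio)
    if device == 'cuda':
        torch.cuda.synchronize()  # wait for GPU kernels to complete
    return time.time() - start

cpu_time = time_extractor(spectrogram_extractor, raw_audio, 'cpu')
gpu_time = time_extractor(spectrogram_extractor, raw_audio, 'cuda')
print(f'CPU: {cpu_time:.3f}s, GPU: {gpu_time:.3f}s')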

How to classify a sound?

Deep learning has shown brilliant performance in the audio domain: it can correctly catch numerous patterns of the target classes in time-series data. What matters even more here is the environment and the data. Recording locations such as fields or mountainsides produce plenty of noise that interferes with the birdcalls, and several birds can appear within one long recording. Consequently, we need to build a noise-robust, multi-label audio classifier.
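
Multi-label here means each clip can contain any number of species, so instead of a softmax over classes, each class gets an independent sigmoid probability, typically trained with binary cross-entropy. A minimal sketch (the shapes are illustrative, using the challenge’s 264 species):

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()  # per-class sigmoid + binary cross-entropy

logits = torch.randn(2, 264)        # e.g. a batch of 2 clips, 264 species
targets = torch.zeros(2, 264)
targets[0, [3, 17]] = 1.0           # clip 0 contains species 3 and 17
loss = criterion(logits, targets)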

We will now present the deep learning architecture used by our team (Dragonsong) in the Cornell Birdcall Identification Kaggle Challenge.

Architecture

We built a novel audio classifier architecture that effectively captures time-series features by combining CNN, RNN, and attention modules. Here is a brief sketch of the architecture we used in the challenge.

We convert the raw audio into a log-mel spectrogram as the input of our architecture and pass it through a ResNeSt50 backbone, an image classification architecture. We then feed the resulting features, which contain both spatial and temporal information, into RoI (Region of Interest) pooling and bi-GRU layers. These layers capture time-wise information while reducing the feature dimension, since extracting temporal features is pivotal for classifying multiple birdcalls in long audio. Finally, we pass the data into an attention module that scores each time step to find out at which time steps the birds are present.
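
To make the flow concrete, here is a minimal PyTorch sketch of such a pipeline. It is not our exact competition model: the BirdcallClassifier class and its dimensions are illustrative, the ResNeSt50 backbone is pulled from the authors’ torch.hub repository, and for simplicity the RoI pooling step is approximated by averaging over the frequency axis.

import torch
import torch.nn as nn

class BirdcallClassifier(nn.Module):
    # Illustrative sketch: log-mel input -> CNN backbone -> bi-GRU -> attention.
    def __init__(self, num_classes, backbone_dim=2048, rnn_dim=512):
        super().__init__()
        # ResNeSt50 from the authors' torch.hub repo, classification head removed.
        backbone = torch.hub.load('zhanghang1989/ResNeSt', 'resnest50', pretrained=True)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.gru = nn.GRU(backbone_dim, rnn_dim, batch_first=True, bidirectional=True)
        self.attention = nn.Linear(2 * rnn_dim, 1)             # scores each time step
        self.classifier = nn.Linear(2 * rnn_dim, num_classes)

    def forward(self, logmel):                        # (batch, 1, time, mel)
        x = logmel.repeat(1, 3, 1, 1)                 # backbone expects 3 channels
        x = self.backbone(x)                          # (batch, C, time', mel')
        x = x.mean(dim=3).transpose(1, 2)             # pool frequency -> (batch, time', C)
        x, _ = self.gru(x)                            # (batch, time', 2*rnn_dim)
        weights = torch.softmax(self.attention(x), dim=1)  # attention over time
        clip = (weights * x).sum(dim=1)               # weighted sum over time steps
        return self.classifier(clip)                  # (batch, num_classes) logits

A nice side effect of this design is that the attention weights tell us which time steps contributed most to the prediction, which is exactly what we need to locate birds within a long recording.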

Train the model

Building the deep learning architecture is not enough; how we train the model (a.k.a. the training recipe) is just as vital. To classify audio that contains various birdcalls against a noisy background, we mix several birdcalls into one clip and add noises such as white noise. To cover the many variations of birdcalls, we also augment the pitch and mask some audio frames using SpecAugment. A minimal sketch of two of these augmentations follows.
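
In the sketch below, the mixing weights and noise level are illustrative rather than our competition settings; the SpecAugmentation module comes from torchlibrosa and masks random time and frequency stripes of the spectrogram.

import numpy as np
from torchlibrosa.augmentation import SpecAugmentation

def mix_with_noise(call_a, call_b, noise_level=0.05):
    # Mix two equal-length waveforms and add white noise (weights are illustrative).
    mixed = 0.6 * call_a + 0.4 * call_b
    mixed += noise_level * np.random.randn(len(mixed)).astype(np.float32)
    return mixed

# Masks random time and frequency stripes of a (batch, 1, time, mel) spectrogram.
spec_augmenter = SpecAugmentation(
    time_drop_width=64, time_stripes_num=2,
    freq_drop_width=8, freq_stripes_num=2,
)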

Here is a short example (a mix of an Alder Flycatcher and an American Avocet) with these augmentations applied.

Summary

Have you ever woken up not knowing what you heard, but knowing for sure that some sound wasn’t right? With great algorithms, machines will be able to identify what it was and help you sleep better. Stay tuned!

Originally published at YourDataBlog.

