The Kaggle Blueprints

Welcome to another edition of "The [Kaggle](https://www.kaggle.com/) Blueprints", where we will analyze Kaggle competitions’ winning solutions for lessons we can apply to our own Data Science projects.
This edition will review the techniques and approaches from the "BirdCLEF 2022" competition, which ended in May 2022.
Problem Statement: Audio Classification with Domain Shift
The objective of the "BirdCLEF 2022" competition was to identify Hawaiian bird species by sound. The competitors were given short audio files of single bird calls and were asked to predict whether a specific bird was present in a longer recording.
In contrast to a vanilla audio classification problem, this competition added flavor with the following challenges:
- Domain shift – The training data consisted of clean audio recordings of a single bird call, separated from any additional sounds (a few seconds each, with varying lengths). However, the test data consisted of "unclean", longer (1 minute) recordings taken "in the wild" that contained sounds other than bird calls (e.g., wind, rain, other animals, etc.).

- Class imbalance/few-shot learning – As some birds are less common than others, we are dealing with a long-tailed class distribution in which some birds have only a single sample.

Insert your data here! – To follow along with this article, your dataset should look something like this:
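A minimal sketch of such a dataframe (the file paths and column names are placeholders; the same `file_path` and `class_0` … `class_N` columns are assumed in the code examples below):

```python
import pandas as pd

# Hypothetical layout: one row per audio file with its file path
# and one-hot encoded class labels (column names are placeholders)
df = pd.DataFrame({
    "file_path": ["audio/sample_0.wav", "audio/sample_1.wav"],
    "class_0":   [1, 0],
    "class_1":   [0, 1],
})
```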

Approaching Audio Classification as an Image Classification Problem with Deep Learning
A popular approach among competitors to this audio classification problem was to:
- Convert the audio classification problem into an image classification problem by turning the waveform audio into a Mel spectrogram and applying a Deep Learning model
- Apply data augmentations to the audio data, both in waveform and in spectrogram form, to tackle the domain shift and class imbalance
- Fine-tune a pre-trained image classification model to tackle the class imbalance
This article will use PyTorch (version 1.13.0) as the Deep Learning framework, together with [torchaudio](https://pytorch.org/audio/stable/index.html) (version 0.13.0) and [librosa](https://librosa.org/doc/main/index.html) (version 0.10.0) for audio processing. Additionally, we will be using [timm](https://timm.fast.ai/) (version 0.6.12) for fine-tuning pre-trained image models.
```python
# Deep Learning framework
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.optim import lr_scheduler
from torch.utils.data import Dataset, DataLoader

# Audio processing
import torchaudio
import torchaudio.transforms as T
import librosa

# Pre-trained image models
import timm
```
Preparations: Getting Familiar with Audio Data
Before getting started with solving an audio classification problem, let’s first get familiar with working with audio data. You can load an audio file and its sampling rate from different file formats (e.g., .wav, .ogg, etc.) with the `load()` function from either the `torchaudio` or the `librosa` library.
```python
PATH = "audio_example.wav"

# Load a sample audio file with torchaudio
original_audio, sample_rate = torchaudio.load(PATH)

# Load a sample audio file with librosa
original_audio, sample_rate = librosa.load(PATH,
                                           sr = None) # Gotcha: set sr to None to keep the original sampling rate. Otherwise, librosa resamples to 22050 Hz by default
```
If you want to listen to the loaded audio directly in a Jupyter notebook for exploration, the following code will provide you with an audio player.
```python
# Play the audio in a Jupyter notebook
from IPython.display import Audio

Audio(data = original_audio, rate = sample_rate)
```

The [librosa](https://librosa.org/doc/main/index.html) library also provides various methods to quickly display the audio data for exploration purposes. If you used [torchaudio](https://pytorch.org/audio/stable/index.html) to load the audio file, make sure to convert the tensors to NumPy arrays first, as sketched below.
```python
import librosa.display as dsp

dsp.waveshow(original_audio, sr = sample_rate);
```
![Original audio data of the word "stop" in waveform from the "Speech Commands" dataset [0]](https://towardsdatascience.com/wp-content/uploads/2023/04/08_PUF3tkRnIP3hKg.png)
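If you loaded the file with [torchaudio](https://pytorch.org/audio/stable/index.html) instead, the waveform is a 2D tensor of shape (channels, samples); here is a minimal sketch of preparing it for `waveshow`:

```python
# torchaudio returns a (channels, samples) tensor:
# average the channels to mono and convert to a NumPy array for librosa
waveform, sample_rate = torchaudio.load(PATH)
dsp.waveshow(waveform.mean(dim = 0).numpy(), sr = sample_rate);
```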
Step 1: Convert the Audio Classification Problem to an Image Classification Problem
A popular method to model audio data with a Deep Learning model is to convert the "computer hearing" problem to a computer vision problem [2]. Specifically, the waveform audio is converted to a Mel spectrogram (which is a type of image) as shown below.

Usually, you would use a Fast Fourier Transform (FFT) to computationally convert an audio signal from the time domain (waveform) to the frequency domain (spectrogram).
However, the FFT gives you the overall frequency components of the entire audio signal at once. Thus, you lose the time information when converting the audio data from the time domain to the frequency domain.
Instead of the FFT, you can use the Short-Time Fourier Transform (STFT) to preserve the time information. The STFT is a variant of the FFT that breaks up the audio signal into smaller sections by using a sliding time window. It takes the FFT on each section and then combines them.
The two key parameters of the STFT are:
- `n_fft` – length of the sliding window (default: 2048)
- `hop_length` – number of samples by which to slide the window (default: 512). The `hop_length` will directly impact the resulting image size. If your audio data has a fixed length and you want to convert the waveform to a fixed image size, you can set `hop_length = audio_length // (image_size[1] - 1)`

Next, you will convert the amplitude to decibels and bin the frequencies according to the Mel scale. For this purpose, `n_mels` is the number of frequency bands (Mel bins). This will be the height of the resulting spectrogram.
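Putting these parameters together, here is a minimal sketch of the waveform-to-Mel-spectrogram conversion with torchaudio (the values simply mirror the defaults discussed above; `audio` is assumed to be a 1D waveform tensor):

```python
# Convert a 1D waveform tensor to a Mel spectrogram and rescale to decibels
mel_transform = T.MelSpectrogram(sample_rate = sample_rate,
                                 n_fft = 2048,      # length of the sliding window
                                 hop_length = 512,  # step size of the sliding window
                                 n_mels = 128)      # number of Mel bins (spectrogram height)
to_db = T.AmplitudeToDB()

melspec = to_db(mel_transform(audio))
# melspec.shape -> (n_mels, num_frames), with num_frames roughly num_samples / hop_length
```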

For an in-depth explanation of the Mel spectrogram, I recommend this article:
Below you can see an example PyTorch `Dataset`, which loads an audio file and converts the waveform to a Mel spectrogram after some preprocessing steps.
```python
class AudioDataset(Dataset):
    def __init__(self,
                 df,
                 audio_length,
                 target_sample_rate = 32000,
                 wave_transforms = None,
                 spec_transforms = None):
        self.df = df
        self.file_paths = df['file_path'].values
        self.labels = df[['class_0', ..., 'class_N']].values
        self.target_sample_rate = target_sample_rate
        self.num_samples = target_sample_rate * audio_length
        self.wave_transforms = wave_transforms
        self.spec_transforms = spec_transforms

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        # Load audio from file to waveform
        audio, sample_rate = torchaudio.load(self.file_paths[index])

        # Convert to mono
        audio = torch.mean(audio, axis = 0)

        # Resample
        if sample_rate != self.target_sample_rate:
            resample = T.Resample(sample_rate, self.target_sample_rate)
            audio = resample(audio)

        # Adjust number of samples
        if audio.shape[0] > self.num_samples:
            # Crop
            audio = audio[:self.num_samples]
        elif audio.shape[0] < self.num_samples:
            # Pad
            audio = F.pad(audio, (0, self.num_samples - audio.shape[0]))

        # Add any preprocessing you like here
        # (e.g., noise removal, etc.)
        ...

        # Add any data augmentations for waveform you like here
        # (e.g., noise injection, shifting time, changing speed and pitch)
        ...

        # Convert to Mel spectrogram
        melspectrogram = T.MelSpectrogram(sample_rate = self.target_sample_rate,
                                          n_mels = 128,
                                          n_fft = 2048,
                                          hop_length = 512)
        melspec = melspectrogram(audio)

        # Add any data augmentations for spectrogram you like here
        # (e.g., Mixup, cutmix, time masking, frequency masking)
        ...

        return {"image": torch.stack([melspec]),
                "label": torch.tensor(self.labels[index]).float()}
```
Your resulting dataset should produce samples that look something like this before we feed them to the neural network:

Step 2: Apply Augmentations to Audio Data
One technique to tackle this competition’s challenges of domain shift and class imbalance was to apply data augmentations to the training data [5, 8, 10, 11]. You can apply data augmentations to audio data both in the waveform and in the spectrogram. The [torchaudio](https://pytorch.org/audio/stable/index.html) library already provides many different data augmentations for audio data.
Popular data augmentation techniques for audio data in waveform (time domain) are:
- Noise injection, like white noise, colored noise, or background noise ([AddNoise](https://pytorch.org/audio/stable/generated/torchaudio.transforms.AddNoise.html#torchaudio.transforms.AddNoise))
- Shifting time
- Changing speed ([Speed](https://pytorch.org/audio/stable/generated/torchaudio.transforms.Speed.html#torchaudio.transforms.Speed); alternatively, use [TimeStretch](https://pytorch.org/audio/stable/generated/torchaudio.transforms.TimeStretch.html#torchaudio.transforms.TimeStretch) in the frequency domain)
- Changing pitch ([PitchShift](https://pytorch.org/audio/stable/generated/torchaudio.transforms.PitchShift.html#torchaudio.transforms.PitchShift))

Popular data augmentation techniques for audio data in the spectrogram (frequency domain) are:
- Popular image augmentation techniques like Mixup [13] or Cutmix [12]
![Data Augmentation for Spectrogram: Mixup [13]](https://towardsdatascience.com/wp-content/uploads/2023/04/0yijgOMqL4JhKSXCc.png)
- SpecAugment [7] ([FrequencyMasking](https://pytorch.org/audio/stable/generated/torchaudio.transforms.FrequencyMasking.html#torchaudio.transforms.FrequencyMasking) and [TimeMasking](https://pytorch.org/audio/stable/generated/torchaudio.transforms.TimeMasking.html#torchaudio.transforms.TimeMasking))
![Data Augmentation for Spectrogram: SpecAugment [7]](https://towardsdatascience.com/wp-content/uploads/2023/04/0zCvyef9lsPte41Qw.png)
As you can see, while [torchaudio](https://pytorch.org/audio/stable/index.html) provides a lot of audio augmentations, it doesn’t cover all of the techniques listed above. Thus, if you want to inject a specific type of noise, shift the time, or apply Mixup [13] or Cutmix [12] augmentations, you must write a custom data augmentation in PyTorch. You can reference a collection of audio data augmentation techniques for their implementations.
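As an illustration, here is a minimal sketch of two custom waveform augmentations, noise injection and a random time shift; the class names and parameter values are made up for this example and can be applied at the waveform-augmentation placeholder in the `Dataset` above:

```python
class GaussianNoise:
    """Add Gaussian noise to the waveform with a given probability."""
    def __init__(self, std = 0.005, p = 0.5):
        self.std = std
        self.p = p

    def __call__(self, audio):
        if torch.rand(1).item() < self.p:
            audio = audio + self.std * torch.randn_like(audio)
        return audio


class TimeShift:
    """Randomly shift the waveform along the time axis (wrapping around)."""
    def __init__(self, max_shift = 16000, p = 0.5):
        self.max_shift = max_shift
        self.p = p

    def __call__(self, audio):
        if torch.rand(1).item() < self.p:
            shift = int(torch.randint(-self.max_shift, self.max_shift, (1,)).item())
            audio = torch.roll(audio, shifts = shift, dims = 0)
        return audio
```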
In the example PyTorch `Dataset` class from before, you can apply the data augmentations as follows:
```python
class AudioDataset(Dataset):
    def __init__(self,
                 df,
                 audio_length,
                 target_sample_rate = 32000):
        self.df = df
        self.file_paths = df['file_path'].values
        self.labels = df[['class_0', ..., 'class_N']].values
        self.target_sample_rate = target_sample_rate
        self.num_samples = target_sample_rate * audio_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        # Load audio from file to waveform
        audio, sample_rate = torchaudio.load(self.file_paths[index])

        # Add any preprocessing you like here
        # (e.g., converting to mono, resampling, adjusting size, noise removal, etc.)
        ...

        # Add any data augmentations for waveform you like here
        # (e.g., noise injection, shifting time, changing speed and pitch)
        wave_transforms = T.PitchShift(sample_rate, 4)
        audio = wave_transforms(audio)

        # Convert to Mel spectrogram
        melspec = ...

        # Add any data augmentations for spectrogram you like here
        # (e.g., Mixup, cutmix, time masking, frequency masking)
        spec_transforms = T.FrequencyMasking(freq_mask_param = 80)
        melspec = spec_transforms(melspec)

        return {"image": torch.stack([melspec]),
                "label": torch.tensor(self.labels[index]).float()}
```
Step 3: Fine-tune a Pretrained Image Classification Model for Few-Shot Learning
In this competition, we are dealing with class imbalance. Because some classes have only one sample, we are also dealing with a few-shot learning problem. Nakamura and Harada [6] showed in 2019 that fine-tuning can be an effective approach to few-shot learning.
A lot of competitors [2, 5, 8, 10, 11] fine-tuned common pre-trained image classification models such as:
- EfficientNet (e.g., `tf_efficientnet_b3_ns`) [9]
- SE-ResNeXt (e.g., `se_resnext50_32x4d`) [3]
- NFNet (e.g., `eca_nfnet_l0`) [1]
You can load any pre-trained image classification model with the [timm](https://timm.fast.ai/) library for fine-tuning. Make sure to set `in_chans = 1`, as we are not working with 3-channel images but with 1-channel Mel spectrograms.
```python
class AudioModel(nn.Module):
    def __init__(self,
                 num_classes,
                 model_name = 'tf_efficientnet_b3_ns',
                 pretrained = True):
        super(AudioModel, self).__init__()
        self.model = timm.create_model(model_name,
                                       pretrained = pretrained,
                                       in_chans = 1)
        self.in_features = self.model.classifier.in_features
        self.model.classifier = nn.Sequential(
            nn.Linear(self.in_features, num_classes)
        )

    def forward(self, images):
        logits = self.model(images)
        return logits
```
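A quick sanity check of the model with a dummy batch of 1-channel Mel spectrograms (the number of classes and the input shape are example values):

```python
model = AudioModel(num_classes = 21)       # e.g., 21 bird species
dummy_batch = torch.randn(8, 1, 128, 313)  # (batch, channels, n_mels, time frames)
logits = model(dummy_batch)                # -> shape (8, 21)
```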
Other competitors reported success with fine-tuning models pre-trained on similar audio classification problems [4, 10].
Fine-tuning is done with a cosine annealing learning rate scheduler ([CosineAnnealingLR](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html#torch.optim.lr_scheduler.CosineAnnealingLR)) for a few epochs [2, 8].
```python
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                       T_max = ...,    # Maximum number of iterations
                                                       eta_min = ...)  # Minimum learning rate
```
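Putting the pieces together, here is a minimal sketch of a fine-tuning loop. The loss function, learning rate, and number of epochs are example choices, and the `train_loader` from the earlier sketch is assumed; these are not settings from a specific winning solution:

```python
model = AudioModel(num_classes = 21)
criterion = nn.BCEWithLogitsLoss()           # multi-label classification loss
optimizer = optim.Adam(model.parameters(), lr = 1e-4)
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max = 10, eta_min = 1e-6)

for epoch in range(10):                      # fine-tune for a few epochs
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        logits = model(batch["image"])
        loss = criterion(logits, batch["label"])
        loss.backward()
        optimizer.step()
    scheduler.step()                         # anneal the learning rate once per epoch
```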

You can find more tips and best practices in this guide for fine-tuning Deep Learning models:
Summary
There are many more lessons to be learned from reviewing the learning resources Kagglers have created during the course of the "BirdCLEF 2022" competition. There are also many different solutions for this type of problem statement.
In this article, we focused on the general approach that was popular among many competitors:
- Convert the audio classification problem into an image classification problem by turning the waveform audio into a Mel spectrogram and applying a Deep Learning model
- Apply data augmentations to the audio data, both in waveform and in spectrogram form, to tackle the domain shift and class imbalance
- Fine-tune a pre-trained image classification model to tackle the class imbalance
Enjoyed This Story?
Subscribe for free to get notified when I publish a new story.
Find me on LinkedIn, Twitter, and Kaggle!
References
Dataset
As the original competition data does not allow commercial use, examples are done with the following dataset.
[0] Warden P. Speech Commands: A public dataset for single-word speech recognition, 2017. Available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz
License: CC-BY-4.0
Image References
If not otherwise stated, all images are created by the author.
Web & Literature
[1] Brock, A., De, S., Smith, S. L., & Simonyan, K. (2021, July). High-performance large-scale image recognition without normalization. In International Conference on Machine Learning (pp. 1059–1071). PMLR.
[2] Chai Time Data Science (2022). BirdCLEF 2022: 11th Pos Gold Solution | Gilles Vandewiele (accessed March 13th, 2023)
[3] Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132–7141).
[4] Kramarenko Vladislav (2022). 4th place in Kaggle Discussions (accessed March 13th, 2023)
[5] LeonShangguan (2022). [Public #1 Private #2] + [Private #7/8 (potential)] solutions. The host wins. in Kaggle Discussions (accessed March 13th, 2023)
[6] Nakamura, A., & Harada, T. (2019). Revisiting fine-tuning for few-shot learning. arXiv preprint arXiv:1910.00216.
[7] Park, D. S., Chan, W., Zhang, Y., Chiu, C. C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.
[8] slime (2022). 3rd place solution in Kaggle Discussions (accessed March 13th, 2023)
[9] Tan, M., & Le, Q. (2019, May). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105–6114). PMLR.
[10] Volodymyr (2022). 1st place solution models (it’s not all BirdNet) in Kaggle Discussions (accessed March 13th, 2023)
[11] yokuyama (2022). 5th place solution in Kaggle Discussions (accessed March 13th, 2023)
[12] Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6023–6032).
[13] Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.