Bird Song Classification using Siamese Networks and Dilated Convolutions

Aditya Dutt
Towards Data Science
9 min read · Jul 4, 2021


Introduction

Bioacoustics is very useful for studying the environment; it has long been used to track submarines and whales. Birds play a major role in shaping the plant life we see around us, and recognizing bird songs is important for automatic wildlife monitoring and the study of bird behavior. It lets us track birds without disturbing them, tell which bird species are present at a particular place, and learn something about their migration patterns. Every bird species has its own unique sounds. Birds use songs of varying length and complexity to attract mates, warn other birds of nearby danger, and mark their territory. Songbirds can even have different dialects depending on their geographical location, whereas the sounds of non-songbirds change little with geography. Using deep learning methods, we can classify birds based on their calls and songs with any of several neural network architectures, such as CNNs, Siamese Networks, or WaveNets.

Problem Statement

Goal: We want to classify different bird species given audio samples of their songs/calls. We can extract spectrograms from the audio samples and use them as features for classification. We will use the British Birdsong Dataset, available on Kaggle, for this experiment. The dataset is described in the Data Description section.

A Quick Introduction to Siamese Networks and Dilated Convolutions

Siamese Networks

I have written an article on Siamese Networks before. You can check it out for a deeper understanding of how they work and the loss functions used with them; code is also provided in that article. Here, I will give only a summary of the Siamese Network.

A Siamese network is a class of neural networks that contains two or more identical subnetworks. We feed a pair of inputs to these networks, and each network computes the features of one input. The similarity of the features is then computed using their difference or their dot product. For same-class input pairs, the target output is 1; for different-class input pairs, it is 0. Remember, the networks must share the same parameters and weights; if they do not, they are not Siamese.

Siamese Network basic structure

Different loss functions can be used for a Siamese network.

  1. Contrastive Loss: Contrastive loss takes a pair of inputs and trains the network so that the distance between embeddings is small for same-class pairs and large for different-class pairs. Although binary cross-entropy seems like a natural loss function for this problem, contrastive loss does a better job of differentiating between image pairs. Contrastive loss, L = Y * D² + (1 - Y) * max(margin - D, 0)²
  2. Triplet Loss: Triplet loss was introduced by Google in 2015 for face recognition. Here, the model takes three inputs: an anchor, a positive, and a negative. The anchor is a reference input, the positive belongs to the same class as the anchor, and the negative belongs to a random class other than the anchor's class. We minimize the distance between the anchor and the positive sample while simultaneously maximizing the distance between the anchor and the negative sample: L = max(D(a, p) - D(a, n) + margin, 0)

We will use triplet loss for our experiment.

Dilated Convolutions

Dilated Convolutions are a type of convolution that “inflate” the kernel by inserting holes between the kernel elements. They are also called atrous convolutions.

The concept of Dilated Convolution came from the wavelet decomposition in which the mother wavelet is scaled or dilated by different scales to capture different frequencies.

(a) Standard 3 × 3 kernel, (b) kernel with a dilation factor of 2, (c) kernel with a dilation factor of 4. Source: original paper

Dilated convolutions increase the receptive field at the same computation and memory cost and without loss of resolution. Stacked with exponentially increasing dilation factors, they can capture context from the entire input with the same number of parameters.
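To make this concrete, here is a minimal Keras sketch (mine, not from the original article) of stacking 1-D convolutions with exponentially increasing dilation rates; the input shape matches the spectrograms used later, but the layer width and kernel size are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Each doubling of the dilation rate roughly doubles the receptive field
# while the per-layer parameter count stays the same.
inputs = layers.Input(shape=(345, 163))  # (time steps, features per step)
x = inputs
for rate in [1, 2, 4, 8]:
    x = layers.Conv1D(64, kernel_size=3, dilation_rate=rate,
                      padding="same", activation="relu")(x)
model = tf.keras.Model(inputs, x)
model.summary()
```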

Here is a nice article on Dilated Convolutions by Sik-Ho Tsang.

Data Description

We are using the British Birdsong Dataset available on Kaggle for this experiment. It is a small subset of the Xeno-canto data collection, gathered to form a balanced dataset of 88 bird species found in the United Kingdom.

We are only classifying 9 bird species here: Canada Goose, Carrion Crow, Coal Tit, Common Blackbird, Common Chaffinch, Common Chiffchaff, Common Linnet, Common Moorhen, and Common Nightingale.

There are very few samples per bird in this dataset. The audio samples are around 40-60 seconds long; some of them are a little noisy, and sometimes other birds can be heard in the background. Clips of 2 seconds are extracted from each sample with 50% overlap to create new samples, which yields a sufficient number of samples for training the neural network. The data is divided into 60% for training and 40% for testing.

Implementation

Feature Extraction

The Librosa library in Python is used for music and audio analysis. We can use it to read audio files and extract spectrograms.

Step 1: Read the audio file using librosa. Normalize the time series between -1 and 1.

Step 2: Remove silence from audio.

Step 3: Split each audio file into 2-second-long clips with a 50% overlap.
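The original post embeds this preprocessing as code; here is a sketch of steps 1-3 using librosa. The sample rate and the top_db silence threshold are my assumptions, not necessarily the author's exact values:

```python
import librosa
import numpy as np

def extract_clips(path, sr=22050, clip_seconds=2.0, overlap=0.5, top_db=30):
    # Step 1: read the audio and normalize the time series to [-1, 1]
    y, _ = librosa.load(path, sr=sr)
    y = y / np.max(np.abs(y))

    # Step 2: keep only the non-silent intervals (top_db is an assumed threshold)
    intervals = librosa.effects.split(y, top_db=top_db)
    y = np.concatenate([y[start:end] for start, end in intervals])

    # Step 3: slice into fixed-length clips with 50% overlap
    clip_len = int(clip_seconds * sr)
    hop = int(clip_len * (1 - overlap))
    return np.array([y[i:i + clip_len]
                     for i in range(0, len(y) - clip_len + 1, hop)])
```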

Step 4: Divide the data into training and testing. 60% is used for training and 40% for testing.

Step 5: Extract spectrograms from the samples. A band-pass filter is applied to keep the frequency range between 1 kHz and 8 kHz, since most bird songs fall within it. Then standardize all the spectrograms (you can also normalize them between 0 and 1 instead). Here, the shape of each spectrogram is 163 x 345.
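Here is a sketch of this step; the STFT parameters (n_fft, hop_length) are illustrative guesses and will not necessarily reproduce the 163 x 345 shape reported above:

```python
import librosa
import numpy as np

def clip_to_spectrogram(clip, sr=22050, n_fft=1024, hop_length=128,
                        fmin=1000, fmax=8000):
    # Magnitude spectrogram via the short-time Fourier transform
    S = np.abs(librosa.stft(clip, n_fft=n_fft, hop_length=hop_length))
    S_db = librosa.amplitude_to_db(S, ref=np.max)

    # Keep only the bins between 1 kHz and 8 kHz, where most bird songs fall
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    band = (freqs >= fmin) & (freqs <= fmax)
    S_db = S_db[band, :]

    # Standardize to zero mean and unit variance
    return (S_db - S_db.mean()) / (S_db.std() + 1e-8)
```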

Sample Spectrogram of Common BlackBird Song

Step 6: Generate positive and negative pairs of samples for the Siamese Network.

First, generate positive pairs.

Now, generate negative pairs.

You can generate both positive and negative pairs using the function below. It takes as input the feature matrix, the target class labels, the number of random samples drawn from each class, and the number of positive pairs. It returns anchor, positive, and negative samples.
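The original gist is not reproduced in this text, so below is a sketch of what such a function could look like; the function name and the random sampling strategy are assumptions based on the description above:

```python
import numpy as np

def generate_triplets(features, labels, samples_per_class, n_pairs):
    anchors, positives, negatives = [], [], []
    for c in np.unique(labels):
        # Indices belonging to this class and to all other classes
        same = np.where(labels == c)[0]
        diff = np.where(labels != c)[0]
        # Work with a random subset of each class
        same = np.random.choice(same, size=min(samples_per_class, len(same)),
                                replace=False)
        for _ in range(n_pairs):
            a, p = np.random.choice(same, size=2, replace=False)  # positive pair
            n = np.random.choice(diff)                            # negative sample
            anchors.append(features[a])
            positives.append(features[p])
            negatives.append(features[n])
    return np.array(anchors), np.array(positives), np.array(negatives)
```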

Now the data is ready for the Siamese Network.

We have 3 types of inputs: anchor, positive, and negative samples. The shape of each input is: (10800 x 345 x 163).

Now, we need to build a neural network.

Build the Neural Network

The encoder model contains eight 1-D convolution layers with exponentially increasing dilation factors, followed by a 1-D convolution layer that reduces the number of features and, finally, a GlobalMaxPooling1D layer. A batch normalization layer is applied after each convolution. All layers use a 'relu' activation except the last one, which uses 'tanh'. The final layer outputs a 32-dimensional vector.
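Here is a Keras sketch of the encoder as just described; the filter count and kernel size are my assumptions, but the structure follows the text:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(input_shape=(345, 163), embedding_dim=32):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    # Eight dilated 1-D convolutions with exponentially increasing
    # dilation factors, each followed by batch normalization
    for i in range(8):
        x = layers.Conv1D(64, kernel_size=3, dilation_rate=2 ** i,
                          padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    # 1x1 convolution to shrink the feature dimension, with tanh at the end
    x = layers.Conv1D(embedding_dim, kernel_size=1, activation="tanh")(x)
    x = layers.GlobalMaxPooling1D()(x)
    return tf.keras.Model(inputs, x, name="encoder")
```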

The encoder model is applied to each of the three inputs: anchor, positive, and negative. All three 32-dimensional feature vectors are concatenated into a 96-dimensional vector, and this concatenated vector is treated as the model output. As you can see in the code below, the triplet_loss function takes the output, separates the three embeddings again, and computes the loss.

Below you can see the code for the model.
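The embedded gist is not reproduced in this text, so here is a sketch of the Siamese assembly and the triplet_loss function based on the description above; the margin value and the optimizer are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

encoder = build_encoder()  # the encoder sketched above

anchor_in = layers.Input(shape=(345, 163))
positive_in = layers.Input(shape=(345, 163))
negative_in = layers.Input(shape=(345, 163))

# Calling the same encoder on all three inputs shares its weights,
# which is what makes the network Siamese
merged = layers.Concatenate(axis=-1)(
    [encoder(anchor_in), encoder(positive_in), encoder(negative_in)])
siamese = tf.keras.Model([anchor_in, positive_in, negative_in], merged)

def triplet_loss(y_true, y_pred, margin=1.0, dim=32):
    # Split the concatenated 96-d output back into three 32-d embeddings
    a = y_pred[:, :dim]
    p = y_pred[:, dim:2 * dim]
    n = y_pred[:, 2 * dim:]
    d_ap = tf.reduce_sum(tf.square(a - p), axis=-1)
    d_an = tf.reduce_sum(tf.square(a - n), axis=-1)
    # Pull the positive closer than the negative by at least the margin
    return tf.maximum(d_ap - d_an + margin, 0.0)

siamese.compile(optimizer="adam", loss=triplet_loss)
```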

Here is the encoder model summary:

Encoder Model Summary
Encoder Model Architecture

Now we have a complete Siamese Network.

Train the Model

We can fit the model now. The target output of the model is a dummy output, because we never compare the model's output to a target; instead, we only minimize the distance between same-class embeddings and push apart different-class embeddings.

The input size for each of the anchor, positive, and negative samples is (10800 x 345 x 163). The batch size is set to 256. The model converged in only 30 epochs, thanks to batch normalization; you can train longer if you wish.
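A sketch of the training call, assuming the model and the triplet arrays from the sketches above; the dummy target exists only to satisfy Keras's fit() signature, since triplet_loss never reads it:

```python
import numpy as np

dummy_y = np.zeros((len(anchors), 1))  # ignored by triplet_loss
siamese.fit([anchors, positives, negatives], dummy_y,
            batch_size=256, epochs=30)
```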

Results

The accuracy of the model is 98.1% on the training set and 97.3% on the test dataset.

Here is the normalized confusion matrix of the test dataset:

Confusion matrix of the test dataset

Here is the similarity matrix of embeddings:

Similarity Matrix of the test dataset embeddings

Below is a scatter plot of the test dataset embeddings after applying PCA to them. There is good separation between the classes, and each class forms a tight cluster, apart from a few stray samples.

A scatter plot of test dataset embeddings after applying PCA

Important Points

  1. Here, spectrograms are used as features; mel-spectrograms can be used as well.
  2. The wavelet transform of the audio can also be used as a feature.
  3. Experiments can be conducted with longer frame sizes, but a frame size that is too long will overfit the model and decrease overall performance.
  4. The batch normalization layer plays a very important role. It normalizes the mini-batches during training and addresses the problem of internal covariate shift, making training faster and the model more robust. I highly recommend using it in your models.
  5. An ensemble of models could increase the accuracy further if you want to classify more species of birds.

Conclusion

Siamese Networks successfully classified birds based on their songs with 97% accuracy. Siamese Networks convert a classification problem into a similarity problem, and they can work with fewer samples because we generate pairs of samples. The model with dilated 1-D convolutions and batch normalization layers converged very quickly.

Here is the GitHub repository for this project.

Future Experiment Ideas

This model can be extended to classify more than 50 or 100 bird species. I don't know much about birds because it's not my field of research, but I am very curious to see whether we can identify different geographical locations based on bird songs, since songbird dialects change with geographical location. Here is a GitHub repository that contains a list of datasets related to birds. The Xeno-canto website contains bird sounds from all over the world, organized by country, species, and more. You can select a dataset from there for your own project.

Remember, you can also use this model for similar audio-based tasks, such as speaker classification, emotion detection, etc.

Thank you so much for reading! I hope it was helpful.🐧

Feel free to contact me, if you want to collaborate on a project or need some ideas for your projects.
