Calculating Audio Song Similarity Using Siamese Neural Networks

Thomas Vassallo
Towards Data Science
6 min read · Aug 28, 2020


Introduction

At AI Music, where our back catalogue of content grows every day, it is becoming increasingly necessary for us to build more intelligent systems for searching and querying the music. One such system depends on the ability to define and quantify the degree of similarity between songs. The core methodology described here tackles the concept of acoustic similarity.

Searching for a song using descriptive tags often introduces the issue of semantic inconsistency. Tags are highly subjective, varying with the age group, culture, and personal preferences of the listener. For example, descriptors such as ‘bright’ or ‘cold’ could mean entirely different things to different people. Music can also sit in blurry areas when it comes to genre. A song such as Sabotage by the Beastie Boys is primarily known as a Hip-Hop/Rap song, yet it contains many of the sonic qualities we would traditionally attribute to a Rock song. Retrieving a similar song, or a ranked list of similar songs, from a large catalogue using an example reference track avoids such issues.

Nevertheless, when we perceive two or more songs to be similar to one another, what does this actually mean? This perceived similarity is often very difficult to define as it comprises a number of different aspects, such as genre, instrumentation, mood, tempo and many more. To complicate the problem further, similarity tends to be made up of an unrestricted combination of such characteristics. With song similarity being such a subjective concept, how do we tackle the issue of defining a ground truth?

How did we approach the problem?

Traditional methods for determining the similarity between songs require you to select and extract musical features from the audio. How close these features are to one another within a feature space is then presumed to reflect the perceptual similarity of the respective tracks. One problem with this approach is determining which features best map to perceived similarity. At AI Music, we tackle this problem with an approach based on Siamese Neural Networks (SNNs).
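As a point of comparison, a minimal sketch of the traditional hand-crafted feature approach might look like the following. It assumes librosa is installed and uses MFCC statistics as the chosen features; the feature choice and distance metric here are illustrative, not the ones we use in production.

```python
import librosa
import numpy as np

def handcrafted_features(path, sr=22050):
    """Summarise a track with the mean and std of its MFCCs (an illustrative feature choice)."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def feature_distance(path_a, path_b):
    """Euclidean distance between feature vectors: smaller = presumed more similar."""
    return float(np.linalg.norm(handcrafted_features(path_a) - handcrafted_features(path_b)))
```

The weakness is exactly the one described above: someone has to decide, by hand, that these particular features are the ones that capture perceived similarity.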

The SNN is built on a Convolutional Neural Network (CNN) architecture, which means we need to transform the audio into an image. The most common image representation of audio is the waveform, where signal amplitude is plotted against time. For our application we use a visual representation known as a spectrogram, specifically a mel spectrogram.

  • A spectrogram uses the Fourier transform to produce a frequency distribution of the signal against time.
  • A mel spectrogram is a spectrogram where the frequencies are mapped to the mel scale.
  • The mel scale is log spaced, resulting in a representation that more closely correlates with human hearing.

We have chosen mel spectrograms as they have been found to be good representations of the timbre of a sound, and are therefore better suited to capturing the acoustic characteristics of a song.

Figure 1: Comparison of waveform, spectrogram and mel spectrogram

As we can see from the above image, relevant musical information is revealed more clearly in the mel spectrogram.
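As a rough sketch of how such an input might be produced (assuming librosa; the exact window, hop and mel-band settings we use internally are not specified here):

```python
import librosa
import numpy as np

def mel_spectrogram(path, sr=22050, n_fft=2048, hop_length=512, n_mels=128):
    """Load a track and convert it to a log-scaled mel spectrogram 'image'."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Convert power to decibels so the dynamic range better matches human loudness perception.
    return librosa.power_to_db(S, ref=np.max)
```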

Data Pairs

The siamese network is based on two ‘branches’ that are architecturally identical and share the same weights. One branch accepts the mel spectrogram of a ‘reference’ track as input, while the other accepts the mel spectrogram of a ‘difference’ track. Data pairs are created for this input by calculating each datapoint’s similarity with every other datapoint. Every datapoint is used as the ‘reference’ to generate 5 positive and 5 negative ‘difference’ pairs. A similarity matrix of the dataset is generated using a similarity coefficient based on the number of descriptive tags and musical components shared by two datapoints. Similar and dissimilar tracks can then be selected using this matrix.

We generate pairs using the following rules (a small sketch of this pairing logic follows the list):

5 Positive Pairs:

  • The reference track itself
  • 4 positive matches from a random selection of the top 10 most similar tracks

5 Negative Pairs:

  • 5 negative matches from a random selection of the top 10 least similar tracks
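A minimal sketch of how such pairs might be drawn from a precomputed similarity matrix (NumPy only; the similarity coefficient itself and the exact sampling details are simplified assumptions here):

```python
import numpy as np

def make_pairs(sim_matrix, rng=np.random.default_rng(0)):
    """For each reference track, draw 5 positive and 5 negative 'difference' tracks.

    sim_matrix: (n, n) array where larger values mean more similar tracks.
    Returns a list of (reference_idx, difference_idx, label) with label 1 = similar.
    """
    n = sim_matrix.shape[0]
    pairs = []
    for ref in range(n):
        order = np.argsort(sim_matrix[ref])                  # least similar ... most similar
        top10 = [i for i in order[::-1] if i != ref][:10]    # 10 most similar, excluding self
        bottom10 = [i for i in order if i != ref][:10]       # 10 least similar

        positives = [ref] + list(rng.choice(top10, size=4, replace=False))
        negatives = list(rng.choice(bottom10, size=5, replace=False))

        pairs += [(ref, p, 1) for p in positives]
        pairs += [(ref, q, 0) for q in negatives]
    return pairs
```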

These image pairs are passed into the model, where a feature vector (embedding) is generated for each. The euclidean distance between the two feature vectors is then calculated, giving a similarity score. A contrastive loss dictates how the model weights are updated: it aims to minimise the distance between the feature vectors of similar pairs, and to push dissimilar pairs apart by at least a distance margin derived from the metadata.

Figure 2: Siamese neural network flow diagram
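A rough sketch of this loss in PyTorch (an assumption on our part, since the framework is not specified here; the fixed margin is the standard formulation rather than our exact metadata-derived margin):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_ref, emb_diff, label, margin=1.0):
    """label = 1 for similar pairs, 0 for dissimilar pairs.

    Similar pairs are pulled together; dissimilar pairs are pushed apart
    until they are at least `margin` away from each other.
    """
    d = F.pairwise_distance(emb_ref, emb_diff)  # euclidean distance per pair
    loss = label * d.pow(2) + (1 - label) * torch.clamp(margin - d, min=0).pow(2)
    return loss.mean()
```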

How do we use the trained Siamese model?

Once we have a trained model, we split the siamese network. This leaves us with a single branch, without the final layer where the euclidean distance is calculated. In this form the model is essentially a feature extractor. The entire back catalogue of music we wish to search can then be processed, leaving us with a database of corresponding feature vectors. A new ‘unheard’ song can then be sent through the same model, and the euclidean distance between the resultant feature vector and every feature vector in the database calculated. The results with the lowest scores indicate the most similar songs.
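In code, the retrieval step might look something like this minimal NumPy sketch, where the query embedding is assumed to come from the trained single branch:

```python
import numpy as np

def rank_catalogue(query_embedding, catalogue_embeddings, top_k=10):
    """Return indices of the top_k most similar catalogue tracks.

    query_embedding:       (d,) vector from the trained single branch.
    catalogue_embeddings:  (n, d) matrix of precomputed catalogue embeddings.
    """
    distances = np.linalg.norm(catalogue_embeddings - query_embedding, axis=1)
    return np.argsort(distances)[:top_k]  # smallest distance = most similar
```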

Audio Example 1 — Hip Hop
Audio Example 2 — Pop
Audio Example 3 — Country

Conclusion & Further work

Employing siamese networks to calculate song similarity means that we allow the system to determine the features that accurately represent the perceived similarity we hope to quantify. This leaves us with a song recommendation system that relies solely on the audio signal. That being said, the limitation of creating the ground truth still exists. Coming up with more intelligent ways to build the paired dataset is one of the larger challenges we need to overcome, particularly in areas of deep learning where large labelled datasets are not readily available.

To avoid having to calculate and provide a numeric similarity score for the data pairs when training the network, we are experimenting with the triplet loss function. Here we would have three branches in the network: one for the reference song, one for a positive match and another for a negative match. During training, the feature vectors of the reference and the positive example are pushed closer to one another, while the feature vectors of the reference track and the negative match are pushed further apart.
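A minimal sketch of this loss, again assuming PyTorch and an illustrative margin value:

```python
import torch
import torch.nn.functional as F

def triplet_loss(emb_ref, emb_pos, emb_neg, margin=1.0):
    """Pull the positive match towards the reference and push the negative away,
    until the negative is at least `margin` further from the reference than the positive."""
    d_pos = F.pairwise_distance(emb_ref, emb_pos)
    d_neg = F.pairwise_distance(emb_ref, emb_neg)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```

The appeal is that only the relative ordering (positive closer than negative) has to be correct, so no explicit similarity score is needed for each pair.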

The inherent perceptual estimation of song similarity may also change from person to person or use case to use case. Employing conditional similarity networks could allow the user to select which musical features or characteristics they deem to be most important.

Sign up to our newsletter at AI Music to keep up to date with the research we do here and to discover more about the company exploring the ways AI can help shape music production and delivery.

References

Pranay Manocha, Rohan Badlani, Anurag Kumar, Ankit Shah, Benjamin Elizalde, and Bhiksha Raj. “Content-Based Representations of Audio Using Siamese Neural Networks.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

Haoting Liang, Donghuo Zeng, Yi Yu, and Keizo Oyama. “Personalized Music Recommendation with Triplet Network.” 2019.

Jongpil Lee, Nicholas J. Bryan, Justin Salamon, Zeyu Jin, and Juhan Nam. “Disentangled Multidimensional Metric Learning for Music Similarity.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.

One Shot Learning with Siamese Networks using Keras
