Using deep learning to tag the genre, mood and instrumentation of an audio track
Over the last few years, neural networks have become the go-to technique for a variety of music classification tasks, one of them being automatic genre tagging. When trained properly, such networks are powerful tools for enriching databases with metadata about audio files, e.g. tempo and musical key. At AI Music we need quick access to tools like these for our data pipelines. A tagging tool could, for example, determine the BPM of each song in an automated ‘DJ mix’ so that the next song can be selected at a fitting tempo, or tag and group our catalogue by key, tempo and other features.
Beyond musical key and tempo, neural networks also make it possible to train a system to detect the genre, mood and instrumentation of a musical piece. The Convolutional Recurrent Neural Network (CRNN) in particular has achieved very good results in music classification. Given a sufficiently large and well-labelled dataset, a Convolutional Neural Network (CNN) can likewise be trained into a highly accurate music tagging tool.
In this article we will briefly explore two deep neural network architectures, the CNN and the CRNN, and how they perform when trained on our datasets. Both architectures have been trained on the same tasks of classifying genre, mood and instrumentation, producing 6 models (3x CNN and 3x CRNN). The architecture parameters remained the same for each task, apart from some hyperparameters which were optimised for each task individually.
Data and Features
The data used to train both architectures (CNN & CRNN) consists of a large number of audio recordings with attached metadata containing the ground truths about the genre, mood and instrumentation of each recording. Among others, the IRMAS and MagnaTagATune datasets have been used for training. Each task has a different number of classes: in our case the models have been trained on 30 genres, 20 moods and 20 instrumentation types. Here are some examples of classes for each task:
Genre: rock, indie, drum and bass, techno
Mood: sad, cheerful, eerie, euphoric
Instrumentation: tuned percussion instruments, electric guitar, violin, saxophone
Every class was randomly and evenly sampled to obtain a varied and non-skewed dataset. For faster training of the models, the audio has been downsampled to 16 kHz and split into 30-second excerpts.
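As a rough illustration of this preprocessing step, the sketch below uses librosa to resample a file to 16 kHz and cut it into 30-second excerpts; the function name and the decision to drop the trailing remainder are our own assumptions, not a description of the exact pipeline.

```python
import librosa

TARGET_SR = 16_000       # 16 kHz sample rate used for training
EXCERPT_SECONDS = 30     # length of each training excerpt

def load_excerpts(path):
    """Load an audio file, resample it to 16 kHz and split it into 30-second excerpts."""
    # librosa resamples on load when sr is given and mixes down to mono
    audio, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    samples_per_excerpt = TARGET_SR * EXCERPT_SECONDS
    # Drop the trailing remainder so every excerpt has the same length
    n_excerpts = len(audio) // samples_per_excerpt
    return [
        audio[i * samples_per_excerpt:(i + 1) * samples_per_excerpt]
        for i in range(n_excerpts)
    ]
```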
In order for the model to learn meaningful information from the audio data, it needs features on which it can base its weight adjustments. The most common feature representation for audio classification tasks is the mel spectrogram, a 2-dimensional matrix describing how the energy in the audio is distributed across frequency over time. Using the famous Fourier transform, we move the audio signal from the time domain into the frequency domain. The frequency axis is then mapped onto the mel scale, so that equal distances along it correspond to pitch differences that sound equally far apart to the listener.
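For illustration, a mel spectrogram of such an excerpt could be computed with librosa roughly as follows; the FFT size, hop length and number of mel bands here are placeholder values rather than the exact settings used for the models.

```python
import librosa
import numpy as np

def mel_spectrogram(audio, sr=16_000, n_fft=1024, hop_length=512, n_mels=96):
    """Short-time Fourier transform -> mel filterbank -> log magnitude."""
    # Power spectrogram mapped onto the mel scale
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Log compression so that level differences better match perceived loudness
    return librosa.power_to_db(mel, ref=np.max)
```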

Multi-label Classification Problem
A musical genre can have influences from other genres. A musical piece might have more than one mood associated with it. When detecting the most prominent instrument in a recording, you might also want to know what the second, or even third, detected instruments are. Multi-label classification can be defined as a problem where, for an input x, the output is a separate probability for each class in y, and any number of classes can apply at once. By treating genre, mood and instrument tagging as multi-label classification problems and assigning more than one ground truth to a high percentage of the audio tracks used for training, it is possible to obtain probability values for, e.g., the top 3 moods associated with a track.
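As a small sketch of what this means in practice, the snippet below assumes a model with one sigmoid output per class (the mood names and probability values are purely illustrative) and extracts the top 3 tags:

```python
import numpy as np

# Multi-label: one independent sigmoid per class, so probabilities do not
# need to sum to 1 and several tags can be active at once.
MOOD_CLASSES = ["sad", "cheerful", "eerie", "euphoric"]  # illustrative subset

def top_k_tags(probabilities, class_names, k=3):
    """Return the k most confident tags with their probabilities."""
    order = np.argsort(probabilities)[::-1][:k]
    return [(class_names[i], float(probabilities[i])) for i in order]

# Example probabilities as produced by a sigmoid output layer
probs = np.array([0.08, 0.71, 0.12, 0.55])
print(top_k_tags(probs, MOOD_CLASSES))  # [('cheerful', 0.71), ('euphoric', 0.55), ('eerie', 0.12)]
```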
CNN
Since the rise of neural networks, many different network architectures have emerged. One of the best-performing is the CNN, also known as the ConvNet. This architecture is loosely inspired by the human visual cortex and performs convolution operations which extract features that are important for classification while reducing the dimensionality of the incoming data (images, or in our case spectrograms). In short, the architecture is able to capture spatial and temporal dependencies in a spectrogram by applying relevant filters.
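A minimal Keras-style sketch of such a CNN over log-mel spectrograms is shown below; the layer sizes, input shape and number of classes are illustrative assumptions, not the architecture actually used for the taggers.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(96, 938, 1), n_classes=30):
    """Small ConvNet over log-mel spectrograms with a sigmoid multi-label output."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),                  # pooling shrinks the frequency/time grid
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.GlobalAveragePooling2D(),         # collapse to one feature vector per excerpt
        layers.Dense(n_classes, activation="sigmoid"),  # one independent probability per tag
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="roc_auc")])
    return model
```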

CRNN
The CRNN uses the CNN architecture for feature extraction, with gated recurrent units (GRUs) placed at the end of the architecture to summarise the temporal information of the extracted features. The GRU is a simplified version of the long short-term memory (LSTM) unit and has been chosen for its quicker training time and comparable results. It is a layer added on top of the CNN architecture that learns how much of the information from previous time steps to pass along to the future, and which information to drop. This makes it possible for the network to focus on the features it deems important. The CRNN is based on the CNN architecture described above, with a smaller convolution kernel size and GRU units added after the 5th convolution layer.
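The sketch below illustrates the general idea in Keras, with a stack of five convolution layers followed by GRUs over the time axis; the filter counts, GRU sizes and input shape are assumptions for illustration only.

```python
from tensorflow.keras import layers, models

def build_crnn(input_shape=(96, 938, 1), n_classes=30):
    """Conv layers for feature extraction, GRUs to summarise them over time."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 64, 128, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    # Collapse the frequency axis but keep the time steps for the recurrent layers:
    # (freq, time, channels) -> (time, freq * channels)
    _, f, t, c = x.shape
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((t, f * c))(x)
    x = layers.GRU(64, return_sequences=True)(x)
    x = layers.GRU(64)(x)                        # final state summarises the whole excerpt
    outputs = layers.Dense(n_classes, activation="sigmoid")(x)
    return models.Model(inputs, outputs)
```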

Hyperparameter Optimisation
Hyperparameter optimisation is an important step in the training process. It ensures that we choose the right optimisation algorithm for our model and data to achieve the best possible results. To determine the best optimisation algorithm for each architecture, a grid search has been performed over the two most popular optimisers: Adam, which combines ideas from the Root Mean Square Propagation (RMSProp) algorithm and the Adaptive Gradient Algorithm (AdaGrad), and Stochastic Gradient Descent (SGD). The grid search tried different hyperparameter values for both optimisers, training on a small subset of the data held out from the test and training sets. Adam was chosen for two out of the three models.
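A rough sketch of this kind of grid search is shown below, assuming a Keras model builder like the ones above and a small held-out subset; the learning-rate grids and the number of epochs are illustrative, not the values actually searched.

```python
import tensorflow as tf

# Illustrative grids; the values actually searched will differ
OPTIMISER_GRID = {
    "adam": [1e-4, 3e-4, 1e-3],
    "sgd":  [1e-3, 1e-2, 1e-1],
}

def make_optimiser(name, lr):
    if name == "adam":
        return tf.keras.optimizers.Adam(learning_rate=lr)
    return tf.keras.optimizers.SGD(learning_rate=lr, momentum=0.9)

def grid_search(build_model, x_small, y_small, x_val, y_val):
    """Train briefly on a small subset for each optimiser/learning-rate pair."""
    best = None
    for name, lrs in OPTIMISER_GRID.items():
        for lr in lrs:
            model = build_model()
            model.compile(optimizer=make_optimiser(name, lr),
                          loss="binary_crossentropy",
                          metrics=[tf.keras.metrics.AUC(name="roc_auc")])
            model.fit(x_small, y_small, epochs=5, batch_size=32, verbose=0)
            _, auc = model.evaluate(x_val, y_val, verbose=0)
            if best is None or auc > best[0]:
                best = (auc, name, lr)
    return best  # (validation AUC, optimiser name, learning rate)
```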
Results
Given its good performance in the Natural Language Processing (NLP) field and its ability to extract and summarise temporal information from spectrograms, one would assume the CRNN would outperform the CNN on all tasks (genre, mood and instrumentation tagging). It turns out that a well-optimised CNN outperforms the CRNN in this case, especially for genre and mood. One could speculate that during the optimisation process a better optimiser and set of hyperparameters were found for the CNN. Since the models were not far off each other result-wise, running a few more experiments with various optimisation techniques could yet prove the CRNN to be more effective.
At AI Music, in addition to examining ROC-AUC scores as a performance measure, we like to evaluate a model by throwing large amounts of real-world data at it. Below is a prediction made on Claude Debussy’s ‘Clair de Lune’ by our genre, mood and instrumentation taggers. The tags are ordered from the highest to the lowest confidence value.
GENRE: CLASSICAL (CNN)
MOOD: CALM, SPARSE, MAGICAL (CNN)
INSTRUMENTATION: PIANO, ORCHESTRAL ENSEMBLES (CRNN)
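For reference, the ROC-AUC evaluation mentioned above can be computed with scikit-learn roughly as follows; the arrays here are tiny placeholders standing in for the real ground-truth and prediction matrices.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: binary ground-truth tag matrix, shape (n_tracks, n_classes)
# y_pred: sigmoid outputs from the model, same shape
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3], [0.6, 0.7, 0.2]])

# Macro average treats every tag equally, regardless of how common it is
print("macro ROC-AUC:", roc_auc_score(y_true, y_pred, average="macro"))
```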
Even though the CRNN outperforms the CNN on certain examples, the CNN generalises better and performs better on larger and more varied amounts of data, which is why it has been incorporated into our pipeline. Having fewer layers, it also makes its predictions faster. Interestingly, the models for the three tasks, even though similar in architecture, required different optimiser parameters in order to achieve good results. The causes of this are yet to be explored in another article.
Sign up to our newsletter at AI Music to keep up to date with the research we do here and discover more about the company exploring the ways AI can help shape music production and delivery.
References
Keunwoo Choi, Gyorgy Fazekas, Mark Sandler. "Automatic Tagging Using Deep Convolutional Neural Networks". Proceedings of the 17th ISMIR Conference, New York City, USA, 2016.
Yoonchang Han, Jaehun Kim, Kyogu Lee. "Deep Convolutional Neural Networks for Predominant Instrument Recognition in Polyphonic Music". IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.