Automated Guitar Transcription with Deep Learning

Using Convolutional Neural Networks to expedite learning music.

Darren Tio
Towards Data Science


Photo by Jacek Dylag on Unsplash

This post outlines the implementation of automatic guitar transcription from audio files using Python, TensorFlow, and Keras, and gives a surface-level overview of the methods involved. For training, the GuitarSet data set is employed for its large quantity of isolated guitar recordings with corresponding tabs. Please note that much of the direction in this project was provided by a research poster from NEMISIG 2019 found here.

Background

If you are familiar with Convolutional Neural Networks (CNNs), then you might have heard about their use in image processing and analysis for computer vision. It is this capability of CNNs that we want to harness to output the guitar tab; therefore, it is first necessary to transform the input audio files into spectrogram images using the Constant-Q transform.

Why the Constant-Q Transform?

In order to understand the benefits of using the Constant-Q transform over the Fourier transform to select frequencies and create our input images, we must examine how musical notes are defined:


A musical note is identified by a letter for its pitch class and a number for its octave (for example, C4). Below is a plot of the frequencies (Hz) of the first six octaves of the musical note C.

We can see that each successive octave has twice the frequency of the previous one. Since an octave spans twelve notes, the frequency must double every twelve notes, which can be represented by the following formula [1]:

    f(n) = f0 * 2^(n/12)

where f0 is the frequency of the starting note and n is the number of semitones above it.

By plotting this relationship, we can see that the graph below displays an exponential curve:

Due to this exponential nature, the Constant-Q transform is better suited for fitting musical data than the Fourier transform, as its output is amplitude versus log frequency. The Constant-Q transform's frequency resolution also mimics the human ear, with a higher frequency resolution at the lower frequencies and a lower resolution at the higher frequencies [1].
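
To make the log-frequency spacing concrete, here is a small sketch (not from the original post) showing that Constant-Q bin center frequencies are spaced geometrically, one bin per semitone when using 12 bins per octave:

    import numpy as np

    # Constant-Q bin center frequencies: f_k = f_min * 2**(k / bins_per_octave)
    f_min = 32.70            # C1 in Hz
    bins_per_octave = 12     # one bin per semitone
    freqs = f_min * 2.0 ** (np.arange(96) / bins_per_octave)
    print(freqs[:13])        # the first octave, C1 up to C2 (twice f_min)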

Applying the Constant-Q Transform in Python

The Constant-Q transform can easily be applied to audio files in Python using the libROSA library.
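
The original code snippet is not reproduced here, but a minimal sketch of the idea looks something like the following (the file name, hop length, and bin counts are illustrative assumptions):

    import librosa
    import librosa.display
    import matplotlib.pyplot as plt
    import numpy as np

    audio_path = "guitarset_example.wav"   # hypothetical file from the GuitarSet data set
    start, dur = 0.0, 0.2                  # window start time and duration in seconds
    hop_length = 512

    # Compute the Constant-Q transform of the whole recording once,
    # then slice out the frames that fall inside [start, start + dur)
    y, sr = librosa.load(audio_path, sr=None)
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length, n_bins=96, bins_per_octave=12))
    C_db = librosa.amplitude_to_db(C, ref=np.max)
    first = int(start * sr / hop_length)
    last = int((start + dur) * sr / hop_length)
    window = C_db[:, first:last]

    # Save the window as an image for later use as CNN input
    fig, ax = plt.subplots()
    librosa.display.specshow(window, sr=sr, hop_length=hop_length, ax=ax)
    ax.set_axis_off()
    fig.savefig("cqt_window.png", bbox_inches="tight", pad_inches=0)
    plt.close(fig)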

By stepping through each audio file in the GuitarSet data set from a specified start time (start) for a given duration (dur) and saving the output as an image, we can create the input images necessary to train the CNN. For this project, dur was set to 0.2 seconds and start was incremented from zero to the length of each audio file in steps of that duration, which can yield the following:

Note that before being used as input images to the CNN, the spectrograms were converted to gray-scale.

Training Solutions

For each Constant-Q transform image, there must be a corresponding solution so that the network can adjust its predictions. Luckily, for each audio file the GuitarSet data set contains all the notes played as MIDI values, the time each note begins in the recording, and the duration of each note. Note: The following code snippets were placed in a function so that they could be applied to every 0.2 seconds of audio.

First, the unique notes (retrieved as MIDI values) played during the loaded 0.2 seconds of audio must be extracted from the JAMS annotation files.
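
The original snippet is not shown, but a sketch of this extraction with the jams library might look like the following (the helper name and its signature are illustrative):

    import jams
    import numpy as np

    def unique_midi_notes(jams_path, start, dur):
        # Collect the unique MIDI notes that sound at any point during [start, start + dur)
        jam = jams.load(jams_path)
        notes = []
        for ann in jam.search(namespace='note_midi'):   # one annotation per guitar string
            intervals, values = ann.to_interval_values()
            for (onset, offset), midi in zip(intervals, values):
                if onset < start + dur and offset > start:
                    notes.append(int(round(midi)))
        return np.unique(notes)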

There can be at most six notes playing at one time (a maximum of one note on each string); therefore, code will often be repeated six times.

First, a (6, 18) matrix of MIDI values representing the six strings and 18 fret positions (open string through the 17th fret) of a guitar is created and stored in the variable Fret:

    [[40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57]
     [45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62]
     [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67]
     [55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72]
     [59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76]
     [64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81]]
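
One concise way to build this matrix with NumPy, assuming standard tuning (a sketch, not the original code):

    import numpy as np

    # Open-string MIDI values for standard tuning: E2 A2 D3 G3 B3 E4
    open_strings = np.array([40, 45, 50, 55, 59, 64])

    # Each row spans the open string through the 17th fret (18 positions)
    Fret = open_strings[:, None] + np.arange(18)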

All possible locations on the guitar of the retrieved unique notes were then determined using Fret; the matrix below shows one possible solution:

    [[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
     [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1]
     [0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0]
     [0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0]
     [0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
     [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
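
Assuming Fret and the array of unique notes from the sketches above, a matrix like this can be produced with a single NumPy call (illustrative, not the original code):

    import numpy as np

    # 1 wherever a (string, fret) position's MIDI value matches one of the retrieved notes
    locations = np.isin(Fret, notes).astype(int)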

All possible solutions for the combination of frets and strings must then be determined. To pick among them, a notion of 'finger economy' is used: the lowest note of the chord (the root note) is compared to the rest of the notes, and the number of frets (disregarding the string) each note lies away from the root is summed into a 'finger economy' score. The solution with the lowest score is chosen as the correct chord shape.
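
A rough sketch of this heuristic, reusing the hypothetical notes and Fret from above (the original implementation is not shown, so names and structure are illustrative):

    from itertools import product
    import numpy as np

    def best_shape(notes, Fret):
        # Candidate (string, fret) positions for each note
        candidates = [list(zip(*np.where(Fret == n))) for n in notes]
        root = int(np.argmin(notes))               # index of the lowest (root) note
        best, best_score = None, float('inf')
        for combo in product(*candidates):
            strings = [s for s, f in combo]
            if len(set(strings)) < len(combo):     # at most one note per string
                continue
            root_fret = combo[root][1]
            score = sum(abs(f - root_fret) for s, f in combo)
            if score < best_score:
                best, best_score = combo, score
        return best                                # list of (string, fret) pairs, or None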

Although this method does not always match the exact voicing played in the recording, it does not negatively affect the performance of the CNN, as a C major chord played in the open position does not differ, for this purpose, from a C major played at the 8th fret.

The final solution is then assembled by combining the strings and frets arrays of the chosen shape:

Additionally, a value is placed in the first column of each row: a zero if a note exists on that string (a 1 somewhere in the row) and a one if it does not. This is done so that the softmax function can still choose a category for strings on which no note is being played.
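
Continuing the sketch above, the chosen (string, fret) pairs can be turned into such a matrix as follows (illustrative, not the original code):

    import numpy as np

    label = np.zeros((6, 19), dtype=int)
    label[:, 0] = 1                       # default: string not played
    for string, fret in (best_shape(notes, Fret) or []):
        label[string, 0] = 0
        label[string, fret + 1] = 1       # column 1 = open string, columns 2-18 = frets 1-17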

The previous code snippets return data similar to a one-hot encoding of categories; the following matrix format was returned for each 0.2 seconds of audio:

    [[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
     [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]]

The above matrix is the guitar tab solution for one random 0.2 second selection from the GuitarSet data set. Each matrix has shape (6, 19), where the six rows correspond to the guitar strings (eBGDAE from top to bottom). The first column identifies whether the string is not being played, the second column identifies whether the open string is being played, and the third through nineteenth columns identify the specific fret being played, starting from the first fret. When training, this matrix is broken up into six separate arrays to train each head of the model.
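
As a sketch of that last step, assuming label_matrices is a list holding one (6, 19) matrix per 0.2 second window (the variable names are illustrative):

    import numpy as np

    labels = np.stack(label_matrices)              # shape (N, 6, 19)
    y_train = [labels[:, i, :] for i in range(6)]  # six (N, 19) arrays, one per output head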

Model Architecture and Training

The Keras functional API was used to create the following multi-task classification model, with a 90/10 split between training and test data. The model has six output tasks (the eBGDAE strings), each determining whether its string is not played, open, or fretted at a specific position. Each of these six outputs uses a softmax activation with a categorical cross-entropy loss function, and dropout layers are added to reduce overfitting.
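
The original architecture diagram is not reproduced here, but a minimal sketch of such a six-headed model might look like this (the input size and layer widths are assumptions, not the original values):

    from tensorflow.keras import layers, Model

    inputs = layers.Input(shape=(192, 192, 1))            # gray-scale CQT image (size assumed)
    x = layers.Conv2D(32, (3, 3), activation='relu')(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), activation='relu')(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation='relu')(x)
    x = layers.Dropout(0.5)(x)

    # One 19-way softmax head per string: not played, open, or frets 1-17
    string_names = ['estring', 'Bstring', 'Gstring', 'Dstring', 'Astring', 'Estring']
    outputs = [layers.Dense(19, activation='softmax', name=n)(x) for n in string_names]

    model = Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam',
                  loss=['categorical_crossentropy'] * 6,
                  metrics=['accuracy'])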

The model was run for 30 epochs and the final accuracy for each string was recorded.
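
A corresponding training and evaluation call, assuming the model sketched above, hypothetical X_train/X_test image arrays, and y_train/y_test lists of six (N, 19) one-hot arrays:

    model.fit(X_train, y_train, epochs=30, batch_size=32)   # batch size assumed
    model.evaluate(X_test, y_test)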

Results

Not all of the GuitarSet audio files were used for this model; however, a sufficient number of input files were used, totaling 40,828 training images and 4,537 test samples.

The accuracy of each string was determined to be:

Test accuracy estring: 0.9336566011601407
Test accuracy Bstring: 0.8521049151158895
Test accuracy Gstring: 0.8283006392545786
Test accuracy Dstring: 0.7831165967256665
Test accuracy Astring: 0.8053780030331896
Test accuracy Estring: 0.8514436851100615

This resulted in an average accuracy of 84.23%.

Closing Thoughts

This model is not yet ready to begin creating full-length guitar tablature, as a couple of issues still linger. The current model does not take into account the duration a note is held and will simply repeat the tab for the duration specified in the code. Also, since a chord can have different voicings containing the same notes, the model does not recognize when to use a specific voicing, which may prove inconvenient but is not a significant problem. However, the model's ability to correctly tab audio snippets is a fantastic development.

References

[1] C. Schörkhuber and A. Klapuri, Constant-Q Transform Toolbox for Music Processing (2010), 7th Sound and Music Computing Conference.
