Jazz night by Alex Zamora


Jazz music generation using GPT

Generating jazz music using GPT and piano roll encoding methods

Jan 6, 2021


The Generative Pre-trained Transformer, or GPT, model has achieved astonishing results on Natural Language Processing (NLP) tasks. However, the architecture is not exclusive to NLP and has been used to solve other problems such as time-series prediction or music generation. In this article, I will share the approach that I used to generate music (specifically Jazz) using a very simple version of the GPT model.

Table of Contents

  • Background
  • Piano roll encoding approach
  • Data preprocessing and analysis
  • GPT Model
  • Training
  • Inferencing
  • Web App
  • Conclusion

Background

The music generation task has been tackled in the past using deep neural networks such as RNNs (specifically LSTMs), CNNs and, most recently, Transformers. In some ways, the approach to solving this problem is greatly influenced by NLP, as the structure between musical notes in a song and words in a paragraph is fairly similar. The main difference when dealing with music compared to text is in the information encoding step, which we will explore later on.

In addition, the reason I chose the Jazz genre is its unpredictability (and also Jazz is one of my favourite music genres). As they often say, there is no “wrong” note in Jazz, especially during improvisation, so I was curious to see what the generated music would sound like when trained on a Jazz dataset. For this project, we will only focus on generating Jazz for the piano.

All of the preprocessing and training notebooks can be found at the end!

Piano roll encoding approach

Generally, MIDI objects contain quite a lot of information, and depending on which library you use, you can extract different data (or the same data in different formats) from a MIDI file. A MIDI note, on the other hand, is quite standard, as it is the building block of a MIDI song. It contains the following information: velocity, pitch, start time and end time.

As mentioned previously, since we are adapting an NLP model to the music generation task, we need to encode this MIDI note information in a way that is similar to how words are encoded in NLP (usually index-based encoding followed by a word embedding). However, for MIDI we have up to four features to represent per note, including the time features.

Index-based encoding and embedding for words — Text Encoding: A Review [1]

To solve this, we decided to convert the MIDI into piano roll format with a sampling interval of one 16th note. The piano roll is then a 2D array of size (song_len, 128), where song_len is the total number of 16th notes in the song and 128 is the number of possible pitches in a MIDI song.

An example of a MIDI stream (left) being converted into Piano roll array (right)

This data encoding approach represents the music notes at a constant time interval, thus allowing us to represent the whole song as a compact 2D array. From here, we can follow a similar approach to word encoding: index-encode every combination of pitches and then feed the indices into an embedding layer.

We decided not to include the velocity feature, as this would cause our pitch-combination vocabulary to explode. The 16th note was the optimal interval, as it represents the musical details accurately enough while keeping our piano roll array from getting too stretched out.
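To make the encoding concrete, here is a small illustrative helper (column_to_word() is a hypothetical name, not taken from the notebooks) that turns one 16th-note time step of the piano roll into a pitch-combination “word”:

```
import numpy as np

def column_to_word(piano_roll_column):
    """Convert one 16th-note time step (a 128-dim velocity column)
    into a hashable 'word' of active pitches, ignoring velocity."""
    active_pitches = np.nonzero(piano_roll_column)[0]
    # e.g. a C major triad becomes the word "60-64-67"
    return "-".join(str(p) for p in active_pitches) if len(active_pitches) else "rest"

# Example: a time step with C4, E4 and G4 held down
step = np.zeros(128)
step[[60, 64, 67]] = 80          # velocities are discarded by the encoding
print(column_to_word(step))      # -> "60-64-67"
```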

Having understood the approach, let’s dive into the code!

Data preprocessing and analysis

For our dataset, we chose the Doug McKenzie Jazz Piano dataset. Although it contains only about 200 MIDI files, it covers a great variety of popular jazz songs, and the piano parts are generally clean, cohesive and have very few missing sections.

Since all the songs in the dataset have different key signatures and are played at different BPMs, we carry out data preprocessing to normalise these features. This normalisation step is important: it not only allows the model to understand the structure and patterns of the songs better but also helps reduce the vocabulary size for our model later on.

We used the Python libraries pretty_midi and music21 to aid the data parsing and processing steps. To extract the piano part, we selected the stream containing the largest number of notes (as this is usually the piano stream). The extract_midi_info() function helps us get the key signature we need to offset by, as well as the bpm for the piano roll.
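Here is a rough sketch of what extract_midi_info() could look like, assuming pretty_midi is used for the tempo and music21 for the key analysis; the actual notebook implementation may differ in the details.

```
import pretty_midi
from music21 import converter

def extract_midi_info(midi_path):
    """Estimate the song's bpm and the semitone offset needed to transpose it to C."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    # Use the first tempo change as the song's bpm (most songs have a single tempo)
    _, tempi = pm.get_tempo_changes()
    bpm = float(tempi[0]) if len(tempi) else 120.0

    # Analyse the key with music21 and compute the shift to C major / A minor
    score = converter.parse(midi_path)
    key = score.analyze('key')
    tonic = key.tonic if key.mode == 'major' else key.relative.tonic
    offset = -tonic.pitchClass           # shift down so the tonic becomes C

    return bpm, offset
```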

The preprocess_midi() function gives us the piano roll array using pretty_midi's get_piano_roll function.
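A minimal sketch of preprocess_midi(), assuming we binarise the roll and sample at bpm / 15 columns per second (one column per 16th note); the instrument selection and transposition details shown here are assumptions.

```
import numpy as np
import pretty_midi

def preprocess_midi(midi_path, bpm, offset):
    """Return a (song_len, 128) binary piano roll sampled every 16th note,
    transposed to C. The piano part is taken as the instrument with the most notes."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    piano = max(pm.instruments, key=lambda inst: len(inst.notes))

    # Transpose every note so all songs share the same key
    for note in piano.notes:
        note.pitch = int(np.clip(note.pitch + offset, 0, 127))

    # One 16th note lasts 60 / bpm / 4 seconds, so sample at bpm / 15 columns per second
    fs = bpm / 15.0
    roll = piano.get_piano_roll(fs=fs)          # shape (128, song_len)
    return (roll.T > 0).astype(np.uint8)        # binarise, transpose to (song_len, 128)
```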

For the last step of preprocessing, we loop through all the MIDI files in our dataset, parse and preprocess each one, and save the resulting piano roll arrays in .npy format.
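Putting it together, the preprocessing loop could look roughly like this (the folder names are hypothetical):

```
import glob
import os
import numpy as np

OUT_DIR = "piano_rolls"                              # hypothetical output folder
os.makedirs(OUT_DIR, exist_ok=True)

for midi_path in glob.glob("dataset/*.mid"):         # hypothetical dataset folder
    try:
        bpm, offset = extract_midi_info(midi_path)
        roll = preprocess_midi(midi_path, bpm, offset)
        name = os.path.splitext(os.path.basename(midi_path))[0]
        np.save(os.path.join(OUT_DIR, name + ".npy"), roll)
    except Exception as exc:                         # a few MIDIs in the wild fail to parse
        print(f"Skipping {midi_path}: {exc}")
```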

GPT Model

Having encoded our data, we can now feed it into the GPT architecture to train an autoregressive model. If you are not sure how GPT works, I recommend reading this blog post [2] by Jay Alammar; it is extremely detailed and insightful, and I learned a lot from it.

To briefly summarise, GPT utilises only the decoder blocks of the transformer architecture and stacks them on top of one another to increase the capacity of the network.

GPT architecture consists of stacked transformer decoder blocks — Improving Language Understanding by Generative Pre-Training

The following code for the GPT model was referenced from Text Generation with miniature GPT [3] by Apoorv Nandan.

For the positional part of the token-and-position embedding, we used the sine and cosine functions from the original Transformer.

Positional encoding — Attention Is All You Need [4]
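Below is a sketch of the token-and-position embedding layer, following the naming of the Keras miniature GPT example [3] but with a fixed sinusoidal table for the positions, which is consistent with the 5,120,000 parameters (40,000 × 128) reported for this layer in the model summary further down. Treat the details as an approximation of the notebook code.

```
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def sinusoidal_encoding(maxlen, embed_dim):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = np.arange(maxlen)[:, None]
    i = np.arange(embed_dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / embed_dim)
    pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return tf.constant(pe, dtype=tf.float32)

class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(vocab_size, embed_dim)
        self.pos_enc = sinusoidal_encoding(maxlen, embed_dim)   # fixed, not trained

    def call(self, x):
        seq_len = tf.shape(x)[-1]
        return self.token_emb(x) + self.pos_enc[:seq_len]
```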

Self-attention with causal masking
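A sketch of the causal mask, adapted from the Keras example [3]: each position may only attend to itself and earlier positions, so future notes never leak into the prediction of the current one.

```
import tensorflow as tf
from tensorflow.keras import layers

def causal_attention_mask(seq_len, dtype=tf.bool):
    """Lower-triangular mask: position i may only attend to positions <= i."""
    i = tf.range(seq_len)[:, None]
    j = tf.range(seq_len)
    return tf.cast(i >= j, dtype)[tf.newaxis, ...]   # shape (1, seq_len, seq_len)

# Usage: MultiHeadAttention applies the mask when computing attention scores
mha = layers.MultiHeadAttention(num_heads=4, key_dim=32)
x = tf.random.uniform((1, 600, 128))
out = mha(x, x, attention_mask=causal_attention_mask(600))   # shape (1, 600, 128)
```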

Transformer block
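A sketch of the transformer block, again following [3]. The embedding size of 128 and feed-forward size of 128 are consistent with the 99,584 parameters per block in the model summary below; the number of heads and the dropout rate are assumptions. It reuses the causal_attention_mask() helper from the previous sketch.

```
import tensorflow as tf
from tensorflow.keras import layers

class TransformerBlock(layers.Layer):
    """Decoder-style block: causal self-attention + feed-forward network,
    each with a residual connection, layer norm and dropout."""
    def __init__(self, embed_dim=128, num_heads=4, ff_dim=128, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads, embed_dim // num_heads)
        self.ffn = tf.keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)
        self.drop1 = layers.Dropout(rate)
        self.drop2 = layers.Dropout(rate)

    def call(self, inputs):
        seq_len = tf.shape(inputs)[1]
        mask = causal_attention_mask(seq_len)
        attn_out = self.att(inputs, inputs, attention_mask=mask)
        out1 = self.norm1(inputs + self.drop1(attn_out))
        ffn_out = self.ffn(out1)
        return self.norm2(out1 + self.drop2(ffn_out))
```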

The final model.
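A sketch of how the final model could be assembled: a 600-token input, the embedding layer, three stacked transformer blocks and a dense layer producing logits over the 40,000-token vocabulary (the classes defined in the previous sketches are reused here).

```
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 40000   # top pitch-combination tokens, incl. the <unk> and <pad> tokens
maxlen = 600         # sequence length in 16th notes
embed_dim = 128

inputs = layers.Input(shape=(maxlen,), dtype=tf.int32)
x = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)(inputs)
for _ in range(3):                        # three stacked decoder blocks
    x = TransformerBlock(embed_dim=embed_dim)(x)
outputs = layers.Dense(vocab_size)(x)     # logits over the next-token vocabulary
model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()
```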

Model summary:

Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 600)] 0
_________________________________________________________________
token_and_position_embedding (None, 600, 128) 5120000
_________________________________________________________________
transformer_block (Transform (None, 600, 128) 99584
_________________________________________________________________
transformer_block_1 (Transfo (None, 600, 128) 99584
_________________________________________________________________
transformer_block_2 (Transfo (None, 600, 128) 99584
_________________________________________________________________
dense_18 (Dense) (None, 600, 40000) 5160000
=================================================================
Total params: 10,578,752
Trainable params: 10,578,752
Non-trainable params: 0

Training

To train our model, we first need to create training inputs and outputs from our dataset. We start by assigning each unique pitch combination to an integer, as mentioned in the previous part. For our vocabulary, we only consider the 40,000 most common note combinations, including the unknown and padding tokens. Lastly, we tokenise the piano roll arrays of our input data.
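A sketch of the vocabulary building and tokenisation step, reusing the hypothetical column_to_word() helper and the .npy files from the preprocessing sketches; the reserved ids for padding and unknown are assumptions.

```
import glob
from collections import Counter
import numpy as np

PAD_ID, UNK_ID = 0, 1
VOCAB_SIZE = 40000

piano_rolls = [np.load(p) for p in glob.glob("piano_rolls/*.npy")]

# Count every pitch-combination "word" across the dataset
counter = Counter()
for roll in piano_rolls:
    counter.update(column_to_word(step) for step in roll)

# Keep the most common combinations; ids 0 and 1 are reserved for padding and unknown
vocab = [w for w, _ in counter.most_common(VOCAB_SIZE - 2)]
word_to_id = {w: i + 2 for i, w in enumerate(vocab)}
id_to_word = {i: w for w, i in word_to_id.items()}

def tokenise(roll):
    """Map each 16th-note time step to its integer token, falling back to <unk>."""
    return [word_to_id.get(column_to_word(step), UNK_ID) for step in roll]

token_lists = [tokenise(roll) for roll in piano_rolls]
```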

We also create a custom generator. The input to the model is a sequence of tokens of fixed sequence_length, and the output is the same sequence shifted one token to the right. For each epoch, we iterate through all of the songs in our dataset; for each song we pick an input and output sequence pair starting at a random position, applying padding if needed.
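A sketch of such a generator (the function name and padding details are assumptions):

```
import random
import numpy as np

def sequence_generator(token_lists, seq_len=600, batch_size=32, pad_id=0):
    """Yield (input, target) batches: the target is the input shifted one step right."""
    while True:
        batch_x, batch_y = [], []
        for tokens in token_lists:
            # Pick a random window of seq_len + 1 tokens from the song
            start = random.randint(0, max(0, len(tokens) - seq_len - 1))
            window = tokens[start:start + seq_len + 1]
            # Pad short songs so every sample has the same length
            window = window + [pad_id] * (seq_len + 1 - len(window))
            batch_x.append(window[:-1])
            batch_y.append(window[1:])
            if len(batch_x) == batch_size:
                yield np.array(batch_x), np.array(batch_y)
                batch_x, batch_y = [], []
```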

Finally, we are ready to train! We trained for 1500 epochs with a batch size of 32 and a sequence length of 600.
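A sketch of the training call, reusing the model and generator from the previous sketches; the Adam optimizer and the sparse categorical cross-entropy loss on logits are assumptions, as the article does not state them.

```
import tensorflow as tf

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

batch_size, seq_len, epochs = 32, 600, 1500
model.fit(
    sequence_generator(token_lists, seq_len=seq_len, batch_size=batch_size),
    steps_per_epoch=max(1, len(token_lists) // batch_size),
    epochs=epochs,
)
```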

Inferencing

To predict a song, we need to feed the model some starting notes and pad them to match the training sequence_length of 600.
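A sketch of how this autoregressive generation loop could look; the sampling temperature and the helper name are assumptions.

```
import numpy as np
import tensorflow as tf

def generate(model, seed_tokens, num_steps, seq_len=600, pad_id=0, temperature=1.0):
    """Autoregressively extend a seed: pad the context to seq_len, predict the
    next token, sample it, append it and repeat."""
    tokens = list(seed_tokens)
    for _ in range(num_steps):
        window = tokens[-seq_len:]
        padded = window + [pad_id] * (seq_len - len(window))
        logits = model.predict(np.array([padded]), verbose=0)[0]   # (seq_len, vocab)
        probs = tf.nn.softmax(logits[len(window) - 1] / temperature).numpy()
        probs = probs / probs.sum()                                # guard against rounding
        tokens.append(int(np.random.choice(len(probs), p=probs)))
    return tokens
```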

After getting the piano roll, we can subsequently convert it back to MIDI (a sketch of this step follows the samples below). Here are a few inference samples in which the first 5 or 10 seconds are the seed.

This sample uses a 10-second seed
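For completeness, here is a rough sketch of how the predicted tokens could be written back to MIDI with pretty_midi, rendering each token as one 16th note (a simplification: held notes are split into repeated 16th notes); id_to_word is the inverse of the vocabulary mapping from the training sketches.

```
import pretty_midi

def tokens_to_midi(tokens, id_to_word, bpm=120, out_path="generated.mid"):
    """Write predicted tokens back to a MIDI file, one 16th note per token."""
    step = 60.0 / bpm / 4                           # duration of a 16th note in seconds
    pm = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)       # acoustic grand piano
    for t, tok in enumerate(tokens):
        word = id_to_word.get(tok, "rest")
        if word == "rest":
            continue
        for p in word.split("-"):
            piano.notes.append(pretty_midi.Note(
                velocity=80, pitch=int(p), start=t * step, end=(t + 1) * step))
    pm.instruments.append(piano)
    pm.write(out_path)
```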

Web App

My group and I also created a web app to demonstrate our model's capability. We used React JS for the web app and a Flask server for the inferencing task. To deploy the product, we used Google Cloud services to host both the React web app and the Flask inference server.
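The inference server could be as simple as a single Flask endpoint along these lines; the route name and payload format are assumptions, and generate() is the helper from the inference sketch above.

```
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/generate", methods=["POST"])
def generate_endpoint():
    # The React front end posts a seed (a list of token ids) and how many steps to generate
    payload = request.get_json()
    seed = payload.get("seed", [])
    steps = int(payload.get("steps", 200))
    tokens = generate(model, seed, num_steps=steps)
    return jsonify({"tokens": tokens})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```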

Here is a short video demo of the web app. Unfortunately, the app itself is no longer available, as my Google Cloud Platform credits have run out :(

Demo of the web app
A snapshot of our web app

Conclusion

Despite a small dataset of roughly 200 songs, we managed to develop a model that generates jazz music quite well. Nevertheless, there were still some signs of overfitting during evaluation, even after we increased the dropout rate; we believe this problem would subside given a larger dataset.

Lastly, this was my first Data Science project and it really taught me a lot. The project was done for the Computational Data Science module. Our group consisted of Elliot Koh, Sean Lim, Sidharth Praveen and me. Special thanks to my groupmates and my professors Dorien Herremans and Soujanya Poria for their help and guidance.

Source code:
- Notebook for preprocessing
- Notebook for training and inferencing
- Project Github

References:
[1] Silipo, Rosaria. “Text Encoding: A Review,” November 21, 2019. https://towardsdatascience.com/text-encoding-a-review-7c929514cccf.

[2] Alammar, Jay. “The Illustrated GPT-2 (Visualizing Transformer Language Models).” The Illustrated GPT-2 (Visualizing Transformer Language Models), August 19, 2019. http://jalammar.github.io/illustrated-gpt2/.

[3] Nandan, Apoorv. “Keras Documentation: Text Generation with a Miniature GPT.” Text generation with a miniature GPT, May 29, 2020. https://keras.io/examples/generative/text_generation_with_miniature_gpt/.

[4] Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. “Attention Is All You Need.” arXiv preprint arXiv:1706.03762 (2017).
