Generate Piano Instrumental Music by Using Deep Learning

Generate Piano Music “♪♪♪” step by step by experimenting with Tensorflow v2.0 Alpha

Haryo Akbarianto Wibowo
Towards Data Science


Hello everyone! Finally, I can write on my Medium again and have free time to do some experiments on Artificial Intelligence (AI). This time, I am going to write about and share how to generate music notes by using Deep Learning. Unlike my previous article about generating lyrics, this time we will generate the notes of the music and also generate the file itself (in MIDI format).

Photo by Malte Wingen on Unsplash

The theme of the music is piano. This article will generate piano notes by using a variant of the Recurrent Neural Network (RNN), the Gated Recurrent Unit (GRU), with the help of self-attention. This article will not only show how to generate the notes, but also how to turn them into a proper MIDI file that can be played on the computer.

This article is targeted at those who are interested in AI, especially those who want to practice using Deep Learning. I hope that my writing skill improves by publishing this article and that the content benefits you 😃.

There is a Github link at the end of this article if you want to see the complete source code. For now, I will provide the Python notebook and the Colaboratory link in the repository.

Here is the opening music

Sound 1 : Opening Piano 😃

(That music is generated by the model that we will create in this article)

Outline

  1. Introduction
  2. Technology and Data
  3. Pipeline
  4. Preprocessing MIDI files
  5. Train Model
  6. Inference and Generate MIDI Files
  7. Results
  8. Conclusion
  9. Afterwords

Introduction

One of the current hot topics in Artificial Intelligence is how to generate something by only using the data (unsupervised). In the Computer Vision domain, many researchers are working on advanced techniques for generating images using Generative Adversarial Networks (GANs). For example, NVIDIA created a realistic face generator by using a GAN. There is also some research on generating music by using GANs.

Photo by Akshar Dave on Unsplash

If we talk about the value of a music generator, it can help musicians create their music and enhance people's creativity. I think in the future, if this field receives a lot of attention, most musicians will create their music assisted by AI.

This article focuses on generating music by generating the sequence of notes in a piece. We will see how to preprocess the data and transform it into the input of a neural network that generates the music.

The experiment will also use Tensorflow v2.0 (still in its alpha phase) as the Deep Learning framework. What I want to show is how to test and use Tensorflow v2.0 by following some of its best practices. One of the features that I like in Tensorflow v2.0 is that it really accelerates the training of the model through AutoGraph, which is enabled by decorating a function with @tf.function. Moreover, there is no tf.Session anymore and no global initialization. The need for these was one of the reasons I moved from Tensorflow to PyTorch; Tensorflow's usability was not good for me. Nevertheless, in my opinion Tensorflow v2.0 changes all of that and improves usability enough to make experimenting comfortable.
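For illustration, here is a minimal sketch (not from the article's repository) of how @tf.function is used in Tensorflow v2.0; any Python function decorated with it is traced into a graph by AutoGraph, with no tf.Session or global initialization needed:

import tensorflow as tf

@tf.function  # traced into a graph on the first call
def scaled_sum(x, y):
    return tf.reduce_sum(x) + tf.reduce_sum(y) * 2.0

result = scaled_sum(tf.ones([3]), tf.ones([3]))
print(result.numpy())  # prints 9.0, no session.run needed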

This experiment also uses a self-attention layer. Given a sequence instance (for example the music notes “C D E F G”), the self-attention layer learns how much influence each token has on every other token. Here is an example (for an NLP task):

Image 1 : Visualization of attention. Taken from : http://jalammar.github.io/illustrated-transformer/

For further information about self-attention, especially about the Transformer, you can read this awesome article.

Without further ado, let's get on with generating the music.

Technology and Data

This experiment will use :

  1. Tensorflow v2.0 : the Deep Learning framework, a new version of Tensorflow that is still in the alpha phase of development.
  2. Python 3.7
  3. Colaboratory : a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. It provides a Tesla K80 GPU or even a TPU! Sadly, Tensorflow v2.0 alpha still does not support TPU at the time of this writing.
  4. The Python library pretty_midi : a library to manipulate and create MIDI files.

For the data, we use MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) from Magenta as the dataset. This dataset only contains piano instruments. We will take 100 pieces at random from around 1,000 to keep our training time shorter.
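As a rough sketch (the exact paths depend on how you download and extract MAESTRO), picking 100 random pieces could look like this:

import glob
import random

# Hypothetical directory name; adjust to wherever the MAESTRO dataset is extracted.
all_midi_paths = glob.glob('maestro-v1.0.0/**/*.midi', recursive=True)
random.shuffle(all_midi_paths)
sampled_midi_paths = all_midi_paths[:100]  # keep 100 pieces at random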

Pipeline

Here is the pipeline for how our music generator will work:

Image 2 : Pipeline

We will go through each of these processes. To make it simpler, we divide them as follows:

  1. Preprocess MIDI Files to be Input of Neural Network
  2. Training Process
  3. Generating MIDI Files

Preprocess MIDI Files

Before we go into how to preprocess the MIDI files, we need to know what the MIDI file format is.

From PCMag, the definition of MIDI:

(Musical Instrument Digital Interface) A standard protocol for the interchange of musical information between musical instruments, synthesizers and computers. MIDI was developed to allow the keyboard of one synthesizer to play notes generated by another. It defines codes for musical notes as well as button, dial and pedal adjustments, and MIDI control messages can orchestrate a series of synthesizers, each playing a part of the musical score. MIDI Version 1.0 was introduced in 1983.

In summary, a MIDI file contains a series of instruments, each with its own notes, for example a combination of piano and guitar. Each instrument usually has different notes to play.

To preprocess the MIDI files, there are several Python libraries we can use. One of them is pretty_midi, which can manipulate MIDI files and also create new ones. In this article, we will use that library.

In pretty_midi, the format of a MIDI file is as follows:

Image 3 : PrettyMidi format

Start is the time (in seconds) at which a note begins. End is the time (in seconds) at which it stops; multiple notes can overlap at the same time. Pitch is the MIDI number of the note played. Velocity is the force with which the note is played.
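For instance (a minimal illustration, not code from the repository), a single note in pretty_midi carries exactly these fields:

import pretty_midi

# A C4 (MIDI number 60) played from 0.5 s to 1.0 s with velocity 100.
note = pretty_midi.Note(velocity=100, pitch=60, start=0.5, end=1.0)
print(note)  # shows the note's start, end, pitch and velocity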

For the relation between MIDI numbers and note names, you can refer to the picture below:

Image 4 : Midi Number with the Note Name. Taken from https://newt.phys.unsw.edu.au/jw/notes.html

Read the MIDI Files

We will read the MIDI files in batches. This is how we read one using pretty_midi:

midi_pretty_format = pretty_midi.PrettyMIDI('song.mid')

This gives us a PrettyMIDI object.

Preprocess to Piano Roll Array

Image 5 : PrettyMidi to Piano Roll Array

For this article, we need to extract all the notes of a piece from one instrument. Many MIDI files contain multiple instruments, but in our dataset each file contains only one instrument: the piano. We will extract the notes from the piano instrument at a desired number of frames per second (fps). pretty_midi has a handy function, get_piano_roll, that returns the notes as a 2D numpy array with dimensions (notes, time). The notes dimension has length 128, and the time dimension follows the duration of the music multiplied by the fps.

Here is how we do it:

midi_pretty_format = pretty_midi.PrettyMIDI(midi_file_name)
piano_midi = midi_pretty_format.instruments[0] # Get the piano channels
piano_roll = piano_midi.get_piano_roll(fs=fs)

Preprocess to dictionary of time and notes

Image 6 : Piano Roll Array to Dictionary

After we get the piano roll array, we convert it into a dictionary. The dictionary starts from the first time step at which a note is played. For example, in the picture above, it starts at 28 (converted to seconds, assuming we built the piano roll at 5 fps, the music starts playing its notes at 5.6 s, which we get from 28 divided by 5).
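A minimal sketch of that conversion (the function name is my own; it assumes piano_roll is the (128, time) array returned by get_piano_roll) could be:

import numpy as np

def piano_roll_to_dict(piano_roll):
    """Map each time step with active notes to the sorted array of pitches played there."""
    dict_note = {}
    # np.nonzero on the (notes, time) array gives the pitch and time index
    # of every cell in the piano roll with a non-zero velocity.
    pitches, times = np.nonzero(piano_roll)
    for time in np.unique(times):
        dict_note[time] = np.sort(pitches[times == time])
    return dict_note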

After we create the dictionary, we convert its values into strings. For example:

array([49,68]) => '49,68'

To do that, we loop over all the keys of the dictionary and change their values:

for key in dict_note:
    dict_note[key] = ','.join(str(note) for note in dict_note[key])

Preprocess to a List of Music Notes as Input and Target of the Neural Network

Image 7 : Dictionary to List of Sequences

After we get the dictionary, we convert it into sequences of notes that will be used as the input of the neural network. The note at the next time step becomes the target for that input.

Image 8 : Sliding window, taken from : https://towardsdatascience.com/generating-drake-rap-lyrics-using-language-models-and-lstms-8725d71b1b12

In this article, the length of each sequence is 50. That means that if our fps is 5, a sequence covers 10 (50 / 5) seconds of playtime.

‘e’ in the list means that no notes are played at that time step, since there are times where there is a jump, or silence, between played notes. In the example in Image 7, we can see that there is a jump from 43 to 46. If we convert that part of the sequence, the list would be:

[ ... '61,77', '61,77', 'e', 'e', '73' , ...]

How do we do it? We process the notes in batches of pieces.

We use a sliding window of length 50. For the first note in a piece, we append ‘e’ to the list 49 times, then set the start time to the first time step in the dictionary (28 in the example in Image 7), and append the first note of that piece (‘77’ in the example).

Then, for the next instance, we slide the window by one: we append ‘e’ 48 times, then the note played at time step 28, then the note at time step 29, and repeat until the end of the piece.

For the next piece, we repeat the process above.

The full source code for this step is in the repository linked at the end of the article.
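For illustration, here is a minimal sketch of the sliding window; the function name and signature are my own, assuming dict_note maps time steps to comma-joined note strings:

def dict_to_sequences(dict_note, seq_len=50):
    """Slide a window of seq_len notes over one piece and collect (input, target) pairs."""
    start_time, end_time = min(dict_note), max(dict_note)
    # Pad with 'e' so the first real note already has a full-length history.
    notes = ['e'] * (seq_len - 1)
    for time in range(start_time, end_time + 1):
        notes.append(dict_note.get(time, 'e'))  # 'e' marks a silent time step

    inputs, targets = [], []
    for i in range(len(notes) - seq_len):
        inputs.append(notes[i:i + seq_len])  # 50 notes as input
        targets.append(notes[i + seq_len])   # the next note as target
    return inputs, targets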

Create Note Tokenizer

Before we dive into the neural network, we must create a tokenizer that changes the sequences of notes into sequences of note indices. First, we map each note to an index that represents its id.

For example:

{
'61,77' : 1, # 61,77 will be identified as 1
'e' : 2,
'73' : 3,
.
.
}

So if our previous input is as below:

[ ... , '61,77', '61,77', 'e', 'e', '73' , ...]

We convert it into:

[ ... 1, 1, 2, 2, 3 , ...]

The full NoteTokenizer implementation is in the repository.
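For illustration, here is a minimal sketch of the idea; the real tokenizer in the repository has more bookkeeping:

class NoteTokenizer:
    """Map each unique note string (e.g. '61,77' or 'e') to an integer id, starting from 1."""

    def __init__(self):
        self.note_to_index = {}
        self.index_to_note = {}

    def fit(self, list_of_notes):
        for note in list_of_notes:
            if note not in self.note_to_index:
                index = len(self.note_to_index) + 1  # reserve 0 for padding
                self.note_to_index[note] = index
                self.index_to_note[index] = note

    def transform(self, list_of_notes):
        return [self.note_to_index[note] for note in list_of_notes]

Fitting it on ['61,77', '61,77', 'e', 'e', '73'] and calling transform on the same list gives [1, 1, 2, 2, 3], as in the example above.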

To summarize, our preprocessing pipeline reads the MIDI file, converts it to a piano roll array, builds the dictionary of time steps to notes, generates the sliding-window sequences, and tokenizes the notes. The helper functions for each step are in the repository.

Train Model

Before we look at how to train with the new features of Tensorflow v2.0, let's look at the architecture:

Neural Network Architecture

Image 9 : Our Neural Network Architecture

So, the Deep Learning architecture uses 3 layers of Gated Recurrent Units (GRU, a variant of the Recurrent Neural Network) together with some self-attention layers. Dropout is used so that the neural network does not overfit too quickly.

For the self-attention layers, we will use this repository and edit it a little so that we can use it with Tensorflow v2.0.

The full model code is in the repository.
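For illustration, here is a rough sketch of the architecture in Keras. The real model uses the edited self-attention layer from the repository mentioned above; here I stand in tf.keras.layers.Attention just to show the overall shape, and the hyperparameters are placeholders:

import tensorflow as tf

def build_model(seq_len, vocab_size, embed_dim=64, rnn_units=256):
    """Sketch: embedding -> 3 GRU layers with self-attention and dropout -> softmax over notes."""
    inputs = tf.keras.Input(shape=(seq_len,))
    x = tf.keras.layers.Embedding(vocab_size + 1, embed_dim)(inputs)

    x = tf.keras.layers.GRU(rnn_units, return_sequences=True)(x)
    x = tf.keras.layers.Attention()([x, x])  # stand-in for the edited self-attention layer
    x = tf.keras.layers.Dropout(0.3)(x)

    x = tf.keras.layers.GRU(rnn_units, return_sequences=True)(x)
    x = tf.keras.layers.Attention()([x, x])
    x = tf.keras.layers.Dropout(0.3)(x)

    x = tf.keras.layers.GRU(rnn_units)(x)  # the third GRU returns only the final state
    outputs = tf.keras.layers.Dense(vocab_size + 1, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)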

Training

We update the weights of the model by iterating over a number of pieces in the dataset and preprocessing them as described above. Then we take a batch of instances as the input and target of the neural network.

We use GradientTape to update the weights of the neural network: first we compute the loss, then we backpropagate by calling apply_gradients. If you are familiar with PyTorch, this is very similar to how PyTorch trains its models.

Be sure to decorate the training function with @tf.function. This converts the function into a graph via AutoGraph and makes our training faster. One downside of tf.function is that it cannot take batches of different sizes as input. For example, if our batch size is 64 and the dataset has 70 instances, the last batch will contain 6 instances; this will throw an exception because the graph receives an input with a different shape from the one it was traced with. It seems the function fixes its input shapes, like placeholders, based on the first input it sees.

In this article, we use BATCH_SONG = 16 and BATCH_NNET_SIZE = 96. That means we take 16 pieces from the list of all pieces and extract their sequences; then, for every training step, we take 96 of the extracted sequence instances as the input and target of the neural network.

The full training loop is in the repository.
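For illustration, here is a minimal sketch of the GradientTape training step; model is the network built earlier, and the batching of 16 pieces / 96 sequences happens in the outer loop that calls this function:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

@tf.function  # traced into a graph; keep every batch the same size
def train_step(inputs, targets):
    with tf.GradientTape() as tape:
        predictions = model(inputs, training=True)  # (batch, vocab) probabilities
        loss = loss_fn(targets, predictions)
    # Backpropagation: compute the gradients and let the optimizer update the weights.
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss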

Inference and Generate MIDI Files

Image 10 : Inference and Generate MIDI files

There are two ways to seed the generation of a MIDI file with our trained neural network model; we need to choose one at the start:

  1. We generate 50 random notes as the start of the music.
  2. We use 49 empty notes (‘e’) followed by a start note of our choice (for example ‘72’; make sure the note is in the NoteTokenizer).
Image 11 : Visualization on how the generator works

After we choose the seed for our music generator, we predict the next note from the 50 seed notes using our trained model, treating the predicted values as a probability distribution from which the next note is sampled. We repeat this until we reach the designated maximum sequence length, then drop the first 50 notes.

After we generate the list of notes, we convert it back into a piano roll array, then transform that into a PrettyMIDI object.

After that, we tweak the velocity and tempo of the music and finally we generate the MIDI file.

The generation code is in the repository.
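For illustration, here is a minimal sketch of the sampling loop (the function name is my own, and tokenizer is the NoteTokenizer sketched earlier):

import numpy as np

def generate_notes(model, tokenizer, seed_ids, max_generate=300, seq_len=50):
    """Sample notes one at a time, using the model's softmax output as the sampling distribution."""
    generated = list(seed_ids)  # 50 seed ids: random notes, or 49 'e' plus one chosen note
    for _ in range(max_generate):
        window = np.array([generated[-seq_len:]])  # shape (1, 50)
        probabilities = model.predict(window)[0]   # softmax over the note vocabulary
        probabilities /= probabilities.sum()       # renormalize against float rounding
        next_id = np.random.choice(len(probabilities), p=probabilities)
        generated.append(next_id)
    # Drop the 50 seed notes and map ids back to note strings such as '61,77' or 'e'.
    return [tokenizer.index_to_note.get(i, 'e') for i in generated[seq_len:]]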

The code for writing the MIDI file from the generated notes is also in the repository.
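Here is a simplified sketch of that step; it writes each time step directly as pretty_midi notes and skips merging consecutive identical notes, whereas the real code goes through the piano roll array first:

import pretty_midi

def notes_to_midi(generated_notes, output_path, fs=5, velocity=100):
    """Turn generated note strings back into a playable MIDI file."""
    midi = pretty_midi.PrettyMIDI()
    program = pretty_midi.instrument_name_to_program('Acoustic Grand Piano')
    piano = pretty_midi.Instrument(program=program)

    for step, note_string in enumerate(generated_notes):
        if note_string == 'e':  # 'e' means silence at this time step
            continue
        start, end = step / fs, (step + 1) / fs  # each step lasts 1 / fs seconds
        for pitch in note_string.split(','):
            piano.notes.append(pretty_midi.Note(velocity=velocity, pitch=int(pitch), start=start, end=end))

    midi.instruments.append(piano)
    midi.write(output_path)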

Results

In my run, the training took about 1 hour per epoch, and I decided to train for 4 epochs (4 hours).

Here are the results from the model after training for 4 epochs:

Generate From Random 50 Notes

Sound 2
Sound 3

Generate From a Note

Sound 4
Sound 5

(Note that these are MP3 files converted from the MIDI files; I used an online converter. The notes sound slightly off compared to the originals. I will upload the original MIDI files to the repository if you want to hear them.)

There is a noticeable difference between these generated pieces. When we generate from a single note, the piece starts at a slow tempo; when we generate from 50 random notes, there is no slow start.

This is the visualization of the self-attention blocks on the last sequence of a piece generated from 50 random notes:

First Attention

Image 12 : First Self Attention

Second Attention

Image 13 : Second Self Attention

As you can see, the first self-attention block learns which notes to focus on for every note in a sequence instance, yet the second attention block shows no clear pattern of what to focus on. We can also tell that if another note's position is very far from the current note, the block does not focus on it (the black regions in Image 12 and Image 13).

Conclusion

We have built a tool that generates piano music using the MAESTRO dataset. We preprocessed the data, trained our neural network model, and then generated music with it in MIDI format. We used Tensorflow v2.0 to do it, and I think the Tensorflow v2.0 user experience (UX) is better than in its previous version.

The music generated by our model is reasonably coherent and pleasant to hear, and it adapts how it plays its notes. For example, when the generator starts from a single note (meaning it is the start of a piece), it begins at a slow tempo.

There are several things we could still try with the music generator. In this article, we experimented with generating a single instrument. What if the music has multiple instruments? That would require a better architecture. There are many more experiments we could run on music data.

Afterwords

Photo by Alan Chen on Unsplash

That's it for this article about generating piano music notes. I was actually inspired to write it by looking back at my first article about Deep Learning, which was about generating song lyrics. “How about generating the music notes?” I experimented with it and, well… it works.

There were some struggles in doing this experiment. First, I needed to find a file format that is easy to preprocess and to feed into a neural network; I found that MIDI is easy to work with and has a small file size. Then, I needed to know whether there are libraries that can preprocess such files in Python. I found two whose repositories are not outdated, music21 and pretty_midi, and I chose pretty_midi. Maybe because it has 'pretty' in its name 😝. Finally, I needed to think about how to preprocess the notes. Thankfully, pretty_midi has the handy get_piano_roll function to make it easier.

I also haven't read many research papers about music generation. Maybe there are papers that could be reproduced and are feasible to run in Colaboratory.

I'm sorry for the limited visualization of the self-attention layers.

I welcome any feedback that can improve me and this article. I'm still in the process of learning to write and learning about Deep Learning, so I appreciate feedback that helps me get better. Just make sure to give it in a proper manner 😄.

See ya in my next article!

Source : https://cdn.pixabay.com/photo/2017/07/10/16/07/thank-you-2490552_1280.png

Repository

Source

https://cs224d.stanford.edu/reports/allenh.pdf
