
Music Generation with ConvLSTMs

With Playable AI-Generated Music Files

Photo by Rajesh Kavasseri on Unsplash

This project is a continuation of my previous project, born from an idea for generating even better results. Instead of using GANs, I will use LSTMs. This makes training faster and allows for an objective evaluation function to measure the model’s performance.

Trained on music written by human composers, the model learns to use the past notes to predict the next note. As a result, the model finds a pattern between past notes and the next note, which allows for the generation of original music.

Here is a generated extract that I particularly enjoyed:

LSTM 2 is interesting because the ending of it sounds like it is about to enter a perfect cadence, with the chords going IV-V-I. The chords before this possible sequence are also very characteristic.

With this music at the back of your mind, let’s start constructing the program:

Data Preprocessing:

The first step in any machine learning project is data preprocessing. For this project, it consists of two steps:

Access Midi Files:

I found a dataset of classical compositions online, scraped from a website. I extracted all the midi files and put them into a folder.

Convert Midi Files into Images:

I found a GitHub page with two programs that use the music21 library to convert midi files into images and back.

Each note can be represented as a white block. The height of the block defines the pitch, and the length defines how long the note is played.
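To make this piano-roll representation concrete, here is a minimal sketch of my own (not part of the original conversion programs) that draws a few hypothetical (pitch, start, duration) notes onto a 106-row array:

import numpy as np

# hypothetical notes: (pitch row, start column, duration in columns)
notes = [(60, 0, 4), (64, 4, 4), (67, 8, 8)]

roll = np.zeros((106, 16), dtype=np.uint8)     # 106 pitches, 16 timesteps
for pitch, start, duration in notes:
    roll[pitch, start:start + duration] = 255  # white block = note is sounding

# Image.fromarray(roll) would now show each note as a white bar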

I then wrote a script to integrate these two programs with my midi files, to create new images in a different directory:

import os
import numpy as np

path = 'XXXXXXXXX'   # directory that contains the midi files
os.chdir(path)
midiz = os.listdir()
midis = []
for midi in midiz:
    midis.append(os.path.join(path, midi))   # full path to each midi file

This script goes to the midi directory and adds all the midi file paths to a list, to be accessed later.

from music21 import midi

mf = midi.MidiFile()
mf.open(midis[0]) 
mf.read()
mf.close()
s = midi.translate.midiFileToStream(mf)
s.show('midi')

This script opens the first midi file and plays it to make sure that the program is working. Note that s.show('midi') may not work in a non-interactive environment.
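If s.show('midi') does nothing in your environment, one workaround (my own suggestion, not from the original article) is to write the parsed stream back to a midi file and play that file with any midi player:

# write the stream to disk instead of relying on an interactive player
s.write('midi', fp='check.mid')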

import os
import numpy as np
import py_midicsv as pm

os.chdir(path)
midiz = os.listdir()
midis = []
for midi in midiz:
    midis.append(os.path.join(path, midi))

new_dir = 'XXXXXXXX'
from PIL import Image   # needed for opening and resizing the generated images

for midi in midis:
    try:
        # midi2image comes from the GitHub program mentioned above
        midi2image(midi)
        basewidth = 106
        img_path = os.path.basename(midi).replace(".mid", ".png")
        img_path = os.path.join(new_dir, img_path)
        print(img_path)
        img = Image.open(img_path)
        hsize = 106
        img = img.resize((basewidth, hsize), Image.ANTIALIAS)  # Image.LANCZOS on newer Pillow versions
        img.save(img_path)
    except:
        pass  # skip any midi file that fails to convert

This script uses the midi2image function from the GitHub page to convert all the midi files, given their paths. The images are also resized to (106, 106). Why 106? 106 is the height of each generated image, as this is the number of possible notes in the piano-roll representation. It is also much easier to work with square inputs for the convolutional layers.

Construct Dataset:

import os
import numpy as np
from PIL import Image

imgs = os.listdir()   # assumes the current directory is the image folder created above
pixels = []
for img in imgs:
    try:
        # rotate so that rows correspond to timesteps instead of pitches
        im = Image.open(img).rotate(90)
        data = np.array(im.getdata()) / 255
        pix = data.reshape(106, 106)
        pixels.append(pix)
    except:
        pass  # skip anything that is not a valid 106x106 image

This script goes through the directory and records all the image data. Note that every image is rotated 90 degrees so that the getdata function returns the pixels in time order instead of pitch order.
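At this point, pixels should be a list containing one 106×106 array per piece, with rows as timesteps after the rotation. A quick check (my addition) makes that explicit:

print(len(pixels))       # number of pieces successfully loaded
print(pixels[0].shape)   # (106, 106): 106 timesteps x 106 pitches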

def split_sequences(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        end_ix = i + n_steps
        if end_ix > len(sequence)-1:
            break
        seq_x, seq_y = sequence[i:end_ix-1], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)
X = []
y = []
for i in range(len(pixels)):
    mini_x,mini_y = split_sequences(pixels[i],10)
    X.append(mini_x)
    y.append(mini_y)

This script constructs the X and y lists from the time series data. The variable mini_x is used so that the X list consists of individual sets of 9 notes, and the y list consists of the individual notes that map to each set of 9.
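As a quick sanity check (assuming each piece is a full 106-timestep piano roll, as above), you can inspect the shapes returned for a single piece:

mini_x, mini_y = split_sequences(pixels[0], 10)
print(mini_x.shape)   # (96, 9, 106): 96 windows of 9 past timesteps
print(mini_y.shape)   # (96, 106): one 106-note target vector per window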

X = np.array(X)
y = np.array(y)

Converting both the X and y lists into NumPy arrays ensures there are no shape errors when the data is fed into the model.

X = X.reshape((X.shape[0]*X.shape[1],1,9,106))
y = y.reshape((y.shape[0]*y.shape[1],y.shape[2]))

This script reshapes the X and y arrays so that they fit the ConvLSTM. Each y value is a vector of 106 ones and zeros: a 1 denotes that a note is played at that pitch, while a 0 denotes that no note of that pitch is played during that timestep.
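To make the target format concrete, here is a small check (assuming the X and y arrays from above) that recovers the pitches that are active in one target timestep:

print(X.shape)                   # (number_of_windows, 1, 9, 106)
print(y.shape)                   # (number_of_windows, 106)
print(np.where(y[0] > 0.5)[0])   # indices of the pitches played in the first target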

Constructing the ConvLSTM:

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Dropout,BatchNormalization
from keras.layers import LSTM,TimeDistributed
from keras.layers.convolutional import Conv1D,MaxPooling1D

model = Sequential()
# convolutional feature extractor, applied to every timestep
model.add(TimeDistributed(Conv1D(filters=128, kernel_size=1, activation='relu'), input_shape=(None, 9, 106)))
model.add(TimeDistributed(MaxPooling1D(pool_size=2, strides=None)))
model.add(TimeDistributed(Conv1D(filters=128, kernel_size=1, activation='relu')))
model.add(TimeDistributed(MaxPooling1D(pool_size=2, strides=None)))
model.add(TimeDistributed(Conv1D(filters=128, kernel_size=1, activation='relu')))
model.add(TimeDistributed(MaxPooling1D(pool_size=2, strides=None)))
model.add(TimeDistributed(Conv1D(filters=128, kernel_size=1, activation='relu')))
model.add(TimeDistributed(Flatten()))
# recurrent layers model the order of the extracted features
model.add(LSTM(128,return_sequences = True))
model.add(LSTM(64))
model.add(BatchNormalization())
# one sigmoid output per possible pitch: played (1) or not played (0)
model.add(Dense(106,activation = 'sigmoid'))
model.compile(optimizer='adam', loss='mse')

This is the full model architecture used. I have found this model to be very versatile: it is the same model I used to predict stock prices from historical data. The main difference is that the last layer uses the sigmoid function with 106 nodes, as each timestep must be described by 106 notes and whether or not each will be played.

model.summary()

When we call this function to see the model architecture, we get this:

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
time_distributed_25 (TimeDis (None, None, 9, 128)      13696     
_________________________________________________________________
time_distributed_26 (TimeDis (None, None, 4, 128)      0         
_________________________________________________________________
time_distributed_27 (TimeDis (None, None, 4, 128)      16512     
_________________________________________________________________
time_distributed_28 (TimeDis (None, None, 2, 128)      0         
_________________________________________________________________
time_distributed_29 (TimeDis (None, None, 2, 128)      16512     
_________________________________________________________________
time_distributed_30 (TimeDis (None, None, 1, 128)      0         
_________________________________________________________________
time_distributed_31 (TimeDis (None, None, 1, 128)      16512     
_________________________________________________________________
time_distributed_32 (TimeDis (None, None, 128)         0         
_________________________________________________________________
lstm_6 (LSTM)                (None, None, 128)         131584    
_________________________________________________________________
lstm_7 (LSTM)                (None, 64)                49408     
_________________________________________________________________
batch_normalization_2 (Batch (None, 64)                256       
_________________________________________________________________
dense_2 (Dense)              (None, 106)               6890      
=================================================================
Total params: 251,370
Trainable params: 251,242
Non-trainable params: 128
_________________________________________________________________

You can toy around with the hyperparameters in the model to see how they affect the results.

model.fit(X,y,epochs = 100)

This script trains the model for 100 epochs. Note that no validation data is needed: the model does not have to predict the next note with perfect accuracy. It simply has to learn a pattern from the source music so that it can produce a plausible next note.
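Training can take a while, so it may be worth saving the fitted model once training finishes (a convenience step I am adding here, not from the original article; the filename is arbitrary):

model.save('music_convlstm.h5')   # reload later with keras.models.load_model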

Hearing Results:

song_length = 106
# seed the generation with a random window from the dataset
data = X[np.random.randint(len(X))][0]
song = []
for i in range(song_length):
    pred = model.predict(data.reshape(1,1,9,106))[0]
    notes = (pred.astype(np.uint8)).reshape(106)
    print(notes)
    song.append(notes)
    # slide the window: append the new prediction, drop the oldest timestep
    data = list(data)
    data.append(notes)
    data.pop(0)
    data = np.array(data)

This script allows the model to generate its own music from its own previous predictions. This is how it works:

To kickstart the process, a random instance from the dataset is needed for the model to make its first prediction. The data list functions like a deque: every time a new prediction is made, it is appended to the end of the list, and the first element is removed so that the input keeps the same shape on every iteration. Because only the predictions are collected into song, the generated piece is purely computer-generated and contains none of the notes from the original seed instance.

new_image = Image.fromarray(np.array(song)).rotate(-90)
new_image.save('composition.png')

After that, we rotate the image back by 90 degrees so that the image2midi function works properly.

image2midi('composition.png')

We then convert the image into a midi file, and we can use the code below to listen to it (at least in a Colab notebook):

!apt install fluidsynth
!cp /usr/share/sounds/sf2/FluidR3_GM.sf2 ./font.sf2
!fluidsynth -ni font.sf2 composition.mid -F output.wav -r 44100
from IPython.display import Audio
Audio('output.wav')

Let’s hear the initial results:

Nothing.

There are no results.

Artificially Altering the Results:

When checking the results, the model never predicts values much larger than 0.6. Since casting the predictions to integers truncates anything below 1 down to 0, no actual notes are ever played. I searched everywhere for a way to change the model architecture to fix this, but could not find anything online.
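You can verify this with a quick look at the raw prediction values before the integer cast (a diagnostic I am adding, assuming the model and X from earlier):

raw = model.predict(X[:100])   # X already has shape (N, 1, 9, 106)
print(raw.max(), raw.mean())   # if the maximum stays below 1, casting to uint8 zeroes every note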

The solution that I came to was reasonably unorthodox: manually altering the results.

The problem with the network is that the predicted values are too small. If we add a certain value to the predictions, actual notes will appear.

How does this not interfere with the model’s predictions? Because we can interpret the predictions not as played or not played, but as how well each note fits the current timestep. This is backed up by the fact that the model is trained with the MSE loss function: it is the difference between the predicted value and the actual value that matters. Since the values the model predicts are retained, adding a constant simply enlarges the model’s predictions.

Here is the improved generation function:

song_length = 106
data = X[np.random.randint(len(X))][0]
song = []
vanish_proof = 0.65   # constant added to every prediction
vanish_inc = 1.001    # growth rate of vanish_proof per timestep
for i in range(song_length):
    pred = model.predict(data.reshape(1,1,9,106))[0]+vanish_proof
    vanish_proof *= vanish_inc
    notes = (pred.astype(np.uint8)).reshape(106)
    print(notes)
    song.append(notes)
    data = list(data)
    data.append(notes)
    data.pop(0)
    data = np.array(data)

As you can see, there are two new variables: vanish_proof and vanish_inc. vanish_proof is the value added to all predictions, and vanish_inc is the rate at which vanish_proof increases. The increase is necessary because each new prediction of the network is based upon past predictions: if the predictions stayed small, this effect would propagate forward and make the song slowly fade out.
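To see how gently the offset grows over a 106-step piece with the values above, you can compute its final size:

vanish_proof = 0.65
for _ in range(106):
    vanish_proof *= 1.001
print(vanish_proof)   # roughly 0.72 after 106 timesteps: a mild, gradual boost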

This created much better results, like those shown at the beginning of the article. Here is another extract that I enjoyed:

Conclusion:

I think the most insightful part of this article is the artificial altering of the results to counteract the fading output. I haven’t really seen this used in other sources, and I am not sure how good or effective the practice is. You can toy around with the vanish_proof and vanish_inc parameters; set them too high and the piece is interrupted by misplaced notes.

My links:

If you want to see more of my content, click this link.

