

Creative content is one of the hardest things for computers to replicate: a very low loss would mean the network's output is not original, while a very high loss would mean the output has no resemblance to the source material. Because of this, it takes a lot of hyperparameter tuning to get valid results.
The Concept:
The Bach Chorale dataset is an extensive dataset containing the note values from 381 of Bach's chorales. Each chorale has been broken down into a CSV file with 4 columns, one for each of the 4 voices in the chorale (soprano, alto, tenor and bass).
You can find the dataset here.
My goal is to convert this data into spatial data that gives the network the ability to link past notes to make decisions for the next notes. My initial impression was to convert the files into images, as images work well for many different types of networks.
The network would be given corrupted data (data with missing information) and would try to fill in the blanks.
Data Collection:
This script imports all the libraries needed to run the program: NumPy for matrix manipulation, PIL for images and pickle to store list values in local files, to save time later. Array2img is a special function used to convert the Bach Chorale arrays into images: it takes a blank (fully black) image and adds a white pixel wherever there is a note.
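A minimal sketch of what that setup might look like; the array2img implementation here is my own guess at the behaviour described above, not the original gist:

```python
import os
import pickle

import numpy as np
from PIL import Image


def array2img(arr):
    # Convert a (106, 106) note array into a black image with a white
    # pixel wherever a note is present (sketch of the behaviour above).
    img = np.zeros((106, 106), dtype=np.uint8)  # fully black canvas
    img[arr > 0] = 255                          # white pixel where a note exists
    return Image.fromarray(img)
```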
This script looks through all the files in a directory; use os.chdir() to move to the directory in which all of the data is stored. The values are split into 106 by 106 arrays, since the range of note values in the MIDI files is 106 and square images are easier to manipulate than rectangles, particularly when it comes to the kernel_size of convolutional nets. The data is then pickled for future use.
This script converts each CSV file into an array that can then be converted into an image. Along with another script to convert image files to MIDI files, this should work perfectly, right?
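Here is a hedged sketch of that conversion-and-pickling step. The helper name csv2arr, the directory name and the assumption of a one-line CSV header are all mine, not the original gist's:

```python
import os
import pickle

import numpy as np


def csv2arr(path, size=106):
    # Read one chorale CSV (4 columns of MIDI note values) and place each
    # note into a (size, size) array: rows are time steps, columns pitches.
    arr = np.zeros((size, size), dtype=np.uint8)
    notes = np.genfromtxt(path, delimiter=",", skip_header=1, dtype=int)
    for t, chord in enumerate(notes[:size]):
        for pitch in chord:
            if 0 < pitch < size:
                arr[t, pitch] = 1
    return arr


# Walk the data directory and pickle the converted arrays for later use.
os.chdir("bach_chorales")  # assumed folder holding the chorale CSV files
arrays = [csv2arr(f) for f in sorted(os.listdir(".")) if f.endswith(".csv")]

with open("chorale_arrays.pickle", "wb") as f:
    pickle.dump(arrays, f)
```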
Problem 1: Transformation
I got very confused about the transformations needed to turn the generated image into a valid input for the img2midi function.
Eventually, after a few hours of testing, I found that rotating the image 90 degrees counterclockwise gave the best results.
Here is an example of a Bach chorale in image form:

Problem 2: Pitch
Even after completing the project, I am not exactly sure what caused this issue. The notes that the function output were just all wrong: when I printed the values the computer was using, they did not match the values in the CSV file.
I had to rework the img2midi script so that I could input the values from the CSV directly into the function.
The original code for this script is not mine. You can find it here.
I feed the notes from the CSV directly into the function; the notes are stored in a list named "notesy".
Inputting the values directly instead of going through an image prevents the wrong pitches from occurring.
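The reworked script itself is based on code that is not mine, so purely as an illustration of the idea, here is how a list of (time step, pitch) notes such as notesy could be written straight to a MIDI file using the midiutil library; the function name, step length and tempo are assumptions:

```python
from midiutil import MIDIFile


def notes_to_midi(notesy, filename="chorale.mid", step=0.25, tempo=100):
    # Write (time_step, midi_pitch) pairs straight to a MIDI file,
    # skipping the image round-trip entirely.
    mf = MIDIFile(1)                       # one track
    mf.addTempo(0, 0, tempo)               # track, start time, BPM
    for t, pitch in notesy:
        # track, channel, pitch, time (beats), duration, volume
        mf.addNote(0, 0, int(pitch), t * step, step, 100)
    with open(filename, "wb") as f:
        mf.writeFile(f)
```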
Using another script, midi2img, I converted the MIDI back into an image and then compared the two images.
I observed the difference and made a few adjustments to the array2img code.
After many difficult experiences with images and MIDI files, I decided that I would just use the MIDI notes themselves as the inputs and outputs of the network. It could work out well in the long term, as the images contain a lot of redundant values. 4-channel inputs would also cause the network to output a 4-part harmony, which is exactly what I am looking for.
Data Synthesis:
Random Corruption:
The first idea that came to mind was a random removal of certain notes. This script removes a random selection of notes from each chorale.
The corrupt coefficient is used to calculate the number of values that will be randomly removed from the chorale. This is calculated by multiplying the number of notes in the chorale by the coefficient.
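A minimal sketch of that idea, assuming the chorale is stored as a (timesteps, 4) array of MIDI note values; the function and argument names are mine:

```python
import random

import numpy as np


def random_corrupt(chorale, corrupt_coeff=0.3):
    # Zero out a random fraction of the notes in a (timesteps, 4) chorale.
    # Number removed = number of notes present * corrupt_coeff.
    corrupted = chorale.copy()
    filled = [tuple(pos) for pos in np.argwhere(corrupted > 0)]
    for t, voice in random.sample(filled, int(len(filled) * corrupt_coeff)):
        corrupted[t, voice] = 0
    return corrupted
```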
After a little bit of testing, it turned out that this method did not work very well: the time steps defined in the dataset are very short, so the script simply split notes into smaller pieces. That is too easy for the model to extrapolate, and therefore useless as training data.
Here is a sample of the data generated by this algorithm:


Block by Block:
This script uses a different strategy: it creates blocks of a defined size and clears the chorale vertically, meaning the notes for all 4 voices are removed within a short snippet of the chorale.
Instead of specifying the exact number of corrupted notes, a probability is used to determine how much of the image is corrupted.
The gist also features a script to generate a dataset with a defined number of datapoints.
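A sketch of the block-by-block approach under the same assumptions; the block size, probability and function names are mine, not the original gist's:

```python
import numpy as np


def block_corrupt(chorale, block_size=8, p=0.2):
    # With probability p, clear ALL four voices for block_size consecutive
    # time steps ("vertical" corruption across the whole chord).
    corrupted = chorale.copy()
    for start in range(0, len(corrupted), block_size):
        if np.random.random() < p:
            corrupted[start:start + block_size, :] = 0
    return corrupted


def make_dataset(chorales, n_points):
    # Build n_points (corrupted, original) training pairs.
    pairs = []
    while len(pairs) < n_points:
        chorale = chorales[np.random.randint(len(chorales))]
        pairs.append((block_corrupt(chorale), chorale))
    return pairs
```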
Here are the corruptions made by this algorithm:


Line by Line:
This script removes all voices in the chorale except for the bass line. The line to be kept is determined by the line_n variable; as Python uses 0-based indexing, 0 refers to the soprano and 3 to the bass.
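A sketch of the line-by-line corruption, again assuming a (timesteps, 4) array; only line_n comes from the original description:

```python
import numpy as np


def line_corrupt(chorale, line_n=3):
    # Keep only one voice (0 = soprano ... 3 = bass) and clear the rest.
    corrupted = np.zeros_like(chorale)
    corrupted[:, line_n] = chorale[:, line_n]
    return corrupted
```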
Here is a sample of the images generated using this algorithm:


After some testing, I decided that the best way to define the data would be to express it as arrays of size (106, 4): one channel for each of the 4 voices, with each chorale cut into segments 106 timesteps long.
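For illustration, the segmentation into (106, 4) chunks might look something like this; the helper name is mine:

```python
def segment(chorale, length=106):
    # Cut a (timesteps, 4) chorale into non-overlapping (106, 4) segments,
    # dropping any leftover tail shorter than one full segment.
    n_segments = len(chorale) // length
    return [chorale[i * length:(i + 1) * length] for i in range(n_segments)]
```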
Models:
Assuming that you have saved all of the pickled files in your current working directory, this script prepares all of the block-based data.
It also has 2 other interesting functions:
Fill in block:
Fill in block takes the output of the network and removes any overlap with the notes that were kept, producing a complete version of the model's output. This can be compared to the raw output for more information.
Array2img analysis:
Array2img analysis converts an array into an image while colouring the pixels according to their origin. This makes it easy to tell which notes were generated by the computer.
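Since the gist itself is not reproduced here, this is only one plausible reading of those two helpers; the exact overlap handling and colour scheme are assumptions:

```python
import numpy as np
from PIL import Image


def fill_in_block(corrupted, prediction):
    # Keep the notes that were never removed and take the model's output
    # only where the input was blanked out.
    filled = corrupted.copy()
    missing = corrupted == 0
    filled[missing] = prediction[missing]
    return filled


def array2img_analysis(corrupted, filled, size=106):
    # Render the filled-in chorale, colouring generated notes red and
    # original notes white, so the model's contribution stands out.
    img = np.zeros((size, size, 3), dtype=np.uint8)
    for t in range(min(size, len(filled))):
        for voice in range(filled.shape[1]):
            pitch = int(round(float(filled[t, voice])))
            if 0 < pitch < size:
                colour = (255, 255, 255) if corrupted[t, voice] > 0 else (255, 0, 0)
                img[t, pitch] = colour
    return Image.fromarray(img)
```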
This baseline model only uses dense layers and the relu activation function. Relying on dense layers alone makes it difficult to learn spatial patterns; while this usually just means slower convergence, in this case it resulted in little to no learning.
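A rough reconstruction of such a dense-only baseline in Keras; the layer sizes, optimizer and MSE loss are my assumptions, not the original architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(106, 4))          # 106 timesteps, 4 voices
x = layers.Flatten()(inputs)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dense(106 * 4, activation="relu")(x)
outputs = layers.Reshape((106, 4))(x)

baseline = tf.keras.Model(inputs, outputs)
baseline.compile(optimizer="adam", loss="mse")
```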
Here is a sample of the results:



This model is similar to the baseline model, with dropout layers added. I thought that most of the flat lines created by the baseline model came from the best prediction always being the same 4 notes, so dropout layers should increase the variance of the predictions and thus give better results.
The dropout could be more controlled but this was merely a proof of concept.
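The same baseline with dropout added might look like this; the dropout rate and layer sizes are guesses:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(106, 4))
x = layers.Flatten()(inputs)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dropout(0.3)(x)                       # dropout rate is a guess
x = layers.Dense(512, activation="relu")(x)
x = layers.Dropout(0.3)(x)
x = layers.Dense(106 * 4, activation="relu")(x)
outputs = layers.Reshape((106, 4))(x)

dropout_model = tf.keras.Model(inputs, outputs)
dropout_model.compile(optimizer="adam", loss="mse")
```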
This is the first model that involves the 1-dimensional convolutional layers I had intended to use from the very beginning.
It is a basic convolutional network with some basic hyperparameter tuning. The kernel_size is well optimized: 4 fits the number of channels well and converges faster than other values.
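A minimal Conv1D version, keeping only the kernel_size of 4 mentioned above; the filter counts and depth are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(106, 4))
x = layers.Conv1D(64, kernel_size=4, padding="same", activation="relu")(inputs)
x = layers.Conv1D(64, kernel_size=4, padding="same", activation="relu")(x)
outputs = layers.Conv1D(4, kernel_size=4, padding="same", activation="relu")(x)

conv_model = tf.keras.Model(inputs, outputs)
conv_model.compile(optimizer="adam", loss="mse")
```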
While researching model architectures, I came across GANs as another valid option.
The perfect GAN for this project would be a Pix2Pix GAN. The problem is that it would take a long time to code, and since I didn't want to spend too much time on this project, I dropped the idea.
However, I managed to create an encoder-decoder setup similar to that of the generator in a Pix2Pix GAN.
The encoder decreases the dimensions of the input until it has a smaller representation of the same image. The decoder then upscales the representation, filling in the holes in the process.
Additionally, there are skip connections between the encoder and decoder blocks, allowing information to pass between different levels of the process.
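An illustrative 1D encoder-decoder with a skip connection, in the spirit described above; the exact depth, filter counts and pooling factors are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(106, 4))

# Encoder: shrink the time dimension while widening the feature maps.
e1 = layers.Conv1D(32, 4, padding="same", activation="relu")(inputs)
p1 = layers.MaxPooling1D(2)(e1)                              # 106 -> 53
e2 = layers.Conv1D(64, 4, padding="same", activation="relu")(p1)

# Decoder: upscale back to the original length, filling in the holes.
u1 = layers.UpSampling1D(2)(e2)                              # 53 -> 106
u1 = layers.Concatenate()([u1, e1])                          # skip connection
d1 = layers.Conv1D(32, 4, padding="same", activation="relu")(u1)
outputs = layers.Conv1D(4, 4, padding="same", activation="relu")(d1)

encoder_decoder = tf.keras.Model(inputs, outputs)
encoder_decoder.compile(optimizer="adam", loss="mse")
```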
Here is a sample of the results:



Conclusion:
I think that the project was mostly successful. However, using a Pix2Pix GAN for this project would be very interesting:
A Pix2Pix GAN should be able to generate much more original and interesting results, as the generator loss would not just be the MSE between the ground truth and the generator output; in theory, this would yield truly original material.
I hope to use a Pix2Pix GAN for this project in the near future.
My links:
If you want to see more of my content, click this link.