Hands-on Tutorials

Writing a music album with Artificial Intelligence

I created an ambient music EP using several generative deep learning architectures

Nicolás Schmidt
Towards Data Science
11 min read · Apr 7, 2021

Photo by Steve Johnson on Unsplash

I have always believed that music and technology form an ecosystem of feedback that benefits both. On one hand, technology provides musicians with new sounds, new interfaces to interact with their creations, and ways of composing they had not explored before. On the other hand, the music created with these new technologies inspires the human mind to build new tools to interact with those creations.

In my master’s program in Sound and Music Computing, I had the opportunity to learn about and explore various deep learning architectures for music generation: from Variational Autoencoders, which work on symbolic representations of music, to more complex systems that operate directly on the waveform.

Having the chance to explore these networks, train them from scratch on my own audio collections, and generate new music made me deeply excited. As a composer and music producer, but also as a software engineer, I saw endless possibilities of expression that pushed my creativity while also constraining it.

Thus, in this article I intend to share and reflect on my creative process during the development of the final project for the Computational Music Creativity course.

Exploring Artificial Intelligence Deep Learning Architectures

At the beginning of the project I did not have a clear idea of what I wanted to work on. The specifications for the final project were largely at our discretion, but we had to use at least a few of the technologies and resources learned in class. These could be: building interfaces for effects; creating musical effects; training an algorithm or deep learning architecture for music generation; using existing tools for creative purposes; building PureData patches that interact with Python; among many other options.

Given the open-ended nature of the assignment, at first I simply let my curiosity carry me: I wanted to know what was available out there. One thing was clear to me: I wanted to explore algorithms and deep learning architectures for generating unique music, sounds and textures. The first architecture I came across was WaveNet.

WaveNet is a generative architecture for raw waveforms built mainly on convolutional neural networks. It was created in 2016 by the London-based artificial intelligence company DeepMind. WaveNet generates audio sample by sample, using dilated causal convolutional filters that operate directly on the waveform.

Generation of new samples from the convolutional layers in WaveNet. Taken from van den Oord, A. et al.: WaveNet: A Generative Model for Raw Audio, 2016. Available at https://arxiv.org/pdf/1609.03499.pdf
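
To make this more concrete, here is a minimal PyTorch sketch of the dilated-causal-convolution stack at the heart of WaveNet. It only illustrates how the receptive field grows layer by layer; the gated activations, skip connections and μ-law softmax output of the real model are left out, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedStack(nn.Module):
    """Toy WaveNet-style stack: each layer doubles the dilation, so a few
    layers already see thousands of past samples and never any future one."""

    def __init__(self, channels=32, layers=8):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(layers)
        )

    def forward(self, x):
        # x has shape (batch, channels, time)
        for conv in self.convs:
            pad = conv.dilation[0]                        # left-pad only: strictly causal
            x = torch.tanh(conv(F.pad(x, (pad, 0)))) + x  # simple residual connection
        return x

# Quick shape check: one second of 16 kHz "audio" features keeps its length.
x = torch.randn(1, 32, 16000)
print(CausalDilatedStack()(x).shape)  # torch.Size([1, 32, 16000])
```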

WaveNet is used mainly for speech synthesis and text-to-speech tasks, which is why most of the open-source implementations you can find are trained on the VCTK dataset or something similar. VCTK is a collection of recordings of 110 English speakers with a variety of accents, each reciting short sentences in an anechoic chamber.

I tried out a few WaveNet implementations, but the one that gave me the best results was this one. My first experiment was to train it to “sing”. With this goal in mind I used the vocal stems from the MUSDB18 dataset. MUSDB18 is a dataset dedicated to source separation tasks. It consists of 150 professionally produced songs, each provided as four stems: vocals, drums, bass and other (everything else), plus the final mix. After training the WaveNet on these tracks, I got my first generated sample:

Sample from the WaveNet implementation after 200,000 epochs over the vocal files in the MUSDB18 dataset.

As can be heard, the WaveNet was able to learn some very characteristic elements of the voice such as some consonant sounds. What stands out the most is the “s” sound, which can be heard in the second half of the track.
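
For anyone wanting to reproduce this first experiment, something along these lines extracts the vocal stems with the musdb Python package and resamples them to the 16 kHz this WaveNet implementation expects. The paths, the mono downmix and the output folder are placeholders for illustration, not the exact preprocessing pipeline I used.

```python
import os

import librosa
import musdb            # pip install musdb
import soundfile as sf

os.makedirs("vocal_stems", exist_ok=True)

# Placeholder path to an unpacked MUSDB18; only the training subset is used here.
mus = musdb.DB(root="MUSDB18", subsets="train")

for track in mus:
    vocals = track.targets["vocals"].audio    # stereo array, shape (n_samples, 2)
    mono = vocals.mean(axis=1)                # downmix to mono
    mono_16k = librosa.resample(mono, orig_sr=track.rate, target_sr=16000)
    sf.write(os.path.join("vocal_stems", f"{track.name}.wav"), mono_16k, 16000)
```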

To be honest, for a first attempt I was not very satisfied with the generated track, since it is not possible to perceive pitch or other musical characteristics of the sung human voice. For this reason, I kept looking for other deep learning architectures that could be used for music generation. This is how I came across Google Magenta: “An open source research project exploring the role of machine learning as a tool in the creative process”.

At that moment I felt like a child with a new toy. Magenta has about 20 state-of-the-art deep learning models that can be used for various musical purposes: humanizing drums; generating piano parts; continuing melodies; chord accompaniment conditioned on a melody; interpolating between measures with variational autoencoders; among others. It also includes models for tasks such as image stylization and the generation of vectorized sketch-like drawings. It opens endless possibilities for creators!

Some of the Magenta models for music creation are implemented as standalone applications and as Max for Live devices for Ableton Live. I highly recommend downloading and exploring them.

The two models that caught my attention the most were the PerformanceRNN and the Piano Transformer. The first is an LSTM-based recurrent neural network designed to model polyphonic music with expressive timing and dynamics, via a stream of MIDI events with learned onset, duration, velocity and pitch. The Piano Transformer, on the other hand, is an autoregressive model capable of generating expressive piano performances by learning long-term structure. It was trained on ~10,000 hours of piano performances uploaded by YouTube users and converted to MIDI using Google’s Wave2Midi2Wave. Unfortunately, unlike the PerformanceRNN, the Piano Transformer has not been released to the community yet, and we can only use the pre-trained model in this Colab Notebook.

If you want to try out the PerformanceRNN too, the Magenta team has published a Colab Notebook where you can play around with models pre-trained on the Yamaha e-Piano Competition dataset: around 1,400 MIDI performances by professional pianists.

Motivation

Having experimented with the WaveNet, the PerformanceRNN and the Piano Transformer, I was ready to make music. After the experiments I had done, especially with the PerformanceRNN, I was convinced that I wanted to preserve the voice of the machine. By this I mean that, regardless of the dataset a given model is trained on, when you generate a sample you get a unique piece whose authorship is diffuse. On one hand, there are all the artists who were part of the dataset the network was trained on. There are also all the researchers and developers involved in implementing the architecture and the model. Finally, there is me, as the user of this complex system, making decisions about the training set, the duration of the generated samples, and the subsequent arrangement and creation of a piece of music.

My intention with this exercise started from pure curiosity: an exploratory process that let me listen to textures and pieces created by an interdisciplinary team that never agreed to work together, yet somehow did. Through the creation of an EP, I wanted to explore the different spaces these new forms of musical creation offer us: the creativity that explodes when you are handed textures and compositions never heard before, but that at the same time clashes with an ethic suggesting certain restrictions on my space as the “arranger”, the person deciding what to modify and what to leave untouched in order to preserve the voice of all the actors in the system.

So I challenged myself to create three songs. This EP takes the listener on a journey about restricting my own creativity in favor of the creativity of the network and of all the people behind its creation.

First Track: SatiEno

When I talked about these technologies with the sound artist Nico Rosenberg (check out his amazing work), he suggested looking into the work of Brian Eno and Erik Satie. Brian Eno is considered one of the fathers of ambient music. Erik Satie, on the other hand, was a French composer and pianist active at the turn of the 20th century. He called his genre furniture music, in the sense that he created music not to sit down and listen to, but to “decorate” the space.

Erik Satie even asked the audience at his concerts to talk to each other, to emphasize the furniture-like role of his music.

Examples of the PerformanceRNN trained for an increasing number of epochs. The last one is a heavily overfitted version of Gymnopédie №1.

I compiled a dataset of MIDI transcriptions of Erik Satie’s compositions to train the PerformanceRNN from scratch. After about 16,000 epochs, the network had learned to play piano in a highly expressive way. At times, perhaps because of the small number of compositions I trained the system on, some recognizable sections of Gymnopédie №1 could be heard. For this reason I played with the temperature parameter, to give the network a certain degree of freedom in its generation.
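
For readers unfamiliar with the temperature parameter: it simply divides the model’s logits before the softmax, so values below 1 make the sampling more conservative (and more likely to quote Satie verbatim), while values above 1 make it more adventurous. This is a generic sketch of the idea, not Magenta’s actual sampling code:

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample one event index from unnormalized logits.

    temperature < 1 sharpens the distribution (closer to the training data),
    temperature > 1 flattens it (freer, riskier generations).
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                        # for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)

# Toy example: three possible MIDI events with raw scores 2.0, 1.0 and 0.1.
print(sample_with_temperature([2.0, 1.0, 0.1], temperature=1.3))
```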

I finally chose a piece generated by the system that I found particularly emotional, and decided that it was going to be the main melody of the first song.

Chosen sample for the first track generated by the PerformanceRNN

The piano track was great, but not enough to build a song on its own. My intention was to mix the melody generated by the PerformanceRNN with an audio sample generated by a network that works directly on waveforms. So I returned to the WaveNet, but this time I trained it on almost two hours of Brian Eno’s music: the album Music For Airports and a one-hour extended version of An Ending (Ascent).

Both recordings were obtained from YouTube, and the dataset was split into 16-second chunks using the ./download-from-youtube.sh script provided by Deepsound’s SampleRNN implementation: https://github.com/deepsound-project/samplernn-pytorch
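
The chopping step itself is simple. A rough Python equivalent of what that script does, assuming the audio has already been downloaded as WAV files (the filename below is a placeholder), could look like this:

```python
from pathlib import Path

import librosa
import soundfile as sf

SR = 16000              # this WaveNet implementation works at 16 kHz
CHUNK_SECONDS = 16

def chop(input_wav, out_dir="chunks"):
    """Split one long recording into fixed-length training chunks."""
    audio, _ = librosa.load(input_wav, sr=SR, mono=True)
    Path(out_dir).mkdir(exist_ok=True)
    step = CHUNK_SECONDS * SR
    for n, start in enumerate(range(0, len(audio) - step + 1, step)):
        sf.write(f"{out_dir}/{Path(input_wav).stem}_{n:04d}.wav",
                 audio[start:start + step], SR)

chop("music_for_airports.wav")   # placeholder filename
```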

After 123,000 epochs the results were astonishing:

Chosen sample for the first track generated by the WaveNet

With this ambient audio file generated by the WaveNet and the piano track generated by the PerformanceRNN, I went to Ableton to start building the first song.

I wanted my role as an arranger to be limited, so decisions were made trying to keep the samples generated by the WaveNet and the PerformanceRNN as untouched as possible. In summary, the main arrangements made on these two audio files to build the track were:

  • Cropped the dissonant chords out of the audio generated by the WaveNet.
  • Used Ableton’s Scale MIDI effect so that all the notes of the piano melody generated by the PerformanceRNN fell in the predominant key of the melody generated by the WaveNet: D major (a rough Python equivalent of this scale-snapping step is sketched after this list).
  • Copied the lowest notes of the piano melody to create a cello accompaniment for certain parts of the song.
  • Copied some of the more characteristic notes of the piano melody to create a bell arrangement.
  • Duplicated the audio generated by the WaveNet. The first copy remained unchanged and was panned all the way to the left. The second copy was split in half and the two halves were swapped, so that the second half played at the beginning and the first half at the end, panned to the right. This track is the foundational ambient sound.
  • Duplicated the ambient track again and pitch-shifted it up an octave to gain information in the treble (the WaveNet samples at 16 kHz, so there was no content above 8 kHz).
  • Added a drum track for the part where the main melody reaches a constant beat.
  • Added a vocoder track with the low chords of the piano melody and the ambient generated by WaveNet.
  • Added an electric guitar in two specific parts of the song.
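
Ableton’s Scale device does this snapping inside the DAW, but the same idea is easy to sketch offline with pretty_midi. The rule below (move out-of-scale notes down a semitone, otherwise up) is my own simplification rather than exactly what Ableton does, and the file names are placeholders:

```python
import pretty_midi

D_MAJOR = {2, 4, 6, 7, 9, 11, 1}   # pitch classes of D E F# G A B C#

def snap_to_scale(midi_in, midi_out, scale=D_MAJOR):
    """Move every out-of-scale note to an adjacent in-scale pitch."""
    pm = pretty_midi.PrettyMIDI(midi_in)
    for instrument in pm.instruments:
        for note in instrument.notes:
            if note.pitch % 12 not in scale:
                if (note.pitch - 1) % 12 in scale:   # prefer snapping down
                    note.pitch -= 1
                else:                                # otherwise snap up
                    note.pitch += 1
    pm.write(midi_out)

snap_to_scale("performance_rnn_melody.mid", "performance_rnn_d_major.mid")
```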

Second Track: Response

For the second track, my intention was to intervene as an arranger even less than in the first track, while still keeping a coherent story and sound design. So I used the Piano Transformer model to generate a continuation of the piano track used in the first song. I find this especially interesting because it means two different architectures interacting: a variation of the original melody, but now with the Yamaha dataset as its compositional style.

The result was also remarkable, but unlike the sample from the Satie-trained PerformanceRNN, this piece did not have an easily recognizable beat, which made it ideal for a more ambient solo piano composition.

For this piece, however, I didn’t want to leave Brian Eno’s influence aside, so I took another sample from the WaveNet, this time trained for a smaller number of epochs to give the composition a rawer atmosphere. The generated audio had a more constant pitch than the one used in the first song. However, you can clearly hear distortion in the signal, because if you don’t give the WaveNet a seed, the audio is sculpted out of white noise. I think the pops and clicks give the composition a vintage vinyl touch and add a lot of character to the song.

For this piece my interventions were more restricted:

  • I didn’t touch the audio generated by the WaveNet; I just duplicated it to create the stereo image, as in the previous track.
  • I used Ableton’s Scale effect again on the piano piece generated by the Piano Transformer, this time in A major.
  • I duplicated and reversed the piano track, added a lot of reverb, and reversed it again. This created a very dreamy sound design (a sketch of this reverse-reverb trick follows this list).
  • The cello melody comes from the lowest notes of the piano track, with some adjustments to avoid dissonance.
  • I added some bells to emphasize certain notes.
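
The reverse-reverb trick is worth spelling out, since it is an old studio technique rather than anything specific to these models. Assuming the piano part has been bounced to a WAV file, a crude offline version, with a synthetic decaying-noise impulse response standing in for Ableton’s reverb and placeholder file names, could look like this:

```python
import numpy as np
import librosa
import soundfile as sf
from scipy.signal import fftconvolve

def reverse_reverb(in_wav, out_wav, decay_seconds=3.0, wet=0.5):
    """Reverse the audio, convolve it with a decaying-noise impulse response,
    mix, then reverse back, so the reverb tail swells into each note."""
    audio, sr = librosa.load(in_wav, sr=None, mono=True)
    n_ir = int(decay_seconds * sr)
    ir = np.random.randn(n_ir) * np.exp(-np.linspace(0.0, 8.0, n_ir))  # crude reverb IR
    flipped = audio[::-1]
    wet_sig = fftconvolve(flipped, ir)[: len(flipped)]
    wet_sig /= np.max(np.abs(wet_sig)) + 1e-9                          # keep the wet level sane
    out = ((1.0 - wet) * flipped + wet * wet_sig)[::-1]                # reverse back
    sf.write(out_wav, out, sr)

reverse_reverb("piano_transformer_bounce.wav", "piano_reverse_reverb.wav")
```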

Final Track: Rhodes does not like Pop

For the last track I wanted to intervene as little as possible, and I wanted to try something different from Satie + Eno: audio generated 100% in the waveform domain that would allow me to explore new sound designs. In my exploration I tried to train two different implementations of the SampleRNN: 1 and 2. Neither of them gave me results beyond audio pops that damaged my headphones. The repositories have not been maintained for years, and I could not find a fork that gave the results I expected.

I also tried OpenAI’s Jukebox. I found a Colab Notebook for sampling from Jukebox conditioned on genre, artist and lyrics. The results were amazing:

However, I wanted to train the network myself, on music and pieces that I found interesting. So I finally went back to the WaveNet and trained two networks in parallel. The first one used about 10 hours of dubstep music.

WaveNet trained on 10 hours of dubstep for 131,350 steps.
WaveNet trained on 10 hours of dubstep for 519,900 steps.

It is interesting to note how the WaveNet managed to learn the snare and kick drum sounds and many of the noises belonging to the breaks in this particular genre. However, the result was so conceptually distant from the other two tracks that I decided not to work with this sample.

WaveNet trained for 999,999 steps on 6 hours of epic music.

The second attempt used 6 hours of “epic” music. This result left me quite frustrated, as I had expected a greater predominance of vocal, orchestral or string sounds. However, these last two experiments made me realize that, at least with this WaveNet implementation, the network learns better when the timbral features of the dataset are similar across files. This is why there are WaveNet implementations, such as the WaveNet used in Tacotron 2, that allow text-to-speech synthesis conditioned on the speaker ID in the VCTK dataset. In other words, to obtain better results, the audio segments the network is trained on must share acoustic features.

This is how I arrived at my last attempt: training the network on this:

Ikumi Ogasawara playing pop songs on her Rhodes.

It seemed to me that training the WaveNet on a Rhodes performance (the Rhodes being the main element of the first track), and in particular on Ikumi Ogasawara’s remarkable interpretations of well-known pop songs, could give good results. However, it seems that the WaveNet does not like pop, but prefers more concrete music. For this last track the only arrangement was to lower the gain of the sections that were mostly noise.

Results

All the tracks were mixed by me, so part of the process included effects such as compression, equalization, delay and reverb, plus some automation. The final album can be heard below.

Bonus

Also, just for fun, I trained the PerformanceRNN on a dataset consisting of Dream Theater songs. This overfitted model created variations of well-known pieces such as the intro of Trail of Tears and Lost Not Forgotten.


Music and engineering have been the two main interests in my life. I have always tried to merge them so I can work and do research in this combined field.