Written by Philippe Petit, Gatien Vassor, Emna Bairam and Wladimir Raymond – April 2021

1. Context
The Pyzzicato project is the final part of the validation of our three-month Data Scientist bootcamp at Datascientest.
This project aims to build a web application (using Streamlit) that allows users to generate novel music tracks in the style of a composer of their choice. In essence, the models automatically learn musical styles and generate new musical content from existing compositions, using a set of Deep Learning techniques and a dataset of MIDI files.
In the realm of deep learning algorithms, Generative Models have a peculiar goal: they aim to create original output, meaning output that is not an exact copy of what they have learned to reproduce. The output must be original in that sense, but not excessively so: the judges (here: us humans) must at least recognize, in the generated output, the main characteristics of the data the model was trained on.
In a nutshell: good Generative Models are the "spiritual children" of their input datasets, but they are not expert counterfeiters. A Generative Model that perfectly reproduces its input data fails its mission, as the generated content already exists.
The well-known website thispersondoesnotexist.com is an excellent illustration of this definition, as it produces images of generic humans, not of existing ones.
In the context of music generation, one potential goal of Generative Models is to learn to imitate the "style" of the music data they have been fed with. The listener of the generated musical output must feel like it could have been written by a human composer and recognize a style or a musical genre.
2. The project
Dataset and data exploration
Generating music is no easy task, even for humans, and a music track is a very complex object. To simplify the modeling, we have chosen to generate mono-instrumental music and have thus selected a dataset containing recordings of mono-instrumental piano performances. We have used the MAESTRO dataset (MIDI and Audio Edited for Synchronous TRacks and Organization), which was created as part of the Magenta project.
The dataset is composed of 1276 tracks covering over 200 hours of piano recordings, mostly of classical music composers from the 17th to the early 20th century. This dataset is rather large, and we had to select sub-datasets to train the deep learning models in a reasonable amount of time.
Fig 1 shows (i) the total number of compositions performed and collected in the dataset per composer and (ii) the number of unique compositions per composer for the 14 main composers of the dataset.

We have decided to select composers exhibiting very distinct composing styles and whose total number of unique compositions was large enough to train the models without risking too many redundant inputs. We also selected composers well known to the general audience to ease the assessment of the results. Based on that, we chose three composers (Chopin, Bach and Debussy) to train our generative models.
Data visualization and preprocessing
MIDI files can easily be manipulated in Python using (for instance) the pretty_midi package created by Colin Raffel.[1]
A pretty_midi object encodes the musical data of a MIDI file as a list of Note events of the form Note(start=0.000, end=0.124, pitch=76, velocity=80). In this example, the note with pitch number 76 is played between t=0 and t=0.124 seconds, with a velocity (the force with which the note is played) of 80. Pitches and velocities are encoded between 0 and 127.
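As a minimal illustration (the file path below is a placeholder), the Note events of a MIDI file can be inspected like this:

```python
import pretty_midi

# Load a MIDI file from the dataset ('example.mid' is a placeholder path)
pm = pretty_midi.PrettyMIDI('example.mid')

# A mono-instrumental piano track contains a single instrument
piano = pm.instruments[0]

# Each Note event carries a start time, an end time, a pitch and a velocity
for note in piano.notes[:5]:
    print(note)  # e.g. Note(start=0.000, end=0.124, pitch=76, velocity=80)
```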
This representation of the musical track is not well suited for deep learning modeling. Therefore, to encode the data, we have used a matrix-based approach using the piano roll representation of a musical track.
The piano roll representation turns the list of Note events of a pretty_midi object into a [Pitches vs Timesteps] 2D matrix (see Fig. 2).

We plot the piano roll matrix using Librosa and pretty_midi directly in the Streamlit web app.
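Here is a minimal sketch of such a plotting function, assuming a pretty_midi.PrettyMIDI object pm and the sampling frequency fs defined just below; the exact code used in the app may differ:

```python
import librosa.display
import matplotlib.pyplot as plt
import pretty_midi
import streamlit as st

def plot_piano_roll(pm, start_pitch=24, end_pitch=96, fs=20):
    """Plot the piano roll of a pretty_midi.PrettyMIDI object between two pitches."""
    # Piano roll matrix of shape (128, n_timesteps): one column per 1/fs seconds
    piano_roll = pm.get_piano_roll(fs=fs)[start_pitch:end_pitch]
    plt.figure(figsize=(12, 4))
    librosa.display.specshow(piano_roll, hop_length=1, sr=fs,
                             x_axis='time', y_axis='cqt_note',
                             fmin=pretty_midi.note_number_to_hz(start_pitch))
    return plt.gcf()

pm = pretty_midi.PrettyMIDI('example.mid')  # placeholder path
st.pyplot(plot_piano_roll(pm))              # render the figure in the Streamlit app
```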
The piano roll matrix can be seen as a concatenation of pitch vectors with dimension 128 (the maximum number of different pitches in MIDI standard) that are successively played at each different time step (here a column of the matrix) representing the flow of time.
The duration of a timestep is determined by the sampling frequency fs, with timestep = 1/fs. With fs = 20, each column of the piano roll matrix represents 1/20 of a second of music.
Using this representation, we can define the features and targets to be given to the deep learning models (see Fig 3 and the sketch after this list):
- Features will be sequences of pitch vectors with length n_feat
- Targets will be the (n_feat + 1)-th pitch vector, which the models will learn to predict based on the feature sequences
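As a sketch of this preprocessing step (the sequence length of 200 timesteps and the binarization of velocities are illustrative assumptions), the feature/target pairs can be built as follows:

```python
import numpy as np

def build_sequences(piano_roll, n_feat=200):
    """Slice a (128, n_timesteps) piano roll into (features, targets) pairs.

    Each feature is a sequence of n_feat consecutive pitch vectors; the target
    is the pitch vector that immediately follows it. Velocities are binarized
    (note on / note off) to match the sigmoid outputs of the models.
    """
    roll = (piano_roll.T > 0).astype(np.float32)  # shape (n_timesteps, 128)
    features, targets = [], []
    for i in range(roll.shape[0] - n_feat):
        features.append(roll[i:i + n_feat])
        targets.append(roll[i + n_feat])
    return np.array(features), np.array(targets)
```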
3. Deep learning models
Music tracks are basically time series of notes. The notes in a music track, as opposed to noise, are not played in a random order but are deeply connected to each other through notions such as harmony and melody. Accordingly, the deep learning models we use to generate music must be able to learn the musical "context", i.e. to keep track of notes separated by a relatively large amount of time. We have therefore used two deep learning architectures that are known to be efficient at this task: dilated convolutional networks and recurrent neural networks.
The deep learning models we built also had to be able to predict pitch vectors containing zero notes (silences) or several notes (a chord), which are very common and important features of music tracks. To allow this in a simple way, we used sigmoid activation functions in the output layers of all our models.
Dilated convolutional models

The principle of dilated convolution is presented in Fig 4. The main advantage of dilated convolutional models is that, with an adequate choice of layers and hyperparameters, they can produce decent results in a reasonable amount of time, which can serve as a starting point for comparison. We started by designing a simple convolutional model with a few layers (called Conv1D in the Pyzzicato app, but not discussed here) and, motivated by the reasonably good results obtained, we built a more complex convolutional model inspired by the famous WaveNet architecture (we call it "pseudo-wavenet"). This model is composed of five convolutional 1D blocks (with dilation rates between 2 and 16) and one classification block, as described in Fig 5. Note the sigmoid activation function in the last layer, which allows the generation of multiple one-hot-encoded note events in a single pitch vector.

The full code is available on the project's GitHub repository.
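As an illustration, a simplified Keras sketch of this kind of architecture could look like the following; the number of filters, kernel sizes, dropout rates and exact dilation rates are assumptions, not the actual hyperparameters of the pseudo-wavenet model:

```python
from tensorflow.keras import layers, models

def build_pseudo_wavenet(n_feat=200, n_pitches=128):
    """Stacked dilated Conv1D blocks followed by a classification block."""
    inputs = layers.Input(shape=(n_feat, n_pitches))
    x = inputs
    # Five convolutional 1D blocks with increasing dilation rates (illustrative values)
    for dilation_rate in (1, 2, 4, 8, 16):
        x = layers.Conv1D(filters=64, kernel_size=3, padding='causal',
                          dilation_rate=dilation_rate, activation='relu')(x)
        x = layers.Dropout(0.2)(x)
    # Classification block
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(256, activation='relu')(x)
    # Sigmoid output: several notes (a chord) or none (a silence) can be active at once
    outputs = layers.Dense(n_pitches, activation='sigmoid')(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```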
RNN (LSTM) model
LSTM networks are efficient at time series prediction and particularly well suited to our purpose, thanks to their ability to selectively remember patterns over short or long durations of time.
A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell following the strategy described in Fig 6:

Our LSTM model is a sequential model with 2 LSTM layers with dropout and 3 Dense layers as shown in Fig 7.

Again, the full code is available on GitHub.
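A simplified Keras sketch of such a model could look like this; the layer sizes and dropout rates are assumptions:

```python
from tensorflow.keras import layers, models

def build_lstm_model(n_feat=200, n_pitches=128):
    """Sequential model: 2 LSTM layers with dropout, then 3 Dense layers."""
    model = models.Sequential([
        layers.LSTM(256, return_sequences=True, input_shape=(n_feat, n_pitches)),
        layers.Dropout(0.3),
        layers.LSTM(256),
        layers.Dropout(0.3),
        layers.Dense(256, activation='relu'),
        layers.Dense(128, activation='relu'),
        # Sigmoid output: each of the 128 pitches is predicted independently
        layers.Dense(n_pitches, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```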
The models have been trained using the free GPU resources of Google Colab.
How do we generate musical tracks?
To generate a track, we give an input sequence (called the "seed") to one of the trained models and ask it to sequentially predict the next pitch vectors, so that the model progressively generates a novel music track. A few parameters, such as the total duration of the generated track or the threshold value applied to the sigmoid output layer, can also be tuned.
The full generation function is available on GitHub.
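In essence, the generation loop is autoregressive; here is a minimal sketch of it (the parameter names and default values are illustrative):

```python
import numpy as np

def generate_track(model, seed, duration=10, fs=20, threshold=0.5):
    """Autoregressively generate a piano roll from a seed sequence.

    The model repeatedly predicts the next pitch vector, which is binarized
    with `threshold` and appended to the sliding input window before the
    next prediction.
    """
    sequence = seed.copy()              # shape (n_feat, 128)
    n_steps = int(duration * fs)        # number of pitch vectors to generate
    generated = []
    for _ in range(n_steps):
        probs = model.predict(sequence[np.newaxis, ...], verbose=0)[0]
        next_vector = (probs > threshold).astype(np.float32)
        generated.append(next_vector)
        # Slide the window: drop the oldest pitch vector, append the new one
        sequence = np.vstack([sequence[1:], next_vector])
    return np.array(generated)          # shape (n_steps, 128), ready to be converted back to MIDI
```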
4. Results
Principle of the analysis of Generative Models
Estimating the performance of Generative Models is an open question that touches the very concept of creativity. In general, we lack quantitative metrics to estimate the quality of the outputs. Indeed, when we listen to a musical track produced by a Generative Model, how can we judge its quality?
The most common answer to this question is the subjective evaluation method, based on criteria such as:
- 1/ Musicality: Does the succession of notes and chords resemble what is commonly accepted as "music"? (i.e. can we distinguish a "melody" in it? Does it "sound like" a piece of music that a human could have composed?)
- 2/ Fidelity: Does this music contain some characteristic patterns? Does it belong to the expected "musical genre" (classical music, rap, rock etc)?
- 3/ Expressivity: A very subjective criterion, corresponding to the question "does this music evoke emotions in the listener’s mind?"
Even though the best way to evaluate the outputs of this project is to listen to the tracks (using the Pyzzicato app!), the piano roll representation allows for a visual estimation of the first two qualitative criteria, Musicality and Fidelity.
Summary of the Results:
Both pseudo-WaveNet and naive-LSTM models were able to produce good results in terms of musicality and fidelity, even though some subtle behavioral differences can be spotted between the two models. Here are some telling examples:
Example 1:

This first example, using 10 seconds of Bach’s Chromatic Fantasy as seed, illustrates the ability of our models to generate audible music: the notes are rather regularly spaced, and the pitches are varied and fall within the same range as in the seed. Chromatic rises and/or descents evoking those seen in the seed are visible in both tracks. However, pseudo-WaveNet seems to exhibit better fidelity, as the note durations are closer to those of the seed track.
Click here to listen to the seed sequence:
And the track generated using the pseudo-WaveNet model:
Example 2:

This second example, using 10 seconds of Chopin’s Ballade No. 1 as seed, illustrates that naive-LSTM is more at ease with slower tempos and more sustained notes than pseudo-WaveNet. Even though both models generate rather similar outputs in terms of melody, the LSTM manages to reproduce the long sustained notes visible at the beginning of the seed, whereas pseudo-WaveNet produces a silence (between 6s and 7s in the track).
Click here to listen to the seed sequence:
And the track generated using the pseudo-WaveNet model:
Example 3:
One more example to listen to, using a seed extracted from Chopin’s Mazurkas (op. 17):
The most convincing track is generated by the naive-LSTM model:
Limitations
While reasonably satisfying, the above examples of generated tracks also highlight the main limitations of our models:
- Limited generalization: the outputs strongly depend on the chosen seed, and the models sometimes produce very few notes, complete silence, or noisy output
- Difficulty generating long musical tracks: our models struggle when asked to generate tracks exceeding the length of the training sequences (here 10 s). They tend to "dry out" and generate fewer and fewer notes as time goes by.
5. Use cases and potential improvements
Potential use cases
Since this project belongs to the field of artistic creation, proposing realistic use cases is no easy task. However, we could imagine this work being used for:
- Automatic generation of waiting/holding music for call centers or administrations
- Melody suggestion tool for music producers/composers
Potential improvements
- Normalizing the MIDI dataset to control the variability of our seed sequences in terms of tempo, pitch ranges, etc.
- Using state-of-the-art Generative Adversarial Networks (GANs)
- Improving training data management by using generators, in order to increase the size of our training datasets (the matrix-based approach is rather memory-hungry)
- Using another paradigm for data encoding, such as NLP-inspired approaches instead of one-hot-encoded piano roll matrices
- Implementing quantitative metrics to assess the results. Our approach is mostly based on subjective evaluation, but some references found in the literature propose interesting statistical approaches [2].
6. In summary
As a conclusion, we would first like to highlight the fact that the results presented here are very encouraging, in spite of the relative simplicity of our three model architectures.
In this project, we have managed to produce music using three different generative models, ranging from convolutional models that handle sequential data to a more advanced recurrent neural network (LSTM) model that accounts for longer-range dependencies in the data.
We have observed a real musical coherence between the seeds and the corresponding outputs, especially for the pseudo-WaveNet and LSTM models.
As discussed in the article, a larger dataset would certainly have improved our results, first and foremost by reducing overfitting on the training data. It would also have been extremely interesting to implement quantitative metrics to evaluate the quality of the generated tracks. Due to a lack of time, we have limited our analyses to a subjective evaluation method.
Art creation using AI deep learning models is a very young field of research with huge successes ahead. We are very happy and proud to have had the chance to get a glimpse of what can be obtained using these innovative algorithms!
Interested in the topic? Here is a non-exhaustive list of amazing projects:
- Magenta by Google
- MuseNet by OpenAI
- Flow Machines by Sony
- The AIVA startup
and many more!
7. What’s next?
First of all, we would like to extend the subjective evaluation of the results presented here by offering readers a way to generate their own music tracks and evaluate the results. This is possible thanks to the Pyzzicato web app that we have developed, which is available on the Datascientest studio!
Secondly, we are open to collaborations with anyone interested in designing new types of deep learning generative models. The code used for preprocessing and model training is available on GitHub, so feel free to take a look!
Have fun with the Streamlit App here!
Many thanks to Juliette Voyez for her help throughout this adventure!
[1] Colin Raffel and Daniel P. W. Ellis. Intuitive Analysis, Creation and Manipulation of MIDI Data with pretty_midi. In 15th International Conference on Music Information Retrieval Late Breaking and Demo Papers, 2014.
[2] See Yang et al., On the evaluation of generative models in music, Neural Computing and Applications, May 2020, and references therein for more details.