
Audio is a domain where the cross-pollination of ideas from computer vision and NLP has broadened perspectives. Audio generation is not a new field, but thanks to research in deep learning, it too has seen tremendous improvements in recent years. Audio generation has several applications. The most prominent and popular ones today are the smart assistants (Google Assistant, Apple Siri, Amazon Alexa, and so on). These virtual assistants not only try to understand natural language queries but also respond in a very human-like voice.
_This article is an excerpt from the book, Generative AI with Python and TensorFlow 2. This book is written by Joseph Babcock and Raghav Bali (myself). Our aim was to create a practical guide that enables readers to create images, text, and music with VAEs, GANs, LSTMs, GPT models, and more. This book has helped numerous Python programmers, seasoned modelers, and machine learning engineers understand the theory behind deep generative models and experiment with practical examples._
Music is a continuous signal, which is a combination of sounds from various instruments and voices. Another characteristic is the presence of structural recurrent patterns which we pay attention to while listening. In other words, each musical piece has its own characteristic coherence, rhythm, and flow.
In this article, we will approach the task of music generation in a very simplified manner. We will leverage and extend a stacked LSTM network for the task of music generation. Such a setup is similar to the case of text generation (this is a topic for another upcoming article). To keep things simple and easy to implement, we will focus on a single instrument/monophonic music generation task.
The following is an outline of our workflow for this walk-through:
- Getting to know the dataset
- Preparing the dataset for music generation
- Building an LSTM-based music generation model (did we say attention?)
- Model training
- Listening to the beat! Let's hear a few samples our model generates
Let’s first get to know more about the dataset and think about how we would prepare it for our task of music generation.
_The code used in this article is available through GitHub repositories [1] and [2]. More interestingly, it is nicely packaged in a Google Colab-enabled Jupyter notebook which you can simply click and use._
The Dataset
MIDI is an easy-to-use format which helps us extract a symbolic representation of the music contained in the files. For this discussion/walk-through, we will make use of a subset of the massive public MIDI dataset collected and shared by Reddit user u/midiman, which is available at this link: r/WeAreTheMusicMakers
This subset is based on classical piano pieces by great musicians such as Beethoven, Bach, Bartok, and the like. It can be found in a zipped folder, midi_dataset.zip, along with the code in this GitHub repository.
We will make use of music21 to process this subset of the dataset and prepare our data for training the model. As music is a collection of sounds from various instruments and voices/singers, for the purpose of this exercise we will first use the chordify() function to extract chords from the songs. The following snippet helps us get a list of MIDI scores in the required format.

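The repository's exact code may differ, but a minimal sketch of this step could look as follows, assuming the MIDI files have been unzipped into a midi_dataset/ folder:

```python
import glob

from music21 import converter

# Hypothetical location of the unzipped midi_dataset.zip contents
midi_files = glob.glob("midi_dataset/*.mid")

original_scores = []
for midi_file in midi_files:
    # Parse the MIDI file and collapse all parts into a single chordified stream
    score = converter.parse(midi_file)
    original_scores.append(score.chordify())
```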
Once we have the list of scores, the next step is to extract notes and their corresponding timing information. For extracting these details, music21 has simple-to-use interfaces such as element.pitch and element.duration. The following snippet helps us extract such information from the MIDI files and prepare two parallel lists.

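A sketch of this extraction step is shown below; the dot-separated chord representation and the handling of rests are assumptions, and the book's repository may encode them differently:

```python
from music21 import chord, note

original_notes, original_durations = [], []

for score in original_scores:
    notes, durations = [], []
    for element in score.flat.notesAndRests:
        if isinstance(element, chord.Chord):
            # Encode a chord as a dot-separated list of its pitch names
            notes.append(".".join(str(p) for p in element.pitches))
            durations.append(element.duration.quarterLength)
        elif isinstance(element, note.Note):
            notes.append(str(element.pitch))
            durations.append(element.duration.quarterLength)
        elif isinstance(element, note.Rest):
            notes.append("rest")
            durations.append(element.duration.quarterLength)
    original_notes.append(notes)
    original_durations.append(durations)
```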
We take one additional step to reduce the dimensionality. While this step is optional, we recommend it to keep the task tractable and the model's training requirements within limits. The following snippet reduces the list of notes/chords and durations to only those in the key of C major (you may select any other key as well).

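One way to do this, sketched below, is to analyse each score's key with music21 and keep only the pieces detected as C major; the book's repository may instead transpose pieces or filter differently:

```python
# Keep only the pieces detected as C major; transposing everything to a single
# key would be another common option.
filtered_notes, filtered_durations = [], []

for score, notes, durations in zip(original_scores, original_notes, original_durations):
    detected_key = score.analyze("key")
    if detected_key.mode == "major" and detected_key.tonic.name == "C":
        filtered_notes.append(notes)
        filtered_durations.append(durations)
```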
Now that we have pre-processed our dataset, the next step is to transform the notes/chords and duration-related information into a consumable form. One simple method is to create a mapping of symbols to integers. Once transformed into integers, they can be used as inputs to an embedding layer of the model, which gets fine-tuned during the training process itself. The following snippet prepares the mapping and presents a sample output as well.

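A minimal sketch of such a mapping is shown below; the helper name build_vocabulary is ours, not from the repository:

```python
def build_vocabulary(sequences):
    """Map every unique symbol across all sequences to an integer id."""
    symbols = sorted({symbol for seq in sequences for symbol in seq}, key=str)
    symbol_to_int = {symbol: idx for idx, symbol in enumerate(symbols)}
    int_to_symbol = {idx: symbol for symbol, idx in symbol_to_int.items()}
    return symbol_to_int, int_to_symbol

note_to_int, int_to_note = build_vocabulary(filtered_notes)
duration_to_int, int_to_duration = build_vocabulary(filtered_durations)

# Peek at a few of the note/chord mappings
print(list(note_to_int.items())[:5])
```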
We now have the mapping ready. In the following snippet, we prepare the training dataset as sequences of length 32 with their corresponding target as the very next token in the sequence.

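A sketch of this windowing step, assuming the vocabularies built above and a helper named make_sequences:

```python
import numpy as np

SEQ_LEN = 32

def make_sequences(sequences, mapping, seq_len=SEQ_LEN):
    """Slide a fixed-length window over each piece; the target is the next token."""
    inputs, targets = [], []
    for seq in sequences:
        encoded = [mapping[symbol] for symbol in seq]
        for i in range(len(encoded) - seq_len):
            inputs.append(encoded[i : i + seq_len])
            targets.append(encoded[i + seq_len])
    return np.array(inputs), np.array(targets)

note_inputs, note_targets = make_sequences(filtered_notes, note_to_int)
duration_inputs, duration_targets = make_sequences(filtered_durations, duration_to_int)
```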
As we’ve seen, the dataset preparation stage was mostly straightforward apart from a few nuances associated with the handling of MIDI files. The generated sequences and their corresponding targets are shown in the following output snippet for reference.

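If you are running the code yourself, a quick way to inspect the prepared data is to print one window and its target (the index used here is arbitrary):

```python
# Quick sanity check: one input window and its target
print("Note sequence    :", note_inputs[0])
print("Next note        :", note_targets[0])
print("Duration sequence:", duration_inputs[0])
print("Next duration    :", duration_targets[0])
```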
The transformed dataset is now a sequence of numbers, just like in the text generation case. The next item on the list is the model itself.
LSTM Model for Music Generation
Unlike text generation (using a char-RNN), where we usually have only a handful of input symbols (lower- and upper-case letters and numbers), the number of symbols in the case of music generation is large (~500). On top of this, we need a few additional symbols for time/duration-related information. With this larger input vocabulary, the model requires more training data and more capacity to learn (capacity in terms of the number of LSTM units, embedding size, and so on).
The next obvious change we need to take care of is the model's capability to take two inputs at every time-step. In other words, the model should take both the note and its duration as input at every time-step and generate an output note along with its corresponding duration. To do so, we leverage the functional tensorflow.keras API to prepare a multi-input, multi-output architecture.
Stacked LSTMs have a definite advantage over networks with a single LSTM layer in terms of being able to learn more sophisticated features. In addition, attention mechanisms help alleviate issues inherent to RNNs, such as difficulty in handling long-range dependencies. Since music is composed of local as well as global structures, perceivable in the form of rhythm and coherence, attention mechanisms can certainly make an impact.
The following code snippet prepares a multi-input stacked LSTM network in the manner discussed.

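Below is a minimal sketch of such a network, assuming the vocabularies from the earlier snippets; the layer sizes and the simple additive attention used here are illustrative choices, and the repository's exact architecture may differ:

```python
import tensorflow as tf
from tensorflow.keras import Model, layers

def build_model(n_notes, n_durations, seq_len=32, embed_dim=64, rnn_units=256):
    # Two parallel inputs: note tokens and duration tokens
    notes_in = layers.Input(shape=(seq_len,), name="notes_in")
    durations_in = layers.Input(shape=(seq_len,), name="durations_in")

    # Separate embeddings for the two vocabularies
    notes_emb = layers.Embedding(n_notes, embed_dim)(notes_in)
    durations_emb = layers.Embedding(n_durations, embed_dim)(durations_in)

    # Concatenate along the feature axis and pass through stacked LSTMs
    x = layers.Concatenate()([notes_emb, durations_emb])
    x = layers.LSTM(rnn_units, return_sequences=True)(x)
    x = layers.LSTM(rnn_units, return_sequences=True)(x)

    # Simple additive attention: score each time-step, softmax over time,
    # then take the weighted sum of the LSTM outputs as a context vector
    scores = layers.Dense(1, activation="tanh")(x)
    weights = layers.Softmax(axis=1)(scores)
    context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])

    # The network diverges again into two softmax heads
    notes_out = layers.Dense(n_notes, activation="softmax", name="notes_out")(context)
    durations_out = layers.Dense(n_durations, activation="softmax", name="durations_out")(context)

    model = Model([notes_in, durations_in], [notes_out, durations_out])
    model.compile(
        optimizer="adam",
        loss=["sparse_categorical_crossentropy", "sparse_categorical_crossentropy"],
    )
    return model

model = build_model(len(note_to_int), len(duration_to_int))
```

The attention block here simply pools the LSTM outputs into a single context vector before the two output heads; more elaborate attention formulations would work just as well.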
The model prepared using the above snippet is a multi-input network (one input each for notes and durations). At a high level, the model setup is as follows:
- Each input is transformed into vectors using its respective embedding layer.
- The two embedded inputs are concatenated.
- The concatenated inputs are then passed through a couple of LSTM layers, followed by a simple attention mechanism.
- After this point, the model diverges again into two outputs (one for the next note and the other for the duration of that note). Readers are encouraged to use keras utilities to visualise the network.
Model Training and Music Generation
Training this model is as simple as calling the fit() function on the keras model object. We train the model for about 100 epochs. The figure below depicts the learning progress of the model across different epochs.
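A sketch of the training call, assuming the input and output layer names used in the model sketch above (the batch size and validation split are arbitrary choices):

```python
history = model.fit(
    {"notes_in": note_inputs, "durations_in": duration_inputs},
    {"notes_out": note_targets, "durations_out": duration_targets},
    batch_size=64,
    validation_split=0.1,
    epochs=100,
)
```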

As shown in the figure, the model is able to learn a few repeating patterns in the generated music samples. Here, we made use of temperature-based sampling as our decoding strategy [_reference link_].
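For reference, a generic sketch of temperature-based sampling (not the exact code from the repository) looks like this:

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0):
    """Sample an index from a probability distribution rescaled by temperature.

    Lower temperatures make the choice closer to a greedy argmax; higher
    temperatures make it more random.
    """
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-9) / temperature
    scaled = np.exp(logits - np.max(logits))
    scaled /= np.sum(scaled)
    return np.random.choice(len(scaled), p=scaled)
```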
Let’s Hear The Sound of AI
The following are two samples generated by our model after 5 and 50 epochs respectively. They are not perfect, but they are certainly promising, much like an up-and-coming musician.
Observe the improvement between these two samples. The output after 50 epochs is more coherent and rhythmic than the one after just 5 epochs.
Summary
This was a fairly simple implementation of music generation using deep learning models, and we drew analogies with concepts from text generation. In the book this article is based on, we go further and perform music generation using more advanced techniques, including Generative Adversarial Networks such as MuseGAN.
About the authors
Joseph Babcock has spent more than a decade working with big data and AI in the e-commerce, digital streaming, and quantitative finance domains. Throughout his career he has worked on recommender systems, petabyte-scale cloud data pipelines, A/B testing, causal inference, and time series analysis. He completed his PhD studies at Johns Hopkins University, applying machine learning to the field of drug discovery and genomics.
Raghav Bali is the author of multiple well-received books and a Senior Data Scientist at one of the world's largest healthcare organisations. His work involves research and development of enterprise-level solutions based on Machine Learning, Deep Learning, and Natural Language Processing for healthcare and insurance-related use cases. His previous experience includes working at Intel and American Express. Raghav has a master's degree (gold medalist) from the International Institute of Information Technology, Bengaluru.