MusicAE: A Musician Friendly Machine Learning Tool

An overview and example of how to make machine learning tools accessible to musicians in the digital age


Written by Theo Jaquenoud, Samuel Maltz, and Zachary Friedman

What musical ML tools currently exist?

Over the last few years, a number of tools have been published to introduce machine learning (ML) to the world of music. Most notably, Google has published dozens of fun apps and prototypes using Magenta, “an open source research project exploring the role of machine learning as a tool in the creative process.” Of the myriad ML tools available, some aim to enhance artists’ work, but many seek instead to replace the artist entirely.

A.I. Duet lets you play a virtual piano and automatically performs a duet along with you. Celebrating Johann Sebastian Bach was a Google Doodle (the apps and animations that temporarily replace the Google logo on the search page) that let you compose a single melody and generated a four-part counterpoint in the style of Bach, while admittedly breaking some of Bach’s fundamental rules, about which he wrote extensively. On a sillier note, Blob Opera lets you control a group of blob-shaped opera singers, inputting the pitch and vowel in real time. While such programs can be extraordinarily complex and fun to use, they are more or less relegated to novelty; no serious artist would produce music through a singing blob or an imitative piano program. However, another breed of musical ML is rising alongside this one, one that expands the tools available to the artist rather than replacing the artist entirely.

Blob opera singers are astonished by their soprano's coloratura performance | by Google Arts and Culture

Two of the most popular such tools are Tone Transfer and NSynth. Tone Transfer allows musicians to upload short samples played on any instrument (or non-instrument), and it outputs a reinterpretation of the sample performed on one of four instruments: flute, saxophone, trumpet, or violin. It captures the pitch and inflections of the original piece surprisingly well, but where it fails is the interface. The tool is only available as a web app, and it clunkily requires artists to upload and download their short samples one at a time, with no easy connection to common programs such as digital audio workstations (DAWs). NSynth, on the other hand, offers extensive open-source documentation on how to build a physical instrument (the NSynth Super), with standard interfaces like the Musical Instrument Digital Interface (MIDI) and USB. The problem with NSynth is that, even if you can assemble it, it only works as a standalone instrument and can’t be implemented as software for the synthesizers or MIDI controllers that most musicians already own.

An example of an assembled NSynth built on top of a Raspberry Pi with MIDI interface | by NSynth Super

So what would constitute an ideal tool for musicians to gain access to machine learning? In many of the examples we’ve shown above, the AI is replacing, or attempting to replace, the musician. In some cases, it tries to imitate a composer, in other cases a performer, but it tends to fall short. We think ML techniques can be most successful when they are used to enhance the artist’s creative experience. Additionally, the tool must be accessible, intuitive to use, and fit into a musician’s workflow. The best way to guarantee this last point is to build the tool on top of platforms that musicians are already familiar with, and in this digital age, the gold standard is the digital audio workstation. If the machine learning methods can be implemented as a plug-in in a standard format like virtual studio technology (VST), musicians will be able to integrate it seamlessly into their normal workflow.

Introducing MusicAE

Mission statement

MusicAE uses advanced machine learning techniques to create a unique representation of those elements of sound that are difficult to quantify, and provides a simple interface to creatively interact with them.

How does it work?

MusicAE is a neural-network-powered synthesizer and effects tool that uses an autoencoder architecture to learn the “fingerprints” of certain sounds and smoothly interpolate between them. Autoencoders take raw data, such as audio signals, as inputs. They learn to encode, or compress, the data using less information, a process called dimensionality reduction, and then to symmetrically reconstruct the data into its original form. At the junction between the encoder and the decoder, the entirety of the data is represented in as few numbers as possible. This is known as the latent space, and it represents the distilled essence of the input data, the “fingerprint.”

Graphical representation of an autoencoder neural network architecture | by Steven Flores
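To make this concrete, here is a minimal sketch of a dense autoencoder in TensorFlow/Keras operating on short frames of raw audio. The frame size, layer widths, and activations are illustrative assumptions rather than MusicAE’s exact architecture; the ten-dimensional latent space matches the ten sliders in the prototype GUI described later.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

FRAME_SIZE = 1024  # samples per audio frame (an assumption for this sketch)
LATENT_DIM = 10    # matches the ten latent sliders in the prototype GUI

# Encoder: compress a raw audio frame into a 10-dimensional latent vector
encoder_in = layers.Input(shape=(FRAME_SIZE,))
h = layers.Dense(512, activation="relu")(encoder_in)
h = layers.Dense(128, activation="relu")(h)
latent = layers.Dense(LATENT_DIM, activation="tanh", name="latent")(h)
encoder = Model(encoder_in, latent, name="encoder")

# Decoder: symmetrically reconstruct the frame from the latent vector
decoder_in = layers.Input(shape=(LATENT_DIM,))
h = layers.Dense(128, activation="relu")(decoder_in)
h = layers.Dense(512, activation="relu")(h)
decoder_out = layers.Dense(FRAME_SIZE, activation="tanh")(h)  # audio in [-1, 1]
decoder = Model(decoder_in, decoder_out, name="decoder")

# Full autoencoder: encode then decode, trained to reproduce its own input
autoencoder = Model(encoder_in, decoder(encoder(encoder_in)), name="autoencoder")
autoencoder.compile(optimizer="adam", loss="mse")

# Training on a corpus of normalized audio frames (synth, organ, and other tones):
# frames = np.load("corpus_frames.npy")  # shape (num_frames, FRAME_SIZE), hypothetical file
# autoencoder.fit(frames, frames, epochs=50, batch_size=64)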

We train our autoencoder on a corpus of tones generated by a variety of synthesizers, organs, and other keyboard instruments. Over several epochs of training, the autoencoder learns how to efficiently compress sounds into a lower-dimensional latent space and then decode them back into nearly identical sounds. Where this neural network becomes creative is when you bypass the encoder and simply introduce a vector of numbers into the latent space. The network will decode the latent vector as if it had been encoded from sound, and will output a short audio sample distinct from any of the sounds it was trained on. This is where the musical tool comes in. We envision three potential uses for MusicAE, all of which are sketched in code after the list below:

  • Synthesizer: This is the case described above. There is no musical input to the autoencoder; the user simply defines a latent vector, either on a GUI or using a MIDI controller, and the decoder network creates a sound. Smooth changes to the latent space correspond to smooth changes in the output, providing musicians with a brand new way of synthesizing unique sounds.
  • Effects: In effects mode, the autoencoder takes an audio stream as input, and the decoder reconstructs it. By introducing small changes to the latent space, the sound is transformed in subtle ways, much as if it had been passed through distortion, equalization, or other effects.
  • Mixer: In this mode, two parallel networks simultaneously encode two audio streams. Then, much as a crossfader lets you transition smoothly between tracks, the two latent spaces are mixed together and decoded as one. Users can not only choose how much of each track to include in the mix, but can also control the mixing for each dimension of the latent space, resulting in creative mixes.
Summary of the three operations provided by MusicAE | by authors
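Using an encoder and decoder like the ones sketched above, the three modes reduce to three different ways of producing a latent vector to decode. The function names and the frame-by-frame processing below are illustrative assumptions, not MusicAE’s actual implementation.

import numpy as np

def synthesize(decoder, latent_vector):
    # Synthesizer mode: decode a user-defined latent vector (from GUI sliders
    # or MIDI knobs) directly into an audio frame, with no audio input at all.
    z = np.asarray(latent_vector, dtype=np.float32)[None, :]  # shape (1, LATENT_DIM)
    return decoder.predict(z, verbose=0)[0]

def apply_effect(encoder, decoder, frame, offset):
    # Effects mode: encode an incoming frame, nudge its latent vector by a
    # small offset, and decode the result, subtly transforming the sound.
    z = encoder.predict(frame[None, :], verbose=0)
    return decoder.predict(z + offset, verbose=0)[0]

def mix(encoder, decoder, frame_a, frame_b, weights):
    # Mixer mode: encode two frames in parallel and blend their latent vectors
    # per dimension; weights is a vector in [0, 1], one entry per latent
    # dimension (0 = all track A, 1 = all track B).
    z_a = encoder.predict(frame_a[None, :], verbose=0)
    z_b = encoder.predict(frame_b[None, :], verbose=0)
    w = np.asarray(weights, dtype=np.float32)
    return decoder.predict((1.0 - w) * z_a + w * z_b, verbose=0)[0]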

Unlike a traditional synthesizer, which has dedicated controls for features like pitch, waveform, attack, decay, center frequency, harmonics, et cetera, the autoencoder learns these characteristics of sound and abstracts them into a latent space in a way that amorphously combines them all. In the three uses described above, the artist gets to discover a new dimension, or several, to the sounds and tools they are used to working with. Being able not only to fade between two sounds, but to intricately mix them based on their latent representations, is an entirely unique experience that no other tool currently provides.

Demonstration

Examples of the mixing and synthesizer modes are shown below using a prototype graphical user interface (GUI). For mixing, the horizontal slider controls how much of each track to combine into the latent space for decoding, whereas the ten vertical sliders blend the two tracks’ individual latent dimensions. In this prototype GUI, the autoencoder models are pre-trained and can be loaded in via a file name, as can the input audio files in the case of mixing and effects. In the demo video, one input sound is a pulsating synth bass and the other is an organ. Notice how the different qualities of each sound, such as the pulsations of the bass, combine as the latent dimensions are changed. In the case of the synthesizer, each slider determines the value of one dimension of the latent space, and we hear the decoded output.
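Behind the file-name fields of the prototype GUI, the loading step might look roughly like the following; the model and audio file names, the sample rate, and the use of librosa for audio I/O are assumptions for illustration.

import librosa
import numpy as np
from tensorflow.keras.models import load_model

# Load a pre-trained encoder/decoder pair by file name (names are hypothetical)
encoder = load_model("musicae_encoder.h5")
decoder = load_model("musicae_decoder.h5")

# Load the two input tracks for mixing mode and cut them into fixed-size frames
FRAME_SIZE = 1024
bass, sr = librosa.load("pulsating_bass.wav", sr=16000, mono=True)
organ, _ = librosa.load("organ.wav", sr=16000, mono=True)

def to_frames(signal, frame_size=FRAME_SIZE):
    # Trim the signal to a whole number of frames and stack them row by row
    n = (len(signal) // frame_size) * frame_size
    return signal[:n].reshape(-1, frame_size).astype(np.float32)

bass_frames, organ_frames = to_frames(bass), to_frames(organ)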

Working towards a plug-in

As we discussed in the first section, the quality of a musical tool, and how much use it can be to a musician, is directly related to how well it fits into their workflow. Of the few AI musical programs out there that are intended for use by musicians, none has adequately considered the environment in which musicians engage with their creative process. Today, virtually all music is processed with a DAW, whether the recording itself is digital or it is simply being mastered there. If you want musicians to use your ML tool, you must make it work with DAWs!

The most common way to create programs that work within a DAW is to build plug-ins in widely supported formats like VST, which are compatible with most DAWs. These let musicians and programmers create their own GUIs that interact directly with music loaded into a DAW, and they can implement effects, MIDI controls, or even entire virtual instruments. For MusicAE, we would like the plug-in to read and write audio directly from and to the DAW. Ideally, these operations should be as simple as highlighting tracks to select inputs and defining a blank track as an output, or loading in our synthesizer as a software instrument. The sliders controlling the latent space can be moved directly on the screen, or mapped to physical knobs and sliders on a MIDI controller connected through the DAW, as in the rough sketch below.
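As a rough illustration of that MIDI mapping, a standalone Python prototype could listen for control-change messages and route them to the latent dimensions. The mido library, the default input port, and the CC numbers 20 through 29 are all assumptions; in a finished VST, the plug-in host would handle this mapping.

import mido
import numpy as np

LATENT_DIM = 10
latent = np.zeros(LATENT_DIM, dtype=np.float32)

# Map MIDI CC numbers 20-29 onto the ten latent dimensions (the CC range is arbitrary)
CC_TO_DIM = {20 + i: i for i in range(LATENT_DIM)}

with mido.open_input() as port:  # default MIDI input port
    for msg in port:
        if msg.type == "control_change" and msg.control in CC_TO_DIM:
            # Scale the 0-127 CC value into the latent range [-1, 1]
            latent[CC_TO_DIM[msg.control]] = msg.value / 127.0 * 2.0 - 1.0
            # frame = decoder.predict(latent[None, :], verbose=0)[0]  # then decode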

While our original code was written in Python, which gave us easy access to the most widely used machine learning and neural network libraries, notably TensorFlow, VSTs and software instruments are almost universally written in C++. Porting the trained autoencoder model to a different language and learning to work within the constraints of VSTs have proven to be a challenge, but we have been able to replicate the functionality of our prototype GUI, and we are now working towards a more intuitive interface that fits seamlessly within any DAW. A mock-up based on our existing GUI is shown below alongside the popular macOS app GarageBand.

Mock-up of MusicAE integrated as a plug-in for a macOS-based DAW | by authors
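One concrete piece of that porting work is exporting the trained model from Python in a form a C++ plug-in can load. A minimal sketch using TensorFlow’s SavedModel format, which both the TensorFlow C/C++ APIs and the TensorFlow Lite converter can consume, is shown below; the file and directory names are assumptions.

import tensorflow as tf
from tensorflow.keras.models import load_model

# Export the trained Keras decoder as a SavedModel, a format the TensorFlow
# C/C++ APIs can load directly inside a native plug-in
decoder = load_model("musicae_decoder.h5")  # hypothetical file name
tf.saved_model.save(decoder, "export/musicae_decoder")

# Optionally convert to TensorFlow Lite for a smaller runtime footprint
converter = tf.lite.TFLiteConverter.from_saved_model("export/musicae_decoder")
with open("export/musicae_decoder.tflite", "wb") as f:
    f.write(converter.convert())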

Conclusion

While there is much exciting work happening at the intersection of machine learning and music, few developers have seriously taken into consideration the needs of digital musicians. Our goal with MusicAE is not just to build a tool that presents musicians with a completely novel way of interacting with music, but also to help them integrate it into their workflow using methods they are familiar with. So far, we have experimented with different neural network architectures and found suitable models which we have tested in standalone GUIs. We now set our sights on fully incorporating this functionality into a VST plug-in to make our tool compatible with a majority of DAWs, thereby making it accessible to musicians.

This project was completed by Theo Jaquenoud, Samuel Maltz, and Zachary Friedman as part of a senior electrical engineering design project at The Cooper Union. We would like to thank our advisor Professor Samuel Keene for introducing us to this project, as well as Joseph Colonel, a Cooper Union alumnus and current PhD candidate in the Centre for Digital Music at Queen Mary University of London. He first conceived of this idea and provided much of the foundation on which we built our software.
