VAEsemane — A Variational Autoencoder for musical human-robot interaction

Niclas Wesemann
Towards Data Science
5 min read · Apr 5, 2019


Humanoid robots are solving increasingly difficult tasks as their ability to interact intelligently with their environment steadily improves. The goal of this project is to play music interactively with a software synthesizer or the humanoid robot Roboy. Musical improvisation, the act of creating a spontaneous response to a currently presented musical sequence, demands a deep understanding of the instrument and of the presented sequence in terms of key, rhythm, and harmony, which makes such a system particularly difficult to put into action.

Researchers at Google Magenta (MusicVAE) and ETH Zurich (MIDI-VAE) have recently been working on Variational Autoencoders (VAEs) that generate music in software. A combined system of hardware and software, the Shimon robot, was presented at Georgia Tech.

Fig. 1: Shimon robot developed at Georgia Tech [Source: https://www.shimonrobot.com/]

For this project, VAEs were chosen as the basis for improvisation. VAEs have the great benefit of producing a continuous latent space that allows both reconstruction and interpretation of data. Once a VAE has been trained, its decoder can generate data that resembles the training data simply by sampling latent vectors from a unit Gaussian and decoding them. This random generation process is shown in one of the video examples below.

The System

Fig. 2: A full framework for robotic improvisation in real time

The system waits for the input of a musical sequence and responds with an improvised sequence. It is divided into the modules shown in Fig. 2. Any MIDI controller, keyboard, or synthesizer can be used to play musical sequences and feed the system with live input. The sequence is processed by the encoder of a VAE, which embeds it into latent space. The latent representation of the musical sequence, also referred to as the embedding, can either be reconstructed or modified using a graphical user interface (GUI) called the Latent space modifier. The GUI allows modifying the values of the 100-dimensional latent vector, creating a new latent vector that is decoded by the VAE decoder to form a new musical sequence. Optionally, the sequence can be smoothed by a Note smoother module, which regards short MIDI notes as noise and deletes them according to a user-defined threshold (a minimal sketch follows below). The improvised sequence can be played by a software synthesizer or sent to the simulated robot Roboy via the Robot Operating System (ROS) so that the robot performs it on the marimba.
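The Note smoother is essentially a duration filter. Here is a minimal sketch of the idea in Python; the (pitch, onset, offset) tuple format and the default threshold are assumptions for illustration, not the project's actual interface:

```python
def smooth_notes(notes, min_duration=0.1):
    """Drop MIDI notes shorter than a user-defined threshold (in seconds).

    `notes` is assumed to be a list of (pitch, onset, offset) tuples;
    the real module's data format may differ.
    """
    return [(pitch, on, off) for (pitch, on, off) in notes
            if off - on >= min_duration]
```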

Instead of using a recurrent VAE (as done by Google Magenta and ETH Zurich), this system uses a convolutional VAE. Fig. 3 shows the encoder of the VAE; the decoder consists of the inverse mapping of the encoder.

Fig. 3: Architecture of the VAE encoder (adapted from Bretan et al. 2017). The decoder consists of the inverse mapping without the reparametrization step. The figure was created using NN-SVG.

The VAE was implemented in PyTorch and trained on the MAESTRO dataset. Fig. 3 shows all convolution operations; four dense layers follow before a musical sequence is embedded into a 100-dimensional latent vector. Batch Normalization and the ELU activation function are applied at each layer. The code describing the VAE model at the heart of this framework is shown below. You can find the full source code of the project here.
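In case the embedded gist does not render, the following is a minimal PyTorch sketch of such a convolutional VAE, not the project's exact code: the kernel sizes, channel counts, the simplified dense stack, and the piano-roll input shape (88 pitches × 16 time steps) are assumptions; the 100-dimensional latent space, Batch Normalization, ELU activations, and the mirrored decoder follow the description above.

```python
import torch
import torch.nn as nn

class ConvVAE(nn.Module):
    """Sketch of a convolutional VAE: piano roll in, 100-dim latent, mirrored decoder."""

    def __init__(self, pitches=88, steps=16, latent_dim=100):
        super().__init__()
        # Encoder: stacked convolutions, each with Batch Normalization and ELU.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ELU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ELU(),
            nn.Flatten(),
        )
        flat = 64 * (pitches // 4) * (steps // 4)
        # Dense layers map the features to the parameters of q(z|x).
        self.fc_mu = nn.Linear(flat, latent_dim)      # mean of the latent Gaussian
        self.fc_logvar = nn.Linear(flat, latent_dim)  # log-variance of the latent Gaussian
        # Decoder: the inverse mapping, without the reparametrization step.
        self.fc_dec = nn.Linear(latent_dim, flat)
        self.decoder = nn.Sequential(
            nn.Unflatten(1, (64, pitches // 4, steps // 4)),
            nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.BatchNorm2d(32), nn.ELU(),
            nn.ConvTranspose2d(32, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.Sigmoid(),  # note-on probability per pitch/time cell
        )

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps with eps ~ N(0, I), keeping sampling differentiable.
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def decode(self, z):
        # Map a latent vector back to a piano roll (used alone for generation).
        return self.decoder(self.fc_dec(z))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        return self.decode(self.reparameterize(mu, logvar)), mu, logvar
```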

The simulation uses the CARDSFlow framework (source code for CARDSFlow here, paper here), a workflow for the design, simulation, and control of musculoskeletal robots developed by the Roboy Project in cooperation with the Chinese University of Hong Kong. Figure 4 shows the pipeline from the design of a robot in Autodesk Fusion 360 to the control of the real hardware.

Fig. 4: CARDSFlow pipeline. It consists of (from left to right): 1) Robot design in Autodesk Fusion 360, 2) Simulation in Gazebo, 3) Muscle control through CASPR in Gazebo and 4) Control of the real robot.

The following image (Fig. 5) presents the setup of the simulated Roboy robot in RVIZ, a visualization tool of ROS.

Fig. 5: Setup of Simulation in RVIZ.
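As mentioned above, the improvised sequence reaches the simulated robot over ROS. For illustration, here is a hedged sketch of what such a handoff could look like; the node name, topic, and message layout are placeholder assumptions, not the project's actual interface:

```python
import rospy
from std_msgs.msg import UInt8MultiArray

# Hypothetical node and topic names; the real project may differ.
rospy.init_node('vae_improviser')
pub = rospy.Publisher('/roboy/midi_notes', UInt8MultiArray, queue_size=10)

def send_notes(pitches):
    """Publish a list of MIDI pitch numbers (0-127) for the robot to play."""
    pub.publish(UInt8MultiArray(data=pitches))
```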

Examples

There are five example videos. Three of them show the VAE model generating music in software, while two show Roboy performing the improvisation. The GUI, also referred to as the latent space modifier, has 100 digital potentiometers for modifying the latent vector of the currently embedded sequence. Modifying the latent vector results in an interpretation or improvisation, since it slightly shifts the Gaussian distribution of the currently embedded sequence. In the examples that include the simulated Roboy robot, the green silhouette shows the target pose of the robot when hitting a marimba bar. Note: the simulated robot skips MIDI notes while it is performing a marimba strike, which is why it plays fewer notes than are generated in software.

Generate Mode

The first example shows the performance of the VAE model in “Generate Mode”. In this mode, a latent vector is randomly sampled from a unit Gaussian distribution and then decoded by the VAE decoder to generate unseen musical sequences.
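In code, “Generate Mode” reduces to sampling from the prior and decoding. A minimal sketch, reusing the ConvVAE class sketched earlier (the 0.5 binarization threshold is an assumption):

```python
import torch

model = ConvVAE()   # the sketch from above; in practice, load trained weights
model.eval()
with torch.no_grad():
    z = torch.randn(1, 100)         # sample a latent vector from the unit Gaussian prior
    roll = model.decode(z)          # decode it into a piano roll of note probabilities
    notes = (roll > 0.5).nonzero()  # threshold into discrete note events (assumed 0.5)
```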

Variational Autoencoder — “Generate Mode” (Software Generation)

Interact Mode

The following examples use the “Interact Mode” which allows human-robot interaction. The software/robot waits for human input and generates a musical response that is based on the current input sequence.

Variational Autoencoder — “Interact Mode” (Software Improvisation)
Variational Autoencoder and Roboy robot — “Interact Mode” (Robotic Improvisation)

Endless Mode

The following examples show the “Endless Mode”, which waits for human input and then keeps playing music based on that sequence until you stop it. Each generated sequence (the output of the VAE decoder) is fed back to the input to produce a new interpretation. There is no interaction beyond the first sequence, which serves as a guideline for the software/robotic improvisation. By changing the values of the digital potentiometers, you can traverse the learned latent space and thereby hear what the VAE model has learned (a sketch of this feedback loop follows the videos below).

Variational Autoencoder — “Endless Mode” (Software Improvisation)
Variational Autoencoder and Roboy robot — “Endless Mode” (Robotic Improvisation)
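Sketched in code, the feedback loop looks roughly like this, again assuming the ConvVAE class from above; the offsets tensor stands in for the GUI's potentiometer values, and play_sequence() is a hypothetical playback hook:

```python
import torch

def endless_mode(model, seed_roll, offsets, iterations=32):
    """Feed each decoded output back in as the next input, shifted by the GUI offsets."""
    roll = seed_roll                      # the single human input that seeds the loop
    for _ in range(iterations):
        with torch.no_grad():
            h = model.encoder(roll)       # embed the current sequence
            z = model.fc_mu(h) + offsets  # potentiometer values nudge the embedding
            roll = model.decode(z)        # the new output becomes the next input
        # play_sequence(roll)             # hypothetical playback / ROS hook
    return roll
```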

Roboy Project

Roboy Project is an open-source platform for musculoskeletal robot development with the goal of advancing humanoid robotics to a state where its robots are as capable as the human body. Roboy stands out from other robots because it is actuated by motors and tendons that mimic human muscles and tendons, rather than by motors in each joint. Check out the Roboy Project website or the GitHub account for more info.

Chest of Roboy 2.0 [Source: https://roboy.org]
