Image by Author

Neural Networks for Real-Time Audio: Stateful LSTM

This is the fourth of a five-part series on using neural networks for real-time audio. For the previous article on Stateless LSTMs, click here.

Towards Data Science
7 min read · May 22, 2021

--

We will revisit the LSTM for our last neural net model. This time we will use the stateful version and make use of its recurrent internal state to model the Blackstar HT40 guitar amplifier.

For a quick refresher: LSTMs (Long Short-Term Memory networks) are a type of recurrent neural network commonly used for tasks such as text-to-speech and natural language processing. They have a recurrent state that is updated each time new data is fed through the network; in this way, the LSTM has a memory. In this article we will use the LSTM model presented in the paper “Real-Time Guitar Amplifier Emulation with Deep Learning”¹.

Overview

Using a stateful LSTM allows us to simplify the overall network structure from the previous article. We will not be using the two 1-D convolutional layers or the data pre-processing. All we need here is the LSTM layer followed by a Dense layer.

Figure 1: General network architecture¹ (Image by Author)

A single audio sample is fed to the network and a single sample is predicted. There is no need for a range of samples since the necessary information about the past signal is stored in the LSTM’s recurrent state.

It’s important to note that a skip connection is performed, where the input sample value is added to the output value. This way, the network only has to learn the difference between the output and the input. This technique is defined in the amp emulation paper¹.
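As a rough PyTorch sketch of Figure 1 (the class and parameter names here are my own, not the paper’s or the training project’s code), the network is just an LSTM feeding a Dense layer, with the input added back at the end:

```python
import torch
import torch.nn as nn

class StatefulLSTM(nn.Module):
    """Illustrative LSTM -> Dense network with a skip connection."""
    def __init__(self, hidden_size=20):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size)
        self.dense = nn.Linear(hidden_size, 1)
        self.state = None  # (h, c): the recurrent state kept between calls

    def forward(self, x):
        # x: (num_samples, batch, 1) -- one value per audio sample
        out, self.state = self.lstm(x, self.state)
        # Skip connection: the net learns only the difference from the input
        return self.dense(out) + x
```

Because `self.state` is carried between calls, feeding the next buffer of samples continues from where the last one left off, which is what makes the model “stateful.”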

PyTorch Training

The example code for the PyTorch training comes from the Automated-GuitarAmpModelling project on GitHub. This project does not contain a license file (at the time of writing), so I will not be displaying the code here, but you can view the full Python code on GitHub. I used the “SimpleRNN” model defined in the networks.py file, which includes the implementation of the LSTM network described in the amp emulation paper¹.

In the previous articles we used the Blackstar HT40 amplifier’s overdrive channel on 25% gain. Initial tests with the Stateful LSTM went so well that I decided to crank up the gain to 100% to see what the network could handle. I made a new recording using my Fender Telecaster and the HT40. The HT40 was once again mic’d with a Shure SM57 dynamic microphone.

I used a Focusrite Scarlett 2i2 audio interface and the Reaper DAW (Digital Audio Workstation) running on Windows 10. The recording setup used a signal splitter from the guitar, with one signal going to channel 1 of the audio interface and the microphone output going to channel 2. In this way, both tracks could be recorded simultaneously with minimal latency between the two signals. The LSTM model is highly sensitive to any time shift between the two signals.

I moved the recording laptop and audio interface away from the amp to reduce any noise from electronic interference. The pedalboard on the left of the image below was only used for the signal splitting pedal.

Recording setup (Image by Author)

The area around the amp was surrounded by noise dampeners (I just used blankets) to reduce echoes while recording. It’s important to remove room reverb as much as possible to improve training accuracy.

Mic’d Blackstar HT40 amp (Image by Author)

The training was accomplished by running the PyTorch code on Google Colab, a browser-based service for running Python code in the cloud. Colab even grants free access to GPUs and TPUs, with some limitations. Using the GPU runtime, I was able to train the HT40 model in about 40 minutes.

For the network size, the amp emulation paper¹ tested LSTM hidden sizes of 32, 64, and 96. The bigger the network, the more complex the signals it can learn, but at the cost of processing speed. I found that a hidden size of 20 was fully capable of reproducing the sounds I recorded while improving real-time performance.

A loss value of 0.069 was achieved for the HT40 amp at 100% gain. The training code uses a technique called “adaptive learning rate” which reduces the aggressiveness of the training throughout the session. You can think of this as fine-tuning the model as it gets closer to the target.
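The adaptive learning rate idea can be sketched with PyTorch’s `ReduceLROnPlateau` scheduler (the optimizer settings below are illustrative, not the project’s exact configuration):

```python
import torch

model = torch.nn.Linear(1, 1)  # stand-in for the LSTM network being trained
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3)

# Halve the learning rate whenever the validation loss stops improving
# for a few epochs -- "fine-tuning" as the model nears the target.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2)

for epoch in range(6):
    val_loss = 0.1  # placeholder: a validation loss that has plateaued
    scheduler.step(val_loss)

print(optimizer.param_groups[0]["lr"])  # now reduced below the initial 5e-3
```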

Here is a sample of the input recording from my Fender Telecaster:

This is the sample from the Blackstar HT40 at 100% gain:

And here is the PyTorch prediction of that same sample:

If we compare the actual vs. predicted signal, we can see a close alignment of both the amplitude and the features of the audio. The plot below was made using Matplotlib, a Python plotting library.

Figure 2: Actual vs. predicted signal of HT40 amplifier at 100% gain (loss 0.069)

Model Conversion

Before the trained model can be used in real time, it must be converted into a format the plugin can load. The Automated-GuitarAmpModelling project automatically exports the model state data to a JSON file. The JSON file contains information about the layers, such as the number of LSTM hidden units and the input/output sizes. The majority of the data comes from the trained weights and biases of the LSTM and Dense layers; these are arrays of values that will be used by the real-time code.
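A minimal sketch of such an export might look like the following (the layer names and JSON keys here are hypothetical, not the project’s exact format; the layer shapes follow this article, with hidden size 20 and mono input/output):

```python
import json
import torch

# Stand-in for the trained network (random weights instead of trained ones)
model = torch.nn.ModuleDict({
    "lstm": torch.nn.LSTM(input_size=1, hidden_size=20),
    "dense": torch.nn.Linear(20, 1),
})

state = {
    # Layer information the real-time code needs to rebuild the network
    "model_data": {"unit_type": "LSTM", "hidden_size": 20,
                   "input_size": 1, "output_size": 1},
    # Every trained tensor (weights and biases) flattened to plain lists
    "state_dict": {k: v.tolist() for k, v in model.state_dict().items()},
}

with open("model.json", "w") as f:
    json.dump(state, f)
```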

Real-Time Implementation

The example code for the real-time implementation is presented below using NumCpp for matrix calculations. The full plugin implementation using the JUCE framework has not been released yet, but I’ll update this article once it’s public.

This is the main processing method of the LSTM/Dense layer inference. For each audio sample in the buffer, lstm_layer() and dense_layer() are called. For the skip connection, the input sample is added back to the output from the network.
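The per-sample loop can be sketched in Python as follows (the layer functions here are hypothetical stand-ins; the actual plugin implements them in C++ with NumCpp):

```python
import numpy as np

def lstm_layer(x):
    # Stand-in: the real version updates and returns the hidden state h_t
    return np.zeros(20)

def dense_layer(h):
    # Stand-in: the real version returns dot(weights, h) + bias
    return 0.0

def process_block(buffer):
    """Run the network on each sample, adding the skip connection."""
    out = np.empty_like(buffer)
    for i, x in enumerate(buffer):
        # The network predicts only the difference from the input,
        # so the input sample is added back to its output.
        out[i] = dense_layer(lstm_layer(x)) + x
    return out
```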

This is the LSTM layer, as implemented from the algorithm presented in the amp emulation paper¹. “c_t” and “h_t” (the cell and hidden states) are calculated for each index based on the LSTM layer hidden size “HS”. I used a hidden size of 20 based on testing for accuracy and real-time performance.

The dense layer reduces the LSTM output to one audio sample. The dense layer is simply a dot product of the LSTM output and trained dense layer weights, plus the bias value.
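As an illustrative NumPy sketch of the same math (random values stand in for the trained weights loaded from the JSON file; the real plugin does this in C++ with NumCpp):

```python
import numpy as np

class LSTMCell:
    """Stateful LSTM cell + dense layer, one audio sample at a time."""
    def __init__(self, hs=20, seed=0):
        rng = np.random.default_rng(seed)
        self.hs = hs
        # Random stand-ins for the trained parameters; the four gates
        # are stacked row-wise in the order [input i, forget f, cell g, output o]
        self.W = rng.normal(0, 0.1, (4 * hs, 1))   # input weights
        self.U = rng.normal(0, 0.1, (4 * hs, hs))  # recurrent weights
        self.b = rng.normal(0, 0.1, 4 * hs)        # biases
        self.dense_w = rng.normal(0, 0.1, hs)      # dense layer weights
        self.dense_b = 0.0                         # dense layer bias
        self.h = np.zeros(hs)  # hidden state h_t, carried between samples
        self.c = np.zeros(hs)  # cell state c_t, carried between samples

    def process(self, x):
        hs = self.hs
        sig = lambda v: 1.0 / (1.0 + np.exp(-v))
        z = self.W @ np.array([x]) + self.U @ self.h + self.b
        i = sig(z[0*hs:1*hs])       # input gate
        f = sig(z[1*hs:2*hs])       # forget gate
        g = np.tanh(z[2*hs:3*hs])   # candidate cell values
        o = sig(z[3*hs:4*hs])       # output gate
        self.c = f * self.c + i * g       # update cell state
        self.h = o * np.tanh(self.c)      # update hidden state
        # Dense layer: dot product with trained weights, plus the bias
        return float(self.dense_w @ self.h + self.dense_b)
```

Because `self.h` and `self.c` persist between calls, feeding the same input value twice produces different outputs: the recurrent state is the network’s memory of the past signal.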

These calculations are repeated 44,100 times per second of audio. Other digital effects can be added before or after the LSTM model, such as equalization or reverb. Multiple models can be used to cover the range of a particular setting such as gain. In the amp emulation paper, they use parameter conditioning to train a single model over multiple samples of audio, each at a different gain setting. The model can then cover the whole parameter space of a given knob setting.

Note: Other sample rates besides 44.1kHz are also common in audio, such as 48kHz or even 96kHz. The network will run at these higher sample rates, but more testing is needed to determine how this affects the audio. It may be that since the training data is 44.1kHz, this is the only rate that can be accurately predicted. A sample rate converter may be the solution for accounting for different input sample rates.

The amp emulation paper¹ mentions that future work may include implementing anti-aliasing to further improve the emulated sound. Aliasing occurs in digital audio when higher harmonics created by non-linear functions reflect back into the lower frequencies, producing undesired distortion. For a great explanation of one anti-aliasing technique, check out this article.

Performance

Of the three different neural net models, I found the stateful LSTM to be the most successful at reproducing a range of guitar amplifiers and pedals at real-time speeds. It was able to handle both high-gain sounds, such as the HT40 used here, and clean sounds. Its moderate training times won out over the WaveNet model, and the simplicity of the code makes it preferable to the conv1d layers + stateless LSTM from the previous article.

Note: The amp emulation paper¹ defines another RNN model, the GRU (gated recurrent unit). This is included in the Automated-GuitarAmpModelling code and performs similarly to the LSTM. The paper¹ notes that the GRU runs slightly faster, but may not handle a variety of sounds as well as the LSTM.

Here is a video demo of a plugin built with the JUCE framework running the stateful LSTM inference on several trained models:

Video by Author

We have now covered three different neural net architectures and their real-time implementations in C++ using the JUCE framework.

Thank you for reading! In the final article, we will build a guitar pedal using the stateful LSTM model and the Raspberry Pi.

  1. Alec Wright et al., “Real-Time Guitar Amplifier Emulation with Deep Learning,” Applied Sciences 10, no. 3 (2020): 766.
