Image by Author

Neural Networks for Real-Time Audio: Stateless LSTM

Keith Bloemer
Towards Data Science
10 min read · May 5, 2021


This is the third of a five-part series on using neural networks for real-time audio. For the previous article on WaveNet, click here.

In this article we will model a guitar amplifier using a Stateless LSTM neural network in real-time.

The LSTM model, which stands for “Long Short-Term Memory”, was developed in the mid 1990s and is a form of Recurrent Neural Network (RNN). Since then, the original model has been modified and applied to many different kinds of problems, including speech recognition and text-to-speech.

Instead of a “feed-forward” neural net like WaveNet, the LSTM has a recurrent state that is updated each time data flows through the network. In this way, information from the past can be used to make a prediction for the present. In other words, the network has a memory. This is the case for stateful LSTMs, but in this article we are going to try something different.

Overview

For this example, we will be using a Stateless LSTM, which means the network’s memory is reset for each batch. In this specific example, each batch is a single small window of audio samples, processed independently. Because no state is carried over from one batch to the next, the LSTM essentially behaves like a feed-forward network: there is no persistent recurrent state.

Why would we use a stateless LSTM for single batches of data? A stateless LSTM is still an effective approach for audio, and by setting the internal states to zero we reduce the complexity of the network and improve its speed. However, for our guitar amplifier example, the network still needs some information about the past signal to make an accurate prediction.

We can do this by adding 1-D convolutional layers prior to the stateless LSTM layer. In this example, we use two 1-D convolutional layers, followed by the LSTM, followed by a Dense (fully connected) layer. The input to the network is the current sample and a specified number of previous samples. The 1-D convolutional layers serve to abstract features from the audio as well as reduce the amount of data going to the LSTM layer which speeds up processing significantly.

Figure 1: Flow of audio data through network (Image by Author)

Note: The “Dense” layer performs the same function as the “Linear” layer from the previous article’s WaveNet implementation. This is simply a difference in naming conventions between Keras and PyTorch.

Keras Training

We will use the same 4 minutes of samples recorded from the Blackstar HT40 amplifier on the overdrive channel at 25% gain, as explained in the previous article.

Keras/Tensorflow was chosen for implementing the Stateless LSTM model. Keras is a high-level interface to Tensorflow, the A.I. framework developed by Google. The example code for the Keras training comes from the SmartAmpPro project on Github and is contained in the train.py file. Using a Sequential Keras model is fairly straightforward, as shown below.

Note: The SmartAmpPro project is a combination of the training and real-time code. For purely training code using the same model, see GuitarLSTM on Github.
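Below is a minimal sketch of what the Sequential model definition looks like. The layer sizes, kernel, and stride values here are illustrative placeholders rather than the exact SmartAmpPro defaults (those are defined in train.py), and plain MSE is used so the sketch stays self-contained.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, LSTM, Dense

# Illustrative hyperparameters; the actual defaults are defined in train.py
input_size = 120     # current sample plus 119 previous samples
conv1d_filters = 16  # filters per Conv1D layer (assumed value)
conv1d_kernel = 12   # kernel size (assumed value)
conv1d_stride = 4    # stride used to skip samples and speed up the convolution (assumed value)
hidden_units = 36    # size of the LSTM layer (assumed value)

model = Sequential()
model.add(Conv1D(conv1d_filters, conv1d_kernel, strides=conv1d_stride,
                 input_shape=(input_size, 1)))
model.add(Conv1D(conv1d_filters, conv1d_kernel, strides=conv1d_stride))
model.add(LSTM(hidden_units))  # stateless by default in Keras
model.add(Dense(1))            # one predicted audio sample per input window

# SmartAmpPro uses an error-to-signal style variation of MSE (see the paper);
# plain MSE is used here only to keep the sketch self-contained.
model.compile(optimizer='adam', loss='mse')
```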

A base model is created using model = Sequential(), and each layer is added in sequence to the model using .add(layer_type(layer_params,...)). In the above code, each layer’s parameters are set by previously defined variables. In Keras, the LSTM layer is stateless by default, so the only parameter needed is the number of hidden_units, which determines the size of the LSTM. The input_size defines how many audio samples are used to predict the current output sample. The default setting used later in the real-time code is 120, which was chosen based on testing for accuracy vs. processing speed. This means that the current audio sample and the previous 119 samples are used to predict what the current output sample’s value should be. The following figure shows how ranges of audio data are fed to the network for a given signal.

Figure 2: Example of how ranges of audio are fed to the network (the spacing between samples shown here is exaggerated) (Image by author)

Note: The Conv1D layers use the “stride” parameter which is used to skip over data points in the convolution. For a stride of 2, the network layer would skip every other data point for each convolution operation. This speeds up computation while retaining enough information to make an accurate prediction.

After initializing the Sequential model, the input audio data must be processed. For the case of input_size = 120, the audio data is sliced (using tf.gather) to get 120 samples for each existing sample of audio. Each batch of 120 audio samples is a new input to the network. The order of the input batches is randomized to improve training performance. This slicing operation is only done for the input .wav file, not the output .wav. If you have an input .wav containing 44100 samples (or 1 second of audio) with an input_size=120, then the result after slicing would be an array of shape:

(44100 - input_size + 1, 120) or (43981, 120)

The reduction in sample count occurs because for the first 119 samples of audio, we can’t look a full 120 samples into the past to make a prediction. But now, instead of 44100 single samples of audio, we have 43981 overlapping arrays of 120 samples each. The data loading and processing is shown below.
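As a rough sketch of this slicing and shuffling step (not the exact train.py code; in_data and out_data stand in for the decoded input and output .wav files):

```python
import numpy as np
import tensorflow as tf

input_size = 120

# Example stand-ins for the decoded .wav files (1 second of audio at 44.1 kHz)
in_data = np.random.uniform(-1, 1, 44100).astype(np.float32)
out_data = np.random.uniform(-1, 1, 44100).astype(np.float32)

# indices[i] = [i, i+1, ..., i+119]: one overlapping window per training example
indices = np.arange(input_size) + np.arange(len(in_data) - input_size + 1)[:, np.newaxis]

X = tf.gather(in_data, indices)            # shape (43981, 120)
X = tf.expand_dims(X, axis=-1)             # shape (43981, 120, 1) for the Conv1D input
y = out_data[input_size - 1:, np.newaxis]  # target: the output sample aligned with each window

# Shuffle windows and targets together to improve training
perm = np.random.permutation(X.shape[0])
X_random = tf.gather(X, perm)
y_random = tf.gather(y, perm)
```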

Note: A custom data loader can be used to slice 120 samples for each data input, rather than processing the whole wav file prior to training. This saves RAM usage while training. This is implemented in the Colab script in the SmartAmpPro project.

The training is kicked off by the model.fit() function. The randomized audio data (X_random and y_random) are inputs to the fit function, along with the number of epochs, the batch size, and how to split off data for validation.
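The call itself might look something like this (the epoch count, batch size, and split fraction are placeholders, not necessarily the SmartAmpPro defaults):

```python
model.fit(X_random, y_random,
          epochs=30,
          batch_size=64,
          validation_split=0.2)

model.save('model.h5')  # saved in HDF5 format for conversion later
```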

Training Results

The training for this particular LSTM implementation is very fast. The 1-D convolution layers (with default SmartAmpPro settings) reduce each 120-sample input down to 4 samples that are fed to the LSTM layer. Where training the previous WaveNet model took 24+ hours on a particular CPU, this model takes only about 3 minutes on the same CPU. However, the training is not as accurate as the WaveNet model. Larger layer sizes can be chosen to improve accuracy, but this causes problems when running in real-time.

Here are the results of a training session of the Blackstar HT40 over 30 epochs. A loss value of 0.11 was achieved. The loss function used here is a variation of MSE (mean-squared-error) defined in the amp emulation paper¹. A comparison plot of 8 milliseconds of data is shown here:

Figure 3: Predicted vs. actual signal for HT40 amplifier using stateless LSTM model

The amplitude of the signal does not match up as closely as the WaveNet model, but the main features are still learned. The difference in predicted vs. actual audio is compared below:

Actual Blackstar HT40 amp (Overdrive channel, 25% gain):

Predicted by Keras model:

Model Conversion

Before using the trained model in real-time, the model must be converted into a suitable format for loading into the plugin. The format chosen here is “json” for readability and general acceptance in the computing world. Keras uses “.h5” format for model state data, which is the HDF5 compressed data format. Python code within the SmartAmpPro “train.py” script is used to perform this conversion. Prior to conversion, the additional “input_size” and “stride” parameters are added to the .h5 model file. These parameters will be needed in the real-time code.
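A minimal sketch of what such a conversion could look like is shown below. This is not the exact SmartAmpPro script: the json key names are illustrative, and for simplicity the input_size and stride values (carried over from the training sketch) are written straight into the json rather than added to the .h5 file first.

```python
import json
from tensorflow.keras.models import load_model

# compile=False avoids having to re-register the custom loss function
model = load_model('model.h5', compile=False)

state = {'input_size': 120, 'stride': 4, 'layers': []}  # extra parameters needed by the plugin
for layer in model.layers:
    state['layers'].append({
        'type': layer.__class__.__name__,
        'weights': [w.tolist() for w in layer.get_weights()],  # numpy arrays -> json-friendly lists
    })

with open('model.json', 'w') as f:
    json.dump(state, f)
```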

Real-Time Implementation

The example code for the real-time implementation is also from SmartAmpPro on Github. The code uses NumCpp for matrix calculations and json for loading the converted json models. NumCpp is a header-only c++ implementation of the Numpy Python library.

The real-time audio plugin uses the JUCE framework, which is a cross platform c++ framework for creating audio applications. The basic goal here is to recreate the forward pass through the Keras Sequential model in high performance c++ code. In order to convert the code to c++, I wrote an intermediate Python script to make sure I understood the underlying layer calculations. Pytorch and Tensorflow use slightly different methods of processing layers, so it’s critical that the real-time application processes the layers in the exact same way as the training code.

The model data (state parameters) from the converted json model is loaded and set in the “ModelLoader” class. An example of a trained json model can be viewed on Github. The data from ModelLoader is then used to instantiate the “lstm” class, which also contains the 1-D convolutional layers and dense layer.

Here is the main processing method of the lstm class. If you are familiar with JUCE, this is what you would call in the processBlock() method of the PluginProcessor:

We have to perform the same audio slicing as in the training code, which takes some careful handling of the audio buffers, since information from a previous buffer (or block) is needed to predict samples in the current buffer. The buffer size is checked with check_buffer(numSamples); (there’s probably a better way to handle this, but if the user changes the buffer size the lstm class needs to know). The set_data method is then called to arrange the audio for input to the LSTM inference code. Let’s look at what this is doing:
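The original method is written in C++ with NumCpp; the NumPy sketch below follows the same steps, with variable names taken from the description that follows.

```python
import numpy as np

input_size = 120  # must match the input_size stored in the json model

def set_data(buffer, old_buffer):
    """Arrange one block of audio for inference (NumPy sketch of the C++ set_data logic).

    old_buffer can be initialized to zeros of length input_size - 1 for the first block.
    """
    # The end of the previous buffer is placed at the beginning of the temporary buffer,
    # followed by the current buffer's data
    new_buffer = np.concatenate((old_buffer[-(input_size - 1):], buffer))

    # Slicing: row i of `data` holds sample i of the current block plus its previous 119 samples
    data = np.stack([new_buffer[i:i + input_size] for i in range(len(buffer))])

    # The current buffer becomes old_buffer for the next audio block
    return data, buffer.copy()
```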

In the above code, the end of the previous buffer is set at the beginning of the temporary buffer, new_buffer. Then the current buffer’s data is assigned to the end of new_buffer. The slicing operation is performed, taking each audio sample together with its previous input_size samples and appending them as a row of the 2-D array data. The old_buffer is set equal to the current buffer to be used for the next audio block. Now, each input array from data can be fed to the first 1-D convolutional layer.

The 1-D convolution was the most complicated layer. There may be a simpler and more efficient way to do the calculations than what is written here. The NumCpp library is used for all matrix calculations, and the main data type used here is the nc::NdArray<float>.

This is the zero-pad function, which adds zeros to the input data to control the output shape:
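In NumPy terms, the idea reduces to something like this (the original operates on nc::NdArray<float>, and the exact padding amounts depend on the layer settings):

```python
import numpy as np

def zero_pad(x, pad_left, pad_right):
    """Pad the time axis of a (time, channels) array with zeros on both sides."""
    channels = x.shape[1]
    return np.concatenate((np.zeros((pad_left, channels), dtype=x.dtype),
                           x,
                           np.zeros((pad_right, channels), dtype=x.dtype)))
```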

This is the unfold function, which slices the data in the same way as the audio processing from earlier. These sliced arrays are used to perform the convolution calculations.
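A NumPy sketch of the unfold step (in the real code the kernel size and stride come from the converted json model):

```python
import numpy as np

def unfold(x, kernel_size, stride):
    """Slice overlapping frames of length kernel_size along the time axis, stepping by stride."""
    num_frames = (x.shape[0] - kernel_size) // stride + 1
    return np.stack([x[i * stride:i * stride + kernel_size] for i in range(num_frames)])
```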

The previous two methods are called from the main conv1d_layer method, which takes the unfolded matrix and performs a tensordot (or einsum) operation on the matrices to finish the convolution.
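Using the zero_pad and unfold sketches above, the convolution itself can be written as a single einsum, which is equivalent to a tensordot over the kernel and channel axes. The bias handling and weight layout here are assumptions, not the exact SmartAmpPro code.

```python
import numpy as np

def conv1d_layer(x, kernel, bias, stride, pad_left=0, pad_right=0):
    """1-D convolution via zero_pad + unfold + einsum.

    x      : (time, in_channels) input
    kernel : (kernel_size, in_channels, out_channels) trained weights (Keras layout)
    bias   : (out_channels,) trained bias
    """
    x = zero_pad(x, pad_left, pad_right)
    frames = unfold(x, kernel.shape[0], stride)             # (frames, kernel_size, in_channels)
    # Contract the kernel_size and in_channels axes against the kernel
    return np.einsum('fki,kio->fo', frames, kernel) + bias  # (frames, out_channels)
```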

Note: The second Conv1d layer is mostly the same as the first, but it handles the input differently. It is left out here for brevity, but can be viewed in full on Github.

After processing the first two Conv1d layers, the LSTM layer is processed. This is a simplified version of the algorithm presented in the guitar emulation paper¹. It is simplified, because for a stateless LSTM, the initial cell state and hidden state can be set to zero.

First, a dot product is performed between the output from the Conv1d layers and the trained LSTM weights from the .json file. The bias is also added here.

gates = nc::dot(conv1d_1_out, W) + bias;

The current hidden state is calculated within the for loop, where the number of loops is determined by the LSTM’s hidden_size HS. The full set of equations for a stateful LSTM can be seen in the PyTorch documentation. For the stateless implementation, the previous hidden state h(t-1) can be set to zero, and the cell state does not need to be stored, since it is not used for a subsequent LSTM step. The stateless LSTM implementation is shown here:
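A NumPy sketch of this stateless step is shown below. The C++ version loops over the hidden units; here the gates are computed with vector slices instead. The gate ordering (input, forget, cell, output) follows the Keras/PyTorch convention and is an assumption about how the stored weights are packed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stateless_lstm(x, W, bias, HS):
    """One stateless LSTM step with h(t-1) = 0 and c(t-1) = 0.

    x    : input vector coming from the Conv1D layers
    W    : (len(x), 4 * HS) packed input weights for the four gates
    bias : (4 * HS,) packed bias
    """
    gates = np.dot(x, W) + bias
    i = sigmoid(gates[0:HS])            # input gate
    # The forget gate multiplies c(t-1), which is zero here, so it is skipped entirely
    g = np.tanh(gates[2 * HS:3 * HS])   # candidate cell values
    o = sigmoid(gates[3 * HS:4 * HS])   # output gate
    c = i * g                           # cell state with c(t-1) = 0
    return o * np.tanh(c)               # hidden state, passed on to the Dense layer
```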

Finally, the output of the LSTM layer is fed to the Dense layer, which is simply a dot product of the input and the trained weights from the .json file plus the bias vector. The output from this layer is a single sample of predicted audio. Repeat 44,100 times for the next second of sound. Whew!
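As a sketch, that final step is just a dot product (here W_dense is the trained Dense kernel flattened to a vector and b_dense the scalar bias, both loaded from the json model):

```python
import numpy as np

def dense_layer(h, W_dense, b_dense):
    """Final Dense layer: maps the LSTM hidden state to one predicted audio sample."""
    return float(np.dot(h, W_dense) + b_dense)
```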

Real-Time Performance

Generally speaking, the above algorithm runs faster than the previous WaveNet implementation. However, if you increase the layer sizes beyond the SmartAmpPro defaults, it quickly becomes too slow to run in real-time, especially in the Conv1d layers. An input_size larger than 120 will also slow down processing.

Because the model only takes into account the previous 119 samples (about 2.7 milliseconds at 44.1 kHz), the predicted signal can sometimes take wild swings in the wrong direction but generally corrects itself quickly. When compared with plots of the WaveNet output, the signal can be much shakier, and this can have a buzzing/noise effect on the sound. In my opinion, the WaveNet has a smoother sound. The smoothness of the sound can be improved with a higher number of epochs. It should be noted that the sound can be very dry, so additional effects like reverb should also be used for a more natural sound.

The training speed is the biggest advantage of the Stateless LSTM, and for an “easy” to train sound like the TS-9 Tubescreamer pedal, a high accuracy can be obtained in less than 3 minutes on a CPU. Here is a two-track recording using the SmartAmpPro real-time plugin (plus additional reverb). This demo shows how the same underlying model can be used for two different guitar sounds (clean and overdriven).

In the next article we will investigate using a stateful LSTM, to see if we can improve training accuracy and real-time performance.

Thank you for reading!

  1. Alec Wright et al., “Real-Time Guitar Amplifier Emulation with Deep Learning” Applied Sciences 10, no. 3 (2020): 766.
