
Neural Networks for Real-Time Audio: WaveNet

Keith Bloemer
Towards Data Science
May 3, 2021


This is the second of a five-part series on using neural networks for real-time audio. For the previous Introduction article, click here.

In this article we will model a guitar amplifier using WaveNet in real-time.

WaveNet was developed by DeepMind and presented in the 2016 paper WaveNet: A Generative Model for Raw Audio¹, which explains how the model can be used to generate audio such as realistic human speech. It is a feed-forward neural network, meaning information only moves forward through the network and does not loop back as it does in RNNs (recurrent neural networks).

Overview

The specific implementation of WaveNet used here is defined in the paper Real-Time Guitar Amplifier Emulation with Deep Learning². In this paper, dilated convolutional layers are used to capture the dynamic response of guitar amplifiers. The dilated layers serve to increase the receptive field of the network, allowing it to reach farther back in time to predict the current audio sample. The more the network knows about the signal in the past, the better it can predict the value of the next sample.
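As a quick illustration of how dilation grows the receptive field, consider a stack with kernel size 3 and dilation rates doubling from 1 to 512 (these exact sizes are only for illustration; the model parameters used here are discussed later):

    # Each dilated layer adds (kernel_size - 1) * dilation samples of history.
    kernel_size = 3                              # illustrative value
    dilations = [2 ** d for d in range(10)]      # 1, 2, 4, ..., 512

    receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)
    print(receptive_field)                       # 2047 samples
    print(1000 * receptive_field / 44100)        # ~46 ms of context at 44.1 kHz

Stacking just ten small convolutions lets the network see roughly 46 milliseconds into the past, far more than the same layers would reach without dilation.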

Audio data is one-dimensional: a numerical value (amplitude) that varies over time. A visual example of a 40-second sound file (.wav format) is shown here:

Figure 1: Plot of 40 seconds audio from an electric guitar

If you zoom into the above plot, you can clearly see how the signal varies with time. Shown below is approximately 1/125 of a second (8 milliseconds) of the same audio.

Figure 2: Plot of 8 milliseconds audio from an electric guitar
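Plots like the two above take only a few lines of Python to produce (the filename guitar_input.wav and the 10-second offset are just examples, assuming a mono 44.1 kHz recording):

    import matplotlib.pyplot as plt
    from scipy.io import wavfile

    # Load the recording: sample_rate in Hz, data as a 1-D array for mono audio
    sample_rate, data = wavfile.read("guitar_input.wav")

    # Full-length view of the signal
    plt.plot(data)
    plt.show()

    # Zoom in on roughly 8 milliseconds (about 353 samples at 44.1 kHz)
    start = 10 * sample_rate
    plt.plot(data[start : start + int(0.008 * sample_rate)])
    plt.show()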

And here is the audio of the above guitar recording:

This is the direct audio signal from a Fender Telecaster electric guitar. Not too exciting, is it? But if we send that same signal through a Blackstar HT40 tube amplifier in overdrive it becomes a bit more interesting:

In the following section we will attempt to create a real-time model of the HT40 amp at a particular setting using WaveNet. The amp is set at 25% gain on the overdrive channel with neutral EQ. The recording was made using an SM57 microphone, 1 centimeter from the speaker grill at approximately mid-cone.

Blackstar HT40 amp mic’d with an SM57 (Image by author)

Note that both the mic and the speaker/cabinet modify the audio signal from the amp’s electronics, so this setup may not be ideal for modeling the amp as accurately as possible. But given that this amp was rented, I imagine they preferred that I didn’t go opening the thing up (and neither should you unless you know what you’re doing! Vacuum tube electronics operate at dangerously high voltages).

PyTorch Training

The example code for the PyTorch training comes from PedalNetRT on GitHub and is written in Python. Note that PedalNetRT uses PyTorch Lightning, a helpful wrapper around PyTorch.

For training the neural network, we will examine the code from the “model.py” Python file, which defines the WaveNet class implementation. The entire structure is composed of 1-D convolutional layers. The base WaveNet class, which inherits from PyTorch’s nn.Module, is shown below:
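The version below is a condensed sketch that captures the structure described in this section; it is not a line-for-line copy of PedalNetRT’s model.py, parameter names and defaults are illustrative, and it relies on the CausalConv1d and _conv_stack helpers sketched further down:

    import torch
    import torch.nn as nn

    class WaveNet(nn.Module):
        def __init__(self, num_channels=16, dilation_depth=10, num_repeat=1, kernel_size=3):
            super().__init__()
            dilations = [2 ** d for d in range(dilation_depth)] * num_repeat
            self.num_channels = num_channels

            # Expand the single audio channel to num_channels
            self.input_layer = CausalConv1d(1, num_channels, kernel_size=1)

            # Hidden layers output 2 * num_channels so the result can be split
            # into a tanh half and a sigmoid half for the gated activation
            self.hidden_layer = _conv_stack(dilations, num_channels, 2 * num_channels, kernel_size)
            self.residuals = _conv_stack(dilations, num_channels, num_channels, 1)

            # Reduce back down to a single channel of audio
            self.linear_mix = nn.Conv1d(num_channels, 1, kernel_size=1)

        def forward(self, x):
            out = self.input_layer(x)
            for hidden, residual in zip(self.hidden_layer, self.residuals):
                layer_in = out
                # Gated activation: tanh(filter) * sigmoid(gate)
                filt, gate = torch.split(hidden(layer_in), self.num_channels, dim=1)
                out = torch.tanh(filt) * torch.sigmoid(gate)
                # Residual connection: add the layer input back to the output
                out = residual(out) + layer_in
            return self.linear_mix(out)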

The forward() function is where the audio processing occurs. A gated activation is used here, as defined in the guitar amp emulation paper². The output of the gated activation is fed into the self.residuals layer stack. The input sample x is added back to the output, which is fed through the self.linear_mix layer to reduce the output to a single sample of audio.

The network receives the input audio samples, then abstracts features of sound in the internal layers, and outputs a single predicted audio sample. Remember, this is happening 44,100 times per second of data! The predicted sample is compared with the known audio (from the HT40 amp recording) and PyTorch determines how to adjust the network values to get closer to the target. Each pass through the 4 minutes of training data is one “epoch”. Typical training sessions can last 1500 epochs or more before an acceptable loss is reached.
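PedalNetRT drives this loop through PyTorch Lightning, but stripped down to plain PyTorch the per-batch step is conceptually just the following (the learning rate, loss choice, and train_loader are stand-ins, not the project’s actual settings):

    import torch

    model = WaveNet()                             # the class sketched above
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # illustrative learning rate
    loss_fn = torch.nn.MSELoss()                  # the project uses an error-to-signal variant

    for epoch in range(1500):
        # train_loader yields (input, target) batches cut from the recording (not shown);
        # x: dry guitar input, y: HT40 target, both shaped (batch, 1, samples)
        for x, y in train_loader:
            y_pred = model(x)
            loss = loss_fn(y_pred, y)             # compare prediction to the amp recording
            optimizer.zero_grad()
            loss.backward()                       # PyTorch works out how to adjust the weights
            optimizer.step()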

The WaveNet class is initialized with several parameters defining the size and structure of the neural net. The basic structure consists of self.input_layer, self.hidden_layer / self.residuals (both of which are stacks of convolutional layers), and self.linear_mix (the output layer). You might have noticed that a custom CausalConv1d class is used rather than PyTorch’s built-in nn.Conv1d. This is because causal convolution in PyTorch must be done manually (at the time of writing). Causal convolution uses zero padding on only one side of the data, rather than “same” padding, which pads both sides equally.

Note: Zero-padding is used to control the output size of a convolutional layer by adding extra zeros to a particular dimension.

The custom CausalConv1d class is essentially a wrapper for nn.Conv1d that zero-pads the input data only on the left-hand side. The CausalConv1d class implementation is shown here:
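A sketch of such a wrapper is shown below; it is close in spirit to the PedalNetRT version, though the details may not match exactly:

    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConv1d(nn.Module):
        """1-D convolution padded only on the left, so the output at time t
        never depends on samples after time t."""

        def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
            super().__init__()
            # Left padding needed to keep the output the same length as the input
            self.left_padding = dilation * (kernel_size - 1)
            self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation)

        def forward(self, x):
            # Zero-pad only the left-hand side of the time dimension
            x = F.pad(x, (self.left_padding, 0))
            return self.conv(x)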

The last piece in setting up the base WaveNet class is the _conv_stack function, which stacks the desired number of CausalConv1d layers. The number of layers in the stack is defined by the integer dilations. For dilations=4, you get a stack of four layers with dilation rates “1, 2, 4, 8”. For dilations=8, you get a stack of eight layers with dilation rates “1, 2, 4, 8, 16, 32, 64, 128”. The parameter num_repeat repeats the dilation layers. For example, for dilations=8 and num_repeat=2, you get

“1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128”

for a total of 16 hidden convolutional layers. The _conv_stack function is shown below:
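Again as a sketch rather than the verbatim file, _conv_stack can be as simple as building one CausalConv1d per dilation rate:

    import torch.nn as nn

    def _conv_stack(dilations, in_channels, out_channels, kernel_size):
        # One causal convolution per dilation rate, held in a ModuleList
        return nn.ModuleList(
            [
                CausalConv1d(in_channels, out_channels, kernel_size, dilation=d)
                for d in dilations
            ]
        )

    # dilations=8, num_repeat=2  ->  1, 2, 4, ..., 128, then the same sequence again
    dilations = [2 ** d for d in range(8)] * 2
    stack = _conv_stack(dilations, in_channels=16, out_channels=32, kernel_size=3)
    print(len(stack))   # 16 hidden convolutional layers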

As mentioned earlier, typical training sessions with WaveNet can last 1500 epochs before we get a good match to the target. More epochs may result in higher accuracy, depending on the complexity of the target signal. The general rule is that the more distorted the signal (high gain, for example), the more difficult it is to train. Choosing larger model parameters can improve training accuracy, at the cost of processing time. The trade-off between training speed and real-time performance is important here, and it differs for each target device.

It is also important to note that in the case of guitar pedals, this neural net method does not work for time-based effects such as delay / reverb / chorus / flange. It is effective on distortion/overdrive (i.e. effects where the dynamic response is less than about 50 milliseconds).

Training Results

Here are the results of a 1500-epoch training session on the HT40 recording using PedalNetRT. The loss value went down to approximately 0.02. The loss function used here is a variation of MSE (mean squared error) defined in the amp emulation paper². A comparison plot of 8 milliseconds of data is shown here:

Figure 3: Comparison between actual signal from HT40 amp and the predicted signal

And this is the predicted audio (compare to actual HT40 audio from earlier):

At a loss of 0.02, the model’s output is very close to the original recording. From the SoundCloud mp3 samples you’d be hard-pressed to tell which is which. With a trained ear, high-quality studio monitors, and the original .wav files, one could probably tell the difference, but I’d consider a 0.02 loss a successful capture of the HT40 amp. Based on other tests, a 0.05 loss sounds good but is perceptually different, and a 0.10 loss is noticeably different but can sound close with extra EQ processing. A loss of 0.2 or higher can still be a fun sound to play, but based on experience I would consider that an unsuccessful capture.
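For reference, the loss from the paper is an error-to-signal ratio (ESR): the squared error normalized by the energy of the target, typically computed after a first-order pre-emphasis filter that weights the brighter part of the spectrum. A sketch (the 0.95 coefficient is a common choice, not necessarily the exact value used here):

    import torch

    def pre_emphasis(x, coeff=0.95):
        # First-order high-pass filter: y[n] = x[n] - coeff * x[n-1]
        return x[..., 1:] - coeff * x[..., :-1]

    def error_to_signal(y_pred, y_target, eps=1e-10):
        y_pred, y_target = pre_emphasis(y_pred), pre_emphasis(y_target)
        # Squared error normalized by the energy of the target signal
        return torch.sum((y_target - y_pred) ** 2) / (torch.sum(y_target ** 2) + eps)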

The training time depends on the hardware, model size, and audio recording length. On an Nvidia Quadro 2000 GPU, a run of 1500 epochs takes about 4 hours. On a mid-priced laptop with a built-in Nvidia graphics card, the same training takes about 8 hours. With CPU-only training, you’re looking at 24+ hours of training time.

Model Conversion

Before using the trained model in real-time, it must be converted into a suitable format for loading into the plugin. The format chosen here is json, for its readability and general acceptance in the computing world. PyTorch uses its own “.pt” format (“.ckpt” for PyTorch Lightning). A script called “export.py” is used in PedalNetRT to perform this conversion and arrange the data in a way that WaveNetVA understands.
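Conceptually, the conversion just pulls each weight tensor out of the checkpoint’s state_dict and writes it to json. The snippet below is a simplified illustration of that idea, not the actual export.py, and the json layout shown is not necessarily the exact one WaveNetVA expects:

    import json
    import torch

    # Load the Lightning checkpoint and grab the raw weight tensors
    checkpoint = torch.load("model.ckpt", map_location="cpu")
    state_dict = checkpoint["state_dict"]

    # Flatten every tensor into a plain Python list so it can be serialized
    model_data = {
        "layers": [
            {"name": name, "data": tensor.flatten().tolist()}
            for name, tensor in state_dict.items()
        ]
    }

    with open("converted_model.json", "w") as f:
        json.dump(model_data, f)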

Real-Time Implementation

The example code for the real-time C++ comes from WaveNetVA on GitHub (also implemented in SmartGuitarAmp and SmartGuitarPedal). The code uses Eigen for matrix calculations.

The real-time plugin uses the JUCE framework, a cross-platform C++ framework for creating audio applications. The basic goal is to recreate the forward() function from the PyTorch WaveNet class in high-performance C++ code. We won’t cover all of the code here, but I will touch on the key points.

The model data (state parameters) from the converted json model are loaded and set in the WaveNet class. The setParams method is shown below, which sets up a WaveNet instance based on the json data. An example of a trained json model can be viewed on Github.

The inputLayer, outputLayer, and convStack (internal layers) are defined here. Each layer has settings for number of channels, filter width, dilations, and activation functions.

The main processing method of the WaveNet class is shown here, which takes the input audio buffer inputData, processes it through each layer (inputLayer, convStack, outputLayer), and then copies the processed audio to the output buffer, outputData.

The audio data flows in real-time (via audio buffers, or “blocks”) through each layer, but the actual convolution calculation is performed in the processSingleSample method in “Convolution.cpp”, which processes a single sample of audio:

This method uses the layer settings from the json model to determine how the data is convolved, or how the data flows through the layer. At its most basic, this is simply a multiplication by the kernel (the trained neural net values):

outVec = outVec + *(fifo+readPos) * (*it);

and an addition of the bias vector:

outVec = outVec + bias

The complicated part is determining how to index each array so that the convolutional layers are calculated correctly. I’ll save that for another day!
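To make the idea concrete without wading through the C++ indexing, here is the same per-sample pattern sketched in Python (a single-channel simplification, not the plugin’s actual code): a FIFO of past inputs is shifted each sample, and the output is a kernel-weighted sum of samples spaced dilation steps apart, plus the bias.

    import numpy as np

    class CausalDilatedTap:
        """Per-sample causal dilated convolution for one channel (illustration only)."""

        def __init__(self, kernel, bias, dilation):
            self.kernel = np.asarray(kernel)      # trained weights, oldest tap first
            self.bias = bias
            self.dilation = dilation
            # FIFO long enough to reach back to the oldest tap
            self.fifo = np.zeros((len(kernel) - 1) * dilation + 1)

        def process_single_sample(self, x):
            # Shift the FIFO and store the newest input sample at the front
            self.fifo = np.roll(self.fifo, 1)
            self.fifo[0] = x
            # Weighted sum over taps spaced `dilation` samples apart, plus the bias
            out = self.bias
            for i, w in enumerate(self.kernel[::-1]):
                out += w * self.fifo[i * self.dilation]
            return out

The plugin does this for every channel of every layer, 44,100 times per second, which is where most of the CPU load discussed below comes from.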

Once the current audio buffer is processed (anywhere from 16 to 4096 samples, in powers of 2, depending on the audio device settings), it is converted back to analog and sent out of your speakers or headphones. Luckily for us, the JUCE framework handles all of that.

Real-Time Performance

This WaveNet implementation is fully capable of running on any modern computer in real-time. However, compared to audio DSP software using traditional modeling, it has much higher CPU usage. The same quality of modeling (in the case of guitar amplifiers) can be achieved using circuit analysis at much faster processing speeds. It could be argued that the ability to model these complex systems with only audio samples and no domain expertise is a good trade-off.

Here is a real-time demo video of the SmartAmp plugin using WaveNet to model a small tube amplifier (Fender Blues Jr.) and other amps.

My perception of WaveNet as a guitarist is that it has a very natural sound when compared with the target amp/pedal. It seems to have trouble with high gain (as in metal music) but can handle mid-gain and clean tones accurately. A roll-off of low bass tones has been observed in the models as well. Changing the size of the model can have a big impact on the sound, as well as on its ability to run smoothly in real time.

In the next article we will investigate using a Stateless LSTM to see if we can improve CPU usage and training time while maintaining high-quality sound. Continue reading here:

Thank you for reading!

  1. Aäron van den Oord et al., “WaveNet: A Generative Model for Raw Audio,” arXiv preprint arXiv:1609.03499, 2016.
  2. Alec Wright et al., “Real-Time Guitar Amplifier Emulation with Deep Learning,” Applied Sciences 10, no. 3 (2020): 766.
