In this story I want to advance your current understanding of neural upsamplers in the context of audio synthesis, and provide a simple Subpixel1D Keras layer implementation that you can use as a drop-in replacement for many of the tasks we discuss today.
We all know that up- and downsampling are important operations in deep learning for computer vision, e.g., in tasks like image super-resolution or image generation. The same holds true for audio synthesis with popular architectures like GANs, U-Nets or autoencoders. While downsampling is a relatively simple operation, it has always been difficult to find an upsampling strategy that doesn't produce image or audio artifacts. For a primer on 2-dimensional checkerboard artifacts in computer vision tasks, read this great post [1].
Now let us dive deeper into 1-dimensional audio upsampling. In the audio domain we use three main upsampling techniques [2]:
- Transposed convolutions (widely used)
- Interpolation + convolution (often used)
- Subpixel convolutions (rarely used but prominent in vision tasks)
Examples of their usage can be found in many publications, such as Demucs (music source separation) [3], MelGAN (waveform synthesis) [4], SEGAN (speech enhancement) [5], Conv-TasNet (speech separation) [6] or Wave-U-Net (source separation) [7].
TensorFlow Keras provides a fourth upsampling option, the UpSampling1D layer. However, as of this writing (March 2021) this layer is still outrageously slow on GPU, even though the corresponding issue has been closed.
Transposed Convolutions
Transposed convolutions reverse the spatial transformation of a convolution, and Keras provides them as 1D, 2D and 3D implementations. You can use the Conv1DTranspose layer just like its Conv1D counterpart; by passing strides=2 the time dimension of the resulting tensor is doubled, as depicted in Figure 1. In this case, the upsampling is governed by the weights of the convolution, which are learned during training.
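As a minimal sketch (the filter count and kernel size here are just placeholders), this is how a strided Conv1DTranspose doubles the time dimension:

import tensorflow as tf

# upsample a (batch, 8192, 64) tensor to (batch, 16384, 32)
inputs = tf.keras.Input(shape=(8192, 64))
x = tf.keras.layers.Conv1DTranspose(
    filters=32,      # number of output channels
    kernel_size=4,   # a multiple of the stride, so the filters fully overlap
    strides=2,       # doubles the time dimension
    padding='same')(inputs)
print(x.shape)  # (None, 16384, 32)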
Interpolation
In comparison, interpolation itself has no learnable parameters, which is why such an operation should be followed by a convolutional layer; otherwise we would only upsample the high-level features of some latent space. Computational performance can vary a lot depending on the interpolation scheme, and so can the results when deployed in a neural net. Later we will see why the other approaches are favorable, but here is a simple way to implement the interpolation operation in 1D:
x = tf.image.resize(x, [samples * stride, 1], method='nearest')
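Note that tf.image.resize expects an image-like tensor, so for raw audio of shape (batch, time, channels) we need to add and remove a dummy width dimension. Here is a hedged sketch (the helper name and the layer sizes are illustrative, not from a reference implementation):

import tensorflow as tf

def upsample_interp(x, stride=2):
    # nearest-neighbor upsampling along the time axis
    # x: (batch, time, channels) -> (batch, time * stride, channels)
    samples = tf.shape(x)[1]
    x = tf.expand_dims(x, axis=2)                                    # (batch, time, 1, channels)
    x = tf.image.resize(x, [samples * stride, 1], method='nearest')
    return tf.squeeze(x, axis=2)                                     # (batch, time * stride, channels)

# interpolation has no learnable parameters, so follow it with a convolution
inputs = tf.keras.Input(shape=(8192, 64))
x = tf.keras.layers.Lambda(upsample_interp, arguments={'stride': 2})(inputs)
x = tf.keras.layers.Conv1D(64, kernel_size=9, padding='same')(x)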
Subpixel
This is a cool one! The idea is to carry out the upsampling in the channel dimension instead of the time dimension. After increasing the number of channels with a convolution, e.g. doubling it for an upsampling factor of two, we apply a reshape operation that periodically shuffles channels into the time dimension. This is better shown than explained, so please have a look at Figure 3. There are different implementations of the Subpixel layer out there, which also vary in computational complexity.
In the following sections we will have a closer look at the implementation of a Subpixel1D layer, but let us first talk about the pros and cons of the three methods presented so far.
Upsampling Artifacts
All of the above-mentioned upsampling methods will introduce artifacts into a neural audio synthesis model; in theory, however, your model can learn to minimize these artifacts. In a recent preprint, Jordi Pons et al. [2] describe how neural upsamplers introduce tonal and filtering artifacts. If you want to dive deeper into the subject, definitely give it a read, it is highly recommended! To rank the discussed methods for practical usage, here is a summary:
- Transposed convolution: partially overlapping filters lead to stronger tonal artifacts, so you should parameterize filter length and stride for either full overlap or no overlap (see the sketch after this list)
- To avoid spectral replicas, don't use ReLU activations and remove any biases from convolutional layers.
- Interpolation: use nearest neighbor interpolation instead of linear interpolation. Interpolation methods induce filtering artifacts.
- Subpixel and transposed convolutions show roughly 25% faster training in a Demucs-like architecture. They also achieve the best signal-to-distortion ratio (SDR) scores.
- Filtering artifacts are perceptually less annoying than tonal artifacts
- Currently, learning from data is the only way to overcome tonal artifacts, especially the kind that is introduced through random weight initialization
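To make the first point concrete, here is a small sketch of how kernel size and stride of Conv1DTranspose relate to overlap (the filter counts are arbitrary; use_bias=False reflects the bias-removal advice above):

import tensorflow as tf

# full overlap: kernel_size is an integer multiple of the stride
full_overlap = tf.keras.layers.Conv1DTranspose(
    64, kernel_size=4, strides=2, padding='same', use_bias=False)

# no overlap: kernel_size equals the stride
no_overlap = tf.keras.layers.Conv1DTranspose(
    64, kernel_size=2, strides=2, padding='same', use_bias=False)

# partial overlap (avoid): kernel_size=3 with strides=2 leads to stronger tonal artifacts
partial_overlap = tf.keras.layers.Conv1DTranspose(
    64, kernel_size=3, strides=2, padding='same')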
The analysis suggests that transposed convolutions and subpixel CNNs are the way to go in neural audio synthesis: they achieve better SDR scores and are computationally more efficient than interpolation methods. However, interpolation models seem to generalize better to unseen data, introducing only filtering artifacts, which are perceptually less annoying.
Implementing Subpixel1D
This implementation of Subpixel1D uses the tf.batch_to_space() function to perform the periodic shuffle. We first permute the dimensions so that the channel dimension comes first; after applying the batch_to_space operation we only need to permute the dimensions back in place to obtain our upsampled tensor. We assume the input to the Subpixel1D layer has already passed through a convolutional layer that increased the channel dimension appropriately. If the channel dimension is not evenly divisible by the upsampling factor r, an error is raised.
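A minimal sketch of such a layer, following the description above (the class and argument names are my own choices), might look like this:

import tensorflow as tf

class Subpixel1D(tf.keras.layers.Layer):
    """Periodic shuffle: moves blocks of r channels into the time dimension.

    Input:  (batch, time, channels), with channels divisible by r
    Output: (batch, time * r, channels // r)
    """

    def __init__(self, r=2, **kwargs):
        super().__init__(**kwargs)
        self.r = r

    def build(self, input_shape):
        channels = input_shape[-1]
        if channels is not None and channels % self.r != 0:
            raise ValueError(
                f'Channel dimension ({channels}) must be divisible by r ({self.r}).')

    def call(self, inputs):
        # (batch, time, channels) -> (channels, time, batch)
        x = tf.transpose(inputs, [2, 1, 0])
        # shuffle blocks of r channels into the time dimension:
        # (channels, time, batch) -> (channels // r, time * r, batch)
        x = tf.batch_to_space(x, block_shape=[self.r], crops=[[0, 0]])
        # (channels // r, time * r, batch) -> (batch, time * r, channels // r)
        return tf.transpose(x, [2, 1, 0])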
Now let us put this layer into action by implementing a simple auto-encoder for raw audio data. We assume the input to our model is 16384 samples long, so we can easily downsample and upsample by a factor of four at each stage.
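Here is an illustrative sketch of such a model (filter counts, kernel sizes and activations are placeholders, not taken from a reference implementation): strided Conv1D layers downsample by a factor of four in the encoder, and Conv1D + Subpixel1D pairs upsample by the same factor in the decoder.

# continuing from the Subpixel1D sketch above
def build_autoencoder(input_length=16384, r=4):
    inputs = tf.keras.Input(shape=(input_length, 1))

    # encoder: strided convolutions downsample the time dimension by r per layer
    x = tf.keras.layers.Conv1D(32, kernel_size=2 * r, strides=r,
                               padding='same', activation='tanh')(inputs)
    x = tf.keras.layers.Conv1D(64, kernel_size=2 * r, strides=r,
                               padding='same', activation='tanh')(x)   # latent: (1024, 64)

    # decoder: convolutions grow the channel dimension, Subpixel1D shuffles it back into time
    x = tf.keras.layers.Conv1D(32 * r, kernel_size=2 * r,
                               padding='same', activation='tanh')(x)
    x = Subpixel1D(r=r)(x)                                             # (4096, 32)
    x = tf.keras.layers.Conv1D(1 * r, kernel_size=2 * r, padding='same')(x)
    outputs = Subpixel1D(r=r)(x)                                       # (16384, 1)

    return tf.keras.Model(inputs, outputs)

model = build_autoencoder()
model.summary()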
Finally, we can put our neural upsampler to work and encode arbitrary audio data into the latent space of our auto-encoder. Of course, you can replace this model sketch with any of the models mentioned in the publications above. Looking at the model summary in Figure 4, we see that the proposed architecture uses convolutional layers only. This is great since it keeps our model simple and performant; your training times should improve significantly if you were using interpolation methods before!
Congratulations
You are now prepared to build state-of-the-art neural upsamplers for audio synthesis tasks (and any other time-series data, for that matter). By deciding which attributes of the presented upsampling methods are most desirable for your project, you should be able to avoid the pitfalls of perceptually annoying artifacts. In addition, we have implemented a Subpixel1D layer which provides the best computational performance in a CNN architecture while showing the same behavior as a transposed convolution.
Resources
[1] A. Odena et al., Deconvolution and Checkerboard Artifacts (2016), http://distill.pub/2016/deconv-checkerboard
[2] J. Pons et al., Upsampling Artifacts in Neural Audio Synthesis (2021), https://arxiv.org/pdf/2010.14356.pdf
[3] A. Défossez et al., Music Source Separation in the Waveform Domain (2019), https://hal.archives-ouvertes.fr/hal-02379796/document
[4] K. Kumar et al., MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (2019), https://arxiv.org/pdf/1910.06711.pdf
[5] S. Pascual et al., SEGAN: Speech Enhancement Generative Adversarial Network (2017), https://arxiv.org/pdf/1703.09452.pdf
[6] Y. Luo and N. Mesgarani, Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation (2019), https://arxiv.org/pdf/1809.07454.pdf
[7] D. Stoller, S. Ewert and S. Dixon, Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation (2018), https://arxiv.org/pdf/1806.03185.pdf
If you are reading this story, we probably share similar interests or work in the same industry, in which case you are welcome to contact me. Find me on LinkedIn.