State of the Art Audio Data Augmentation with Google Brain’s SpecAugment and Pytorch

Implementing SpecAugment with Pytorch & TorchAudio

Zach C
Towards Data Science
3 min read · May 1, 2019


Google Brain recently published SpecAugment: A New Data Augmentation Method for Automatic Speech Recognition, which achieved state of the art results on various speech recognition tasks.

Unfortunately, Google Brain did not release code, and their version appears to be written in TensorFlow. For practitioners who prefer Pytorch, I’ve published an implementation of SpecAugment using Pytorch’s great companion library torchaudio and some functionality borrowed from an ongoing collaboration with other FastAI students: fastai-audio.

SpecAugment Basics

In speech recognition, raw audio is often transformed into an image-based representation. These images are typically spectrograms, which encode properties of sound in a format that many models find easier to learn.
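As a rough sketch of that step, torchaudio can turn a waveform into a mel spectrogram in a few lines. The file path and parameter values here are illustrative assumptions, not the settings from the paper or the repo:

```python
import torchaudio

# Load a mono audio file (hypothetical path, for illustration only).
waveform, sample_rate = torchaudio.load("speech.wav")  # (channels, samples)

# Convert the waveform into a mel spectrogram; the parameter values are
# illustrative defaults, not the ones used in the SpecAugment paper.
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,
    hop_length=256,
    n_mels=80,
)
spectrogram = to_mel(waveform)  # (channels, n_mels, time_steps)
```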

Instead of augmenting the raw audio signal, SpecAugment borrows ideas from computer vision and operates on spectrograms. SpecAugment works. Google Brain reports fantastic results:

SOTA results using SpecAugment

SpecAugment features three augmentations.

Time Warp

time warping a spectrogram

Put simply, Time Warp distorts the spectrogram along the time axis: interpolation is used to squeeze the data on one side of a randomly chosen point and stretch it on the other.

Time Warp is SpecAugment’s most complex and computationally expensive augmentation. Deep learning engineer Jenny Cai and I worked through TensorFlow’s sparse_image_warp functionality until we had Pytorch support.

If you’re interested in the nitty-gritty details, you can check out SparseImageWarp.ipynb in the repo. Google Brain’s research suggests that Time Warp is the least effective of the three augmentations, so if performance is an issue, you might consider dropping it first.
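If you just want a feel for the squeeze-and-stretch idea without the full sparse_image_warp port, here is a minimal sketch that resamples the two halves of the time axis with torch.nn.functional.interpolate. It is a deliberate simplification of the paper’s warp, and the time_warp name and max_warp parameter are assumptions for illustration, not the repo’s API:

```python
import random

import torch
import torch.nn.functional as F

def time_warp(spec: torch.Tensor, max_warp: int = 5) -> torch.Tensor:
    """Simplified warp for a spectrogram of shape (channels, n_mels, time_steps)."""
    _, n_mels, n_steps = spec.shape
    center = n_steps // 2
    pivot = center + random.randint(-max_warp, max_warp)  # random direction and amount
    left, right = spec[..., :center], spec[..., center:]
    # Stretch one half of the time axis out to the new pivot and squeeze the
    # other half into the remaining steps, so the total length is unchanged.
    left = F.interpolate(left.unsqueeze(0), size=(n_mels, pivot),
                         mode="bilinear", align_corners=False).squeeze(0)
    right = F.interpolate(right.unsqueeze(0), size=(n_mels, n_steps - pivot),
                          mode="bilinear", align_corners=False).squeeze(0)
    return torch.cat([left, right], dim=-1)
```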

Frequency and Time Masking

Frequency Masking and Time Masking are similar to the cutout data augmentation technique commonly used in computer vision.

Put simply, we mask a randomly chosen band of frequencies or slice of time steps with the mean value of the spectrogram or, if you prefer, zero.
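Here is a minimal sketch of both masks, assuming a spectrogram tensor shaped (channels, n_mels, time_steps). The freq_mask and time_mask names and the replace_with_zero flag are illustrative assumptions, not necessarily the repo’s exact API:

```python
import random

import torch

def freq_mask(spec: torch.Tensor, F: int = 15, replace_with_zero: bool = False) -> torch.Tensor:
    """Mask a random band of frequency bins, up to F bins wide."""
    cloned = spec.clone()
    n_mels = cloned.shape[1]
    f = random.randint(0, F)             # width of the masked band
    f0 = random.randint(0, n_mels - f)   # starting frequency bin
    cloned[:, f0:f0 + f, :] = 0.0 if replace_with_zero else cloned.mean()
    return cloned

def time_mask(spec: torch.Tensor, T: int = 20, replace_with_zero: bool = False) -> torch.Tensor:
    """Mask a random slice of time steps, up to T steps long."""
    cloned = spec.clone()
    n_steps = cloned.shape[2]
    t = random.randint(0, T)             # length of the masked slice
    t0 = random.randint(0, n_steps - t)  # starting time step
    cloned[:, :, t0:t0 + t] = 0.0 if replace_with_zero else cloned.mean()
    return cloned
```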

With time on the X axis and frequency bands on the Y axis, here’s what Time Masking looks like:

time masking a spectrogram

And here’s Frequency Masking:

frequency masking a spectrogram

Naturally, you can apply all three augmentations to a single spectrogram:

All three augmentations combined on a single spectrogram
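Chaining them is just function composition. This one-liner uses the illustrative helpers sketched above (time_warp, freq_mask, time_mask), not necessarily the repo’s exact API:

```python
augmented = time_mask(freq_mask(time_warp(spectrogram)))
```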

Hopefully these new Pytorch functions will prove useful in your deep learning workflows. Thanks for reading!
