Text To Speech — Foundational Knowledge (Part 2)

Knowledge required to train, synthesize, and implement the latest TTS algorithms: part 2 in a zero-to-hero series on Machine Learning Audio utilizing ESPnet

Aaron Brown
Towards Data Science


Source: Giphy

Background:

In part 1, we went through a simple implementation of ESPnet TTS to showcase how rapidly speech can be synthesized. Still, if you were anything like me, you probably wondered how this all works behind the scenes! I had a very tough time finding a consolidated article or source that did a halfway decent job explaining it. This post summarizes numerous survey/research papers, YouTube lecture videos, and blog posts, and it really wouldn’t have been possible without this excellent survey paper, which provided many of the visuals and descriptions. I’m attempting to “synthesize” my foundational findings here with the hope that it will help you better understand:

  • Basic Audio Terminology & Frameworks
  • Evolution of Text To Speech Algorithms
  • Deep Dive in Acoustic Models & Neural Vocoder Algorithms
  • Next Generation End-to-End TTS

Basic Audio Terminology:

Waveform:

A computer represents an audio signal as a sequence of amplitude samples over a fixed time frame. Each sample usually takes on one of 65,536 values (16 bits), and the sampling rate is measured in kHz. The cyclical components in a waveform are the principal components for the approaches we’ll discuss below.
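To make this concrete, here is a small sketch of my own (a toy example, not from part 1) that builds a one-second 440 Hz sine wave and writes it to disk as 16-bit audio using the soundfile package:

```python
import numpy as np
import soundfile as sf

sample_rate = 22050                              # 22.05 kHz, a common rate for TTS corpora
t = np.linspace(0, 1.0, sample_rate, endpoint=False)
waveform = 0.5 * np.sin(2 * np.pi * 440 * t)     # one second of a 440 Hz tone

# 16-bit audio: each sample is stored as one of 2**16 = 65,536 integer values.
sf.write("tone.wav", waveform, sample_rate, subtype="PCM_16")
```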

Source: Giphy (Audio Waveform)

Phonemes:

The text-to-speech engine doesn’t directly take characters as an input, but phonemes, specifically stressed ARPA (for English) as found in the CMU Dict. For example, green is G R IY1 N. Other languages may have different formats but the same usage. That is why we will convert our input text (during training) into phonemes that are distinct units of sounds that distinguish one word from another in a language. When your input is processed, it gets turned into phonemes by a neural network trained on known utterances, which also learns to generate spelling for novel words.
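If you want to see this conversion in action, the g2p_en package (a common English grapheme-to-phoneme front end) maps words to stressed ARPAbet phonemes; a quick sketch, with the expected output shown as comments:

```python
from g2p_en import G2p

g2p = G2p()
print(g2p("green"))    # expected: ['G', 'R', 'IY1', 'N'] from the CMU dictionary entry
print(g2p("espnet"))   # an out-of-dictionary word still gets a predicted pronunciation
```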

Spectrograms:

Most deep learning TTS models don’t take raw audio directly as input, so the audio is converted into spectrograms using the short-time Fourier transform (STFT), which maps the source audio into the time-frequency domain. The transformation process chops the sound signal into short overlapping frames, transforms each frame, and then combines the outputs into a single view.

Example Spectrogram (Author)

The color in the figure above encodes the audio’s energy in decibels. However, we can see that this linear-frequency view doesn’t capture much useful detail from this audio clip.
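For reference, a spectrogram like the one above can be computed in a few lines with librosa; this is a sketch using arbitrary STFT parameters rather than anything tuned for TTS:

```python
import numpy as np
import librosa

waveform, sample_rate = librosa.load("tone.wav", sr=None)   # keep the original sample rate

# Short-time Fourier transform: frame the signal, transform each frame,
# then take magnitudes and convert to decibels for visualization.
stft = librosa.stft(waveform, n_fft=1024, hop_length=256)
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
print(spectrogram_db.shape)   # (frequency bins, time frames)
```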

Mel-Spectrogram:

Human perception of frequency is not linear: how we judge the distance between two pitches depends on where they sit on the frequency axis. This is why the Mel scale was developed. Its crux is that it uses the decibel scale for amplitudes (how loud) and a logarithmic scale for frequencies (pitch). Put simply, the voice features we care about are all captured in the Mel-spectrogram.

Example Mel-Spectrogram (Author)

We can see that the Mel-spectrogram provides a much clearer picture, measured in decibels and better suited as input to our TTS model. If you want a deeper mathematical understanding, check out this great post here.
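Continuing the sketch above, librosa can also produce the Mel-spectrogram directly. Using 80 Mel bins is a common choice for the acoustic models discussed later, but the exact settings here are my assumptions, not ESPnet’s defaults:

```python
import numpy as np
import librosa

waveform, sample_rate = librosa.load("tone.wav", sr=None)

# Mel-spectrogram: an STFT followed by a Mel filter bank, then converted to dB.
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sample_rate, n_fft=1024, hop_length=256, n_mels=80
)
mel_db = librosa.power_to_db(mel, ref=np.max)
print(mel_db.shape)   # (80 Mel bins, time frames)
```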

Audio Evaluation Metric:

The most common numeric metric that ML engineers use to evaluate speech synthesis quality is the Mean Opinion Score (MOS), which ranges from 1 to 5, with everyday human speech typically scoring around 4.5 to 4.8. To check benchmarks, you can look here, but note that the leaderboard doesn’t currently include all TTS algorithms.

Source: Mean Opinion Score from Author
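MOS is simply the average of listener ratings on that 1-to-5 scale, usually reported with a confidence interval. A toy sketch of how you might compute it, with made-up ratings:

```python
import numpy as np

# Hypothetical listener ratings (1-5) for one synthesized audio clip.
ratings = np.array([4, 5, 4, 4, 3, 5, 4, 4])

mos = ratings.mean()
ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))   # rough 95% confidence interval
print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```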

Autoregressive (AR):

Describes a model that predicts future values based on past values, with the underlying assumption that the future will resemble the past. This matters in the audio domain because each generated spectrogram frame or waveform sample depends on the ones generated before it. There is also a general trade-off between speed and quality: autoregressive generation is slower but typically higher quality, while non-autoregressive generation is the inverse.
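To make the distinction concrete, here is a toy sketch (not a real TTS model) contrasting sample-by-sample autoregressive generation with one-shot non-autoregressive generation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Autoregressive: samples must be produced one at a time because each new
# sample depends on the previous one (slower, but informed by the past).
def generate_autoregressive(num_samples, coeff=0.95):
    samples = np.zeros(num_samples)
    for t in range(1, num_samples):
        samples[t] = coeff * samples[t - 1] + rng.normal(scale=0.1)
    return samples

# Non-autoregressive: every sample is produced in one shot (much faster, but
# later samples cannot be conditioned on earlier generated ones).
def generate_non_autoregressive(num_samples):
    return rng.normal(scale=0.1, size=num_samples)

print(generate_autoregressive(5))
print(generate_non_autoregressive(5))
```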

Statistical Parametric Speech Synthesis (SPSS):

A Text To Speech approach developed to tackle the shortcomings of traditional concatenative TTS. This method synthesizes speech by first generating the acoustic parameters required for speech and then recovering the speech signal from those parameters using vocoder algorithms. The mainstream 2-Stage framework described below is SPSS-based.

Mainstream 2-Stage Framework:

As a review, TTS has evolved from concatenative synthesis to parametric synthesis to neural network-based synthesis. In part 1, we broke down the Mainstream 2-Stage method, reviewed below.

Source: Image By Author (Mainstream 2-Stage High-Level Architecture)

Text To Speech Frameworks:

Source: A Survey on Neural Speech Synthesis

The figure above shows the five types of TTS frameworks in use today, with stages 3 and 4 being the subject of this post.

The current mainstream 2-Stage (Acoustic Model + Vocoder) neural network-based models have significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2 and FastSpeech 2) first generate a Mel-spectrogram from text and then synthesize speech from that Mel-spectrogram using a neural vocoder such as WaveNet. There is also an ongoing evolution towards next-generation fully End-to-End TTS models, which we’ll discuss later.
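As a reminder of what this looks like in practice (mirroring part 1), the sketch below uses ESPnet2’s inference helper with a pretrained Tacotron 2 model. The model tag and the returned dictionary keys are assumptions on my part and may differ across ESPnet versions:

```python
from espnet2.bin.tts_inference import Text2Speech

# Stage 1 + 2 bundled behind one helper: the acoustic model predicts a
# Mel-spectrogram and the attached vocoder converts it into a waveform.
tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_tacotron2")  # assumed model tag
output = tts("Text to speech is fun.")
mel = output["feat_gen"]   # intermediate Mel-spectrogram (acoustic model output)
wav = output["wav"]        # final waveform (vocoder output)
```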

The Evolution Of Text To Speech:

Source: A Survey on Neural Speech Synthesis

This spider-web figure above clearly shows that the field of synthetic Text To Speech (TTS) using neural networks has exploded over just the past few years, and that there is a recent trend away from the Mainstream 2-Stage (Acoustic Model + Vocoder) approach towards Next Generation End-to-End models.

In the data-flow figure below, we can see how the different neural-based TTS algorithms start with raw text (characters) and generate waveforms. The figures above and below both greatly help contextualize the evolution of the algorithms and their corresponding data flows.

Source: A Survey on Neural Speech Synthesis

Acoustic Algorithms & Neural Vocoder Deep Dive:

Source: Giphy

Acoustic Algorithms:

There are three main architecture types that current acoustic algorithms use:

Recurrent Neural Network (RNN)

Source: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

The RNN is the acoustic model framework for algorithms such as the autoregressive Tacotron 2, a recurrent sequence-to-sequence feature prediction network with attention that predicts a sequence of Mel-spectrogram frames from an input character sequence. A modified version of WaveNet then generates time-domain waveform samples conditioned on the predicted Mel-spectrogram frames. The Tacotron 2 architecture was a massive leap forward in voice quality over other methods such as concatenative synthesis, parametric synthesis, and the original Tacotron. The architecture for Tacotron 2 is shown above.
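As a rough mental model only (this is nowhere near the real Tacotron 2 architecture; mean-pooling stands in for the attention mechanism and all sizes are made up), the autoregressive frame-by-frame decoding loop looks something like this:

```python
import torch
import torch.nn as nn

class TinySeq2SeqTTS(nn.Module):
    def __init__(self, vocab_size=40, hidden=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder_cell = nn.LSTMCell(n_mels + hidden, hidden)
        self.mel_out = nn.Linear(hidden, n_mels)
        self.n_mels = n_mels

    def forward(self, tokens, num_frames=100):
        enc_out, _ = self.encoder(self.embed(tokens))   # (batch, tokens, hidden)
        context = enc_out.mean(dim=1)                   # crude stand-in for attention
        batch = tokens.size(0)
        frame = torch.zeros(batch, self.n_mels)         # "go" frame
        h = c = torch.zeros(batch, context.size(1))
        mels = []
        for _ in range(num_frames):                     # autoregressive loop
            h, c = self.decoder_cell(torch.cat([frame, context], dim=-1), (h, c))
            frame = self.mel_out(h)                     # next frame depends on the last one
            mels.append(frame)
        return torch.stack(mels, dim=1)                 # (batch, num_frames, n_mels)

mel = TinySeq2SeqTTS()(torch.randint(0, 40, (1, 12)), num_frames=20)
print(mel.shape)   # torch.Size([1, 20, 80])
```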

Convolutional Neural Network (CNN)

Algorithms like DeepVoice 3 leverage a fully-convolutional network structure for speech synthesis, generating Mel-spectrograms from characters and scaling up to real-world multi-speaker datasets. This class of acoustic algorithms is similar to how mainstream CNNs classify dogs vs. cats by training on images of each class, but in this instance the network is trained on Mel-spectrograms. DeepVoice 3 improves over the previous DeepVoice 1/2 systems by using a more compact sequence-to-sequence model and directly predicting Mel-spectrograms instead of complex linguistic features.
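Here is a minimal sketch of the fully-convolutional idea (illustrative only, not DeepVoice 3’s actual architecture): stacked 1-D convolutions over the phoneme sequence replace recurrence, so all time steps are processed in parallel.

```python
import torch
import torch.nn as nn

embed = nn.Embedding(40, 128)                           # hypothetical phoneme vocabulary of 40
convs = nn.Sequential(
    nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(128, 80, kernel_size=5, padding=2),       # project to 80 Mel bins
)

phonemes = torch.randint(0, 40, (1, 32))                # (batch, sequence length)
x = embed(phonemes).transpose(1, 2)                     # (batch, channels, time)
mel = convs(x)                                          # (batch, 80, time), all steps in parallel
print(mel.shape)                                        # torch.Size([1, 80, 32])
```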

Transformer

Source: Microsoft Research Blog FastSpeech2 by Xu Tan

Transformer-based (self-attention) acoustic algorithms like FastSpeech 1/2 leverage a transformer-based encoder-attention-decoder architecture to generate Mel-spectrograms from phonemes. They are a derivation of transformer networks, an architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. These algorithms dramatically speed up speech synthesis by using a feed-forward transformer network to generate the Mel-spectrogram frames in parallel, in contrast to models (e.g., Tacotron 2) that use autoregressive encoder-attention-decoders. Most notably, the fully end-to-end FastSpeech 2s variant simplifies the pipeline by removing the Mel-spectrogram as an intermediate output entirely and directly generating the speech waveform from text during inference, enjoying the benefits of complete end-to-end joint optimization in training and low latency in inference. FastSpeech 2 achieves better voice quality than FastSpeech 1 while maintaining the advantages of fast, robust, and controllable speech synthesis; this can be visualized in the FastSpeech 2 figure above. Importantly, take note of the variance adaptor as the main differentiator between FastSpeech 2 and other acoustic algorithms/frameworks.
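The trick that lets FastSpeech generate all frames in parallel is the length regulator: each phoneme’s hidden state is repeated according to a predicted duration so the expanded sequence lines up with the Mel-spectrogram frames. A minimal sketch of that single idea (my own toy version, not ESPnet’s implementation):

```python
import torch

def length_regulator(phoneme_hidden, durations):
    # phoneme_hidden: (num_phonemes, hidden_dim) encoder outputs
    # durations: (num_phonemes,) integer number of frames predicted per phoneme
    # Repeat each phoneme's hidden state by its duration so the decoder can
    # process every Mel frame in parallel instead of autoregressively.
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

hidden = torch.randn(4, 256)                    # 4 phonemes, e.g. G R IY1 N
durations = torch.tensor([3, 2, 6, 4])          # hypothetical predicted durations
frames = length_regulator(hidden, durations)
print(frames.shape)                             # torch.Size([15, 256])
```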

Neural Vocoder Algorithms:

Neural vocoders transform the acoustic model’s output spectrogram into the target audio waveform (a.k.a. synthetic speech). Non-autoregressive models are the most promising in terms of speed, but their quality is not yet on par with autoregressive models.

There are essentially four main types of vocoders:

  • Autoregressive: WaveNet was the first neural network-based vocoder; it uses dilated convolutions to generate waveform samples autoregressively. The novel part of this approach is that it requires almost no prior knowledge of the input audio signal and instead relies on end-to-end learning.
  • Flow-based: A generative model that transforms a probability density through a sequence of invertible mappings. This approach has two distinct frameworks: one that uses autoregressive transformations and another that uses bipartite transformations.
  • GAN-based: Modeled after the Generative Adversarial Networks (GANs) typically used for image generation. Any GAN framework consists of a generator that produces data and a discriminator that judges the generator’s data. The analogy I always think of is robbers and cops, with robbers always dreaming up new schemes and cops trying to thwart them. Most current GAN-based vocoders use dilated convolutions to increase the receptive field, modeling the long-range dependencies in the waveform sequence, and transposed convolutions to upsample the conditioning information to match the waveform length (see the sketch after this list). The discriminators, in turn, are designed to capture waveform characteristics and provide a better guiding signal for the generator. The loss function improves the stability and efficiency of adversarial training and improves audio quality. As seen in the table below, many modern neural vocoders are GAN-based and use various combinations of generator, discriminator, and loss function.
Source: A Survey on Neural Speech Synthesis

Diffusion-based: These vocoders use denoising diffusion probabilistic models, with the intuition that data and latent distributions are mapped to each other through a forward diffusion process and a reverse denoising process. Current research shows that this class of vocoders produces high-quality speech but requires significantly more time for inference.
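For the GAN-based family above, here is an illustrative (and deliberately tiny) generator sketch showing the two building blocks mentioned: transposed convolutions for upsampling and a dilated convolution for a wider receptive field. Real GAN vocoders are far deeper and upsample by the full hop size (e.g., 256x); everything below is an assumption made for illustration.

```python
import torch
import torch.nn as nn

generator = nn.Sequential(
    nn.ConvTranspose1d(80, 256, kernel_size=16, stride=8, padding=4),   # 8x temporal upsample
    nn.LeakyReLU(),
    nn.ConvTranspose1d(256, 128, kernel_size=16, stride=8, padding=4),  # another 8x upsample
    nn.LeakyReLU(),
    nn.Conv1d(128, 1, kernel_size=7, dilation=3, padding=9),            # dilated conv, 1 waveform channel
    nn.Tanh(),
)

mel = torch.randn(1, 80, 100)    # (batch, Mel bins, frames)
wav = generator(mel)
print(wav.shape)                 # torch.Size([1, 1, 6400])
```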

The image below showcases all the types of neural vocoders and their architecture.

Source: A Survey on Neural Speech Synthesis

Next Generation End-to-End TTS

Source: Image By Author (Next Generation End-to-End TTS Framework)

The current research and evolution of TTS, as shown in the evolution figure, is moving towards End-to-End TTS, though it hasn’t hit critical mass yet due to current quality limitations and training-time resource requirements compared to the mainstream 2-Stage method. With that said, End-to-End TTS has some distinct advantages compared to SPSS:

  • Dramatically reduces development and deployment costs.
  • A joint architecture with end-to-end optimization lessens the error propagation seen in the current mainstream 2-stage method.
  • Requires less overall human annotation and feature development.
  • Conventional acoustic models require explicit alignments between linguistic and acoustic features, while sequence-to-sequence neural models implicitly learn the alignments through attention or predict the durations jointly, which is more end-to-end and requires less preprocessing.
  • As the modeling power of neural networks has increased, the linguistic features have been simplified to plain character or phoneme sequences, and the acoustic features have moved from low-dimensional, condensed representations to high-dimensional Mel-spectrograms or linear spectrograms.
Source: A Survey on Neural Speech Synthesis

In the figure above, we can see an overview of the End-to-End TTS algorithms proposed so far, though not all of them have open-source code.

FastSpeech 2s has been deployed to Microsoft’s managed Azure TTS service, which for me clearly proves out the future state of the field in an applied commercial form. Luckily for us, the open-source ESPnet 2 has the Conditional Variational Autoencoder with Adversarial Learning (VITS) model available now for use, and I plan to cover it practically in a future post.
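As a teaser, end-to-end synthesis with VITS collapses the two stages into one call; the sketch below mirrors the earlier 2-stage example, and again the model tag and return keys are assumptions that may differ by ESPnet version:

```python
from espnet2.bin.tts_inference import Text2Speech

# End-to-end: the VITS model maps text straight to a waveform,
# with no separate vocoder stage.
tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_vits")  # assumed model tag
wav = tts("End to end text to speech with VITS.")["wav"]
```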

What's Next?

Now that we have gotten our feet wet in part 1 doing some simple TTS with ESPnet 2 and have built a solid foundational understanding of acoustic algorithms, neural vocoders, and overall TTS architecture, let’s apply it! Future posts will cover how to conduct speech enhancement and Automatic Speech Recognition, and how to train on our own voice within ESPnet!

Source: Giphy

References

As stated at the outset, this post wouldn’t have been possible without the precise wording, descriptions, and figures that the Microsoft Research Asia team produced in “A Survey on Neural Speech Synthesis”.
