FastSpeech: Paper Overview & Implementation

Learn about text-to-speech and how it’s realized by transformers

Essam Wisam
Towards Data Science

--

In 2019, FastSpeech pushed the frontier of neural text-to-speech by offering a significant improvement in inference speed while maintaining the robustness needed to prevent word repetition or omission. It also allowed the output speech to be controlled in terms of speed and prosody.

In this story, we aim to familiarize you with how transformers are employed for text-to-speech, provide a concise overview of the FastSpeech paper, and show you how you can implement it from scratch. Throughout, we will assume that you are familiar with transformers and their different components. If not, we highly recommend reviewing the preceding article, which delves into this topic.

Van Gogh-style Painting Featuring a Transformer Speaking into a Microphone at a Podium — Generated by Author using Canva

Table of Contents

· Background
Introduction
Mel Spectrogram
· Paper Overview
Introduction
Experiments and Results
Architecture
Encoder
Length Regulator
Decoder
· Implementation
Strategy
Full Implementation

Background

Introduction

Traditional text-to-speech (TTS) models relied on concatenative and statistical techniques. Concatenative techniques synthesize speech by stitching together sounds from a database of phoneme recordings (phonemes being the distinct units of sound in a language). Statistical techniques (e.g., HMMs) attempt to model basic properties of speech that are sufficient to generate a waveform. Both approaches often struggle to produce natural-sounding or expressive speech; in other words, they tend to produce unnatural or robotic speech for the given text.

The quality of synthesized speech has been significantly improved by using deep learning (neural networks) for TTS. Such methods usually consist of two main models: the first takes in text and outputs a corresponding Mel Spectrogram, and the second (called a vocoder) takes in the Mel Spectrogram and synthesizes speech.

Mel Spectrogram

Spectrogram by The Official CTBTO Photostream on Flickr CC BY-SA 2.0.

In its most basic form, a speech waveform is just a sequence of amplitudes that represent variations in air pressure over time. We can transform any waveform into a corresponding Mel Spectrogram (a matrix indicating the magnitude of different frequencies at different time windows of the original waveform) using the short-time Fourier transform (STFT). Going in this direction is easy; the inverse, however, is considerably harder, and the best systematic methods (e.g., Griffin-Lim) can yield coarse results. A preferred approach is to train a model for this task; existing models trained for it include WaveGlow and WaveNet.
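As a concrete illustration, here is a minimal sketch of the forward direction using the librosa library; the STFT parameters shown are illustrative choices, not FastSpeech's exact settings. The inverse direction, spectrogram to waveform, is what the vocoder has to learn.

```python
import librosa

# Load a waveform; librosa ships with short example clips.
waveform, sample_rate = librosa.load(librosa.example("trumpet"), sr=22050)

# Short-time Fourier transform followed by a projection onto the mel scale.
# n_fft, hop_length and n_mels are illustrative values, not the paper's exact settings.
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sample_rate, n_fft=1024, hop_length=256, n_mels=80
)

# Magnitudes are usually compressed to a log/decibel scale before a TTS model predicts them.
log_mel = librosa.power_to_db(mel)

print(log_mel.shape)  # (n_mels, n_frames): one 80-dimensional column per time window
```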

Thus, to reiterate, deep learning methods often approach text-to-speech by training a model to predict the Mel Spectrogram of the speech corresponding to many instances of text, and then relying on another model (the vocoder) to map the predicted spectrogram to audio. FastSpeech uses the WaveGlow model by Nvidia as its vocoder.

A Happy Transformer Writing a Research Paper, Painted in Van Gogh Style. — Generated by Author using Canva

Paper Overview

Introduction

Although recent transformer-based TTS methods have drastically improved speech quality over traditional methods, there remain three main issues with these models:

  • They suffer from slow inference speed because the transformer’s decoder is autoregressive. That is, they generate chunks of the Mel Spectrogram sequentially, relying on previously generated chunks. This also holds for older deep learning models based on RNNs and CNNs.
  • They are not robust; word skipping or repetition may occur due to small errors in attention scores (a.k.a. alignments) that propagate during sequential generation.
  • They lack an easy way to control features of the generated speech such as speed or prosody (e.g., intonation).

FastSpeech attempts to solve all three issues. The key differences from other transformer architectures are that:

  • The decoder is non-autoregressive and hence perfectly parallelizable, which solves the speed issue.
  • It uses a length regulator component just before the decoder that attempts to ensure ideal alignment between phonemes and the Mel Spectrogram, and it drops the cross-attention component.
  • The way the length regulator operates allows easy control of speech speed via a hyperparameter. Minor properties of prosody, such as pause durations, can be controlled in a similar fashion.
  • In return, for the purposes of the length regulator, it uses sequence-level knowledge distillation during training. In other words, it relies on another, already-trained text-to-speech model (a Transformer TTS model) for training.

Experiments and Results

The authors used the LJSpeech dataset, which contains about 24 hours of audio spread across 13,100 audio clips (each paired with its corresponding input text). The training task is to feed in the text and have the model predict the corresponding spectrogram. About 95.6% of the data was used for training, and the rest was split between validation and testing.

  • Inference Speed Up
    FastSpeech speeds up inference by 38x (or 270x when the vocoder is excluded) compared to the autoregressive Transformer TTS model; hence the name FastSpeech.
  • Audio Quality
    Using the mean opinion score of 20 native English speakers, the authors showed that FastSpeech closely matches the quality of the Transformer TTS model and Tacotron 2 (state-of-the-art at the time).
  • Robustness
    FastSpeech outperformed Transformer TTS and Tacotron 2 with a zero-error rate (in terms of skips and repetitions) on 50 challenging text-to-speech examples, compared to 24% and 34% for Transformer TTS and Tacotron 2 respectively.
  • Controllability
    The authors presented examples to demonstrate that speed and pause duration control work.
  • Ablation
    The authors confirm the effectiveness of decisions like integrating 1D convolutions in the transformer and employing sequence-level knowledge distillation. They reveal performance degradation (in terms of the mean opinion score) in the absence of each decision.

Architecture

FastSpeech Architecture Figure from the FastSpeech paper

The first figure portrays the whole architecture, which consists of an encoder, a length regulator, and a decoder.

The Feed-Forward Transformer (FFT) block is used in both the encoder and the decoder. It is similar to the encoder layer of the transformer but swaps out the position-wise FFN for a 1D convolutional network. A hyperparameter N represents the number of FFT blocks (connected sequentially) in the encoder and in the decoder; N is set to 6 in the paper.
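Here is a minimal PyTorch sketch of such a block. This is not the reference implementation; the default hyperparameters follow the paper's reported settings, but the exact layer arrangement (e.g., where dropout and normalization are applied) is a simplifying assumption.

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """Feed-Forward Transformer block: multi-head self-attention followed by a
    two-layer 1D convolutional network, each with a residual connection and layer norm."""

    def __init__(self, emb_dim=384, n_heads=2, conv_hidden=1536, kernel_size=3, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(emb_dim, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(emb_dim)
        # The position-wise FFN of the vanilla transformer is swapped for 1D convolutions.
        self.convs = nn.Sequential(
            nn.Conv1d(emb_dim, conv_hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(conv_hidden, emb_dim, kernel_size, padding=kernel_size // 2),
        )
        self.norm2 = nn.LayerNorm(emb_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, seq_len, emb_dim)
        attn_out, _ = self.attention(x, x, x)  # self-attention over the whole sequence
        x = self.norm1(x + self.dropout(attn_out))
        conv_out = self.convs(x.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (batch, channels, seq)
        x = self.norm2(x + self.dropout(conv_out))
        return x
```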

The length regulator (third figure) adjusts the sequence length of its input based on the duration predictor, which is the simple network shown in the fourth figure.
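The duration predictor, per the paper, is a two-layer 1D convolutional network with ReLU, layer normalization and dropout, topped by a linear layer that outputs one scalar per character (in the logarithmic domain). A minimal sketch, reusing the imports above (the layer sizes are illustrative):

```python
class DurationPredictor(nn.Module):
    """Predicts, for each character, the (log) number of Mel Spectrogram units it spans."""

    def __init__(self, emb_dim=384, hidden=256, kernel_size=3, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(emb_dim, hidden, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(hidden)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden, 1)

    def forward(self, x):                                        # x: (batch, seq_len, emb_dim)
        x = torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        x = self.dropout(self.norm1(x))
        x = torch.relu(self.conv2(x.transpose(1, 2))).transpose(1, 2)
        x = self.dropout(self.norm2(x))
        return self.linear(x).squeeze(-1)                        # (batch, seq_len) log-durations
```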

You should be able to intuit that the data flow then takes the following form: text → encoder → length regulator → decoder → Mel Spectrogram.

Encoder

The encoder takes a sequence of integers corresponding to the characters in the text. A grapheme-to-phoneme converter can be used to convert the text into a sequence of phonetic characters, as mentioned in the paper; however, we will simply use letters as the character unit and assume that the model can learn any phonetic representation it needs during training. Thus, for the input “Say hello!”, the encoder takes a sequence of 10 integers corresponding to [“S”, “a”, “y”, …, “!”].

As in the transformer, the purpose of the encoder is to assign each character a rich vector representation that takes into account the character itself, its position, and its relationships with the other characters in the given text. Also as in the transformer, the dimensionality of these vectors is maintained throughout the encoder for Add & Norm purposes.

For an input sequence with n characters, the encoder outputs [h₁,h₂,…,hₙ] where each representation has dimensionality emb_dim.
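Building on the FFTBlock sketch above, the encoder can be sketched as a character embedding, a positional embedding, and a stack of N FFT blocks. A learned positional embedding is used here purely for brevity; sinusoidal encodings are equally valid.

```python
class Encoder(nn.Module):
    """Maps a sequence of character indices to contextual representations h1, ..., hn."""

    def __init__(self, vocab_size=128, emb_dim=384, n_blocks=6, max_len=2048):
        super().__init__()
        self.char_embedding = nn.Embedding(vocab_size, emb_dim)
        self.pos_embedding = nn.Embedding(max_len, emb_dim)   # learned positions, for simplicity
        self.blocks = nn.ModuleList([FFTBlock(emb_dim) for _ in range(n_blocks)])

    def forward(self, char_ids):                               # char_ids: (batch, seq_len) integers
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        x = self.char_embedding(char_ids) + self.pos_embedding(positions)
        for block in self.blocks:
            x = block(x)
        return x                                                # (batch, seq_len, emb_dim)
```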

Length Regulator

The purpose of the length regulator is simply to repeat the encoder representation assigned to each character. The idea is that the pronunciation of each character in the text generally corresponds to multiple (or zero) Mel Spectrogram units (to be generated by the decoder); it’s not just one unit of sound. By a Mel Spectrogram unit, we mean one column in the Mel Spectrogram, which assigns a frequency distribution to the time window corresponding to that column and thus corresponds to actual sound in the waveform.

The length regulator operates as follows:

  1. Predict the number of Mel Spectrogram units of each character.
  2. Repeat the encoder representation according to that number.

For instance, consider the encoder representations [h₁, h₂, h₃, h₄, h₅, h₆] of the input characters corresponding to “knight”. The following happens at inference time:

  1. The length regulator passes each representation to the duration predictor, which uses it (and, implicitly, its relationships with all other characters in the text) to predict a single integer: the number of Mel Spectrogram units for the corresponding character.
  2. Suppose the duration predictor returns [1, 2, 3, 2, 1, 1]; the length regulator then repeats each hidden state according to its predicted duration, which yields [h₁, h₂, h₂, h₃, h₃, h₃, h₄, h₄, h₅, h₆]. The length of this new sequence (10) is the length of the Mel Spectrogram.
  3. It passes this new sequence to the decoder.

Note that in a real setting, passing “knight” to FastSpeech and inspecting the output of the duration predictor yielded [1, 8, 15, 3, 0, 17]. Notice that the letters k, g, and h contribute negligibly to the Mel Spectrogram compared to the other letters. Indeed, what’s really pronounced when the word is spoken is mostly the n, i, and t.

Controllability
It’s easy to control speed by scaling the predicted durations. For example, if [1, 8, 15, 3, 0, 17] is doubled, it will take twice as long to say the word “knight” (0.5x speed), and if the durations are halved (then rounded), it will take half the time (2x speed). It’s also possible to change only the durations corresponding to specific characters (e.g., spaces) to control the length of their pronunciation (e.g., pause duration).
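Putting these pieces together with the DurationPredictor sketched earlier, and folding the speed-control factor into a parameter (here called alpha, an assumed name), a minimal length regulator for inference with batch size 1 could look like this:

```python
class LengthRegulator(nn.Module):
    """Expands each character representation according to its predicted duration."""

    def __init__(self, emb_dim=384):
        super().__init__()
        self.duration_predictor = DurationPredictor(emb_dim)

    def forward(self, h, alpha=1.0):                       # h: (1, seq_len, emb_dim), batch size 1
        log_durations = self.duration_predictor(h)         # (1, seq_len), log domain
        # alpha > 1 stretches durations (slower speech); alpha < 1 shrinks them (faster speech).
        durations = torch.round(torch.exp(log_durations) * alpha).long().clamp(min=0)
        # Repeat each character's vector as many times as its predicted duration.
        expanded = torch.repeat_interleave(h[0], durations[0], dim=0)
        return expanded.unsqueeze(0)                        # (1, total_mel_frames, emb_dim)
```

At inference time the durations come from the duration predictor as shown; during training they come from a teacher model instead, as described next.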

Training

During training, FastSpeech doesn’t obtain durations from the duration predictor (which is not yet trained); instead, it extracts them from the attention matrices of an already trained Transformer TTS model.

  • Cross-attention in that transformer associates each character and Mel Spectrogram unit with an attention score via an attention matrix.
  • Thus, to predict the number of Mel Spectrogram units (duration) of a character c during the training of FastSpeech, it counts the number of Mel Spectrogram units that had maximum attention towards that character using the cross-attention matrix in the TTS Transformer.
  • Because cross-attention involves multiple attention matrices (one for each head), it does this operation on the attention matrix that is most “diagonal”. It could be that this ensures realistic alignment between the characters and Mel Spectrogram units.
  • It uses these durations to train the duration predictor as well (a simple regression task). This way, the teacher model is not needed during inference. A minimal sketch of the extraction step follows below.
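Here is what that extraction could look like in code. The attention-tensor layout and the diagonality measure used to pick a head are simplifying assumptions, not the paper's exact head-selection criterion.

```python
def durations_from_teacher(attention, n_chars):
    """attention: (n_heads, n_mel_frames, n_chars) cross-attention weights taken
    from the teacher Transformer TTS model; returns one duration per character."""
    n_heads, n_frames, _ = attention.shape

    # Pick the head whose attention mass lies closest to the diagonal
    # (a simplified stand-in for the paper's head-selection criterion).
    frame_pos = torch.linspace(0, 1, n_frames).unsqueeze(1)          # (n_frames, 1)
    char_pos = torch.linspace(0, 1, n_chars).unsqueeze(0)            # (1, n_chars)
    off_diagonal_cost = (attention * (frame_pos - char_pos).abs()).sum(dim=(1, 2))
    best_head = attention[off_diagonal_cost.argmin()]                # (n_frames, n_chars)

    # Each mel frame is assigned to the character it attends to most;
    # a character's duration is simply how many frames were assigned to it.
    assignments = best_head.argmax(dim=1)                            # (n_frames,)
    return torch.bincount(assignments, minlength=n_chars)            # (n_chars,) durations
```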

Decoder

The decoder receives this new representation and aims to predict the frequency content (a vector) of each Mel Spectrogram unit. This is tantamount to predicting the entire spectrogram corresponding to the text, which can then be transformed into audio using a vocoder.

The decoder follows an architecture similar to the encoder’s. The difference is that instead of beginning with an embedding layer, it ends with a linear layer. This layer produces the frequency vector for each Mel Spectrogram unit from the rich feature representations that the earlier FFT blocks in the decoder have formed.

The number of frequencies, n_mels, is a hyperparameter of this layer; it is set to 80 in the paper.
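Mirroring the encoder, a minimal decoder sketch reuses the FFTBlock and ends with the linear projection to n_mels frequencies (the positional handling is again a simplification):

```python
class Decoder(nn.Module):
    """Maps the expanded representations to one frequency vector per Mel Spectrogram unit."""

    def __init__(self, emb_dim=384, n_blocks=6, n_mels=80, max_len=4096):
        super().__init__()
        self.pos_embedding = nn.Embedding(max_len, emb_dim)
        self.blocks = nn.ModuleList([FFTBlock(emb_dim) for _ in range(n_blocks)])
        self.to_mel = nn.Linear(emb_dim, n_mels)     # the final linear layer described above

    def forward(self, x):                             # x: (batch, total_mel_frames, emb_dim)
        positions = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_embedding(positions)
        for block in self.blocks:
            x = block(x)
        return self.to_mel(x)                         # (batch, total_mel_frames, n_mels)
```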

A Modern Futuristic Transformer Programming a Computer, Painted in Van Gogh Style — Generated by Author using Canva

Implementation

FastSpeech Architecture Figure from the FastSpeech paper

Strategy

The FastSpeech architecture corresponds to an encoder, a length regulator, and a decoder, as shown in the figure above.

We will start with implementing the building blocks of the FFT block (multi-head self-attention and the 1D convolutional network) and then the FFT block itself, so that we can implement the encoder and decoder, as their composition is essentially a stack of FFT blocks (preceded by an embedding layer in the encoder and followed by a linear layer in the decoder).

Now all we need is the length regulator, along with its duration predictor, because once that is done the last step is to compose the three modules into the full FastSpeech model.
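Under the same simplifying assumptions as the sketches scattered throughout this article, that final composition might look like this:

```python
class FastSpeech(nn.Module):
    """Composition of the Encoder, LengthRegulator and Decoder sketched above."""

    def __init__(self, vocab_size=128, emb_dim=384, n_mels=80):
        super().__init__()
        self.encoder = Encoder(vocab_size, emb_dim)
        self.length_regulator = LengthRegulator(emb_dim)
        self.decoder = Decoder(emb_dim, n_mels=n_mels)

    def forward(self, char_ids, alpha=1.0):
        h = self.encoder(char_ids)                          # one vector per character
        expanded = self.length_regulator(h, alpha=alpha)    # one vector per mel frame
        return self.decoder(expanded)                       # predicted Mel Spectrogram

# Hypothetical usage: 10 random character indices standing in for "Say hello!".
model = FastSpeech()
mel = model(torch.randint(0, 128, (1, 10)))
print(mel.shape)   # (1, total_mel_frames, 80)
```

Training, where the spectrogram and the durations extracted from the teacher serve as regression targets, is beyond the scope of this sketch; the linked notebook below focuses on inference.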

Full Implementation

To avoid spamming this article with a lot of code, I have prepared an annotated notebook with an organized, code-optimized, and learning-friendly version of an original implementation, for inference purposes. You can find it on GitHub or Google Colab. It is highly recommended that you understand the different components of the transformer architecture before jumping into the implementation.

A Modern Futuristic Jet Flying Towards the Stars Painted in Van Gogh style — Generated by Author using Canva

I hope the explanation provided in this story has been helpful in enhancing your understanding of FastSpeech and its architecture, while guiding you on how you can implement it from scratch. Till next time, au revoir.
