Text To Speech — Lifelike Speech Synthesis Demo (Part 1)

The start of a zero-to-hero series utilizing ESPnet. This first post covers the current state of audio-focused machine learning, the background and architecture of modern TTS, and a hands-on demo.

Aaron Brown
Towards Data Science


Source: The Opposition from Giphy

Skip to Enough Talk Just Demo! if you want to view the code.

Background

The audio domain of machine learning is definitely at the cutting edge right now, yet a majority of the audio capabilities that products offer today are proprietary. At the same time, the community is developing many audio-specific open-source frameworks and algorithms. My goal over the next few articles is to give a deeper dive into some practical end-to-end uses of audio, from Speech To Text and Text To Speech to voice cloning. As the title mentions, this post focuses on a simple implementation of Text To Speech (speech synthesis) and serves as Part 1 of the series.

Text To Speech Use Cases

  • Personal Virtual Assistant
  • Creating your own audiobook/podcast
  • Speech-enabled websites
  • Unique NPC voices in games
  • A free, open-source way to help users with speech disabilities communicate
  • Computer Literacy Support Tools

Modern Text To Speech History

Source: YouTube, First Computer To Sing “Daisy Bell”

I came across a few open-source Text To Speech frameworks, but after a fair amount of research and experimentation, only one proved fully open-source, extensible, and easy to integrate into applications.

Using computers to synthesize speech isn’t new. One of the first landmark demonstrations of speech synthesis took place on a hulking, room-sized IBM computer at Bell Labs in 1961, when researchers recreated the song “Daisy Bell,” but audio quality remained an issue for decades. It wasn’t until the modern revolution of machine learning and advances in Deep Neural Networks (DNNs) that the domain was transformed and algorithms could give rise to human-sounding synthetic speech. A host of new audio use cases is now possible and scalable.

Training your own custom voices is extremely GPU-intensive due to the DNN requirements, taking days or even weeks. This is currently one barrier to mass adoption and integration by the community. As we’ll explore in this demo, however, with a pre-trained voice you can get results fast!

Upon initial research, you will first come across the widely popular SV2TTS from Corentin Jemine. It’s interesting for a glimpse into what’s possible, but it isn’t fully customizable, and if your use case is voice cloning, you will get subpar results. It is, though, a great example of an open-source tool whose backbone was commercialized into Resemble.ai’s main product.

The mainstream 2-stage method of TTS follows a structure similar to a Convolutional Neural Network (CNN) model: in short, a Mel-spectrogram (an “image” of audio) can be labeled with text, which opens audio up to classification-style tasks. We are essentially applying the same deep learning techniques modern CNNs use to classify cats vs. dogs, only now the images being processed are Mel-spectrograms of sound.

With the more recent push to Neural Net TTS frameworks, I came across ESPnet2 and Tensorflow TTS.

Something new could well have been released by the time you read this article, so I recommend regularly reviewing the audio domain on Papers with Code.

This path led me to the following machine learning audio frameworks:

TensorFlow TTS:

  • Solely a Text To Speech application, optimized for ease of use.
  • Ability to speed up training/inference, further optimized with quantization-aware training and pruning, leading to near-real-time results. To confirm, take a look at Ahsen Khaliq’s Hugging Face Space set up here.
  • Built to support real-time speech synthesis.

ESPnet:

  • An end-to-end speech processing toolkit that includes speech recognition and synthesis. This gives a unified neural model architecture that leads to a straightforward software design for Machine Learning Engineers.
  • Has a built-in Automatic Speech Recognition (ASR) mode based on the famous Kaldi project.
  • Extensive support for end-to-end TTS algorithms.
  • Largest and most active open-source TTS community.
  • Massive language support
  • Novel ASR Transformers that give boosted performance

Both projects are open source under Apache-2.0 License and free to use :)

If $$$$ is no barrier, you can always go with a paid, managed TTS service from any of the Cloud Service Providers (CSPs).

I elected to work with ESPnet due to its more extensive documentation, community support, and end-to-end framework. With that said, TensorFlow TTS could be the better route if you are solely looking for a more lightweight TTS architecture.

ESPnet1 to ESPnet2

Digging around the ESPnet repo, you’ll likely be confused since it holds both ESPnet1 and ESPnet2…have no fear! ESPnet2 is the latest release for DNN training and brings the following upgrades:

  • Now independent of Kaldi and Chainer, unlike ESPnet1.
  • Feature extraction and text processing run on the fly during training.
  • Improved software workflow through enhanced continuous integration, richer documentation, and support for Docker, pip install, and model zoo functions.
  • For beginners, it’s best to use ESPnet2; from my understanding, ESPnet1 remains the more customizable of the two.

Text To Speech Architecture Types

Before the demo, we must understand the different types of architectures that can be used to synthesize speech, along with how they have evolved.

Concatenative — Old School

A traditional, old-school technique that uses a stored database of recorded speech mapped to specific words. For words covered by the database you can produce understandable audio, but the output speech will lack the natural qualities of a human voice (prosody, emotion, etc.).

Mainstream 2-Stage:

A hybrid parametric TTS approach that relies on deep neural networks: an acoustic model and a neural vocoder approximate the parameters of, and the relationship between, the input text and the waveform that makes up speech.

A basic high-level overview of mainstream 2-Stage TTS System

Source: Image By Author (Mainstream 2-Stage High Level Architecture)

Text Preprocessing and Normalization:

  • Simply the precursor step for the input text: it is converted into linguistic features for the target language, in the form of a vector that is fed into the acoustic model.
  • Converts the input text into a format ESPnet can interpret. This is done through normalization (e.g., “Aug” to “August”) and grapheme-to-phoneme conversion (e.g., “August” to a phoneme sequence roughly like “AO G AH S T”); see the sketch below.
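
To make that concrete, here is a minimal sketch of grapheme-to-phoneme conversion using the g2p_en package, one of the G2P backends ESPnet’s English recipes can use. The example sentence and the printed phonemes are illustrative only.

```python
# pip install g2p_en
from g2p_en import G2p

g2p = G2p()

# Raw text in, ARPAbet-style phonemes out; numbers are expanded to words
# ("27" -> "twenty seven") as part of normalization.
text = "August 27 was a hot day."
phonemes = g2p(text)
print(phonemes)
# roughly: ['AO1', 'G', 'AH0', 'S', 'T', ' ', 'T', 'W', 'EH1', 'N', ...]
```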

Acoustic Model:

  • Algorithms optimized to convert the preprocessed/normalized text into Mel-spectrograms.
  • For a majority of the algorithms, the linguistic feature vector is converted into acoustic features and then into a Mel-spectrogram, which captures the relevant audio features (see the sketch below).
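
ESPnet handles feature extraction internally, but a quick librosa sketch shows what the acoustic model’s target representation looks like. The file name "sample.wav" and the STFT/mel settings below are placeholder assumptions, not values taken from this article.

```python
# pip install librosa
import librosa
import numpy as np

# Load any wav file and compute an 80-bin log-mel spectrogram -- the kind
# of intermediate representation a 2-stage acoustic model learns to
# predict from text.
wav, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = np.log(np.clip(mel, 1e-5, None))
print(log_mel.shape)  # (80 mel bins, number of frames)
```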

Neural Vocoder:

  • The input to this final step is the Mel-spectrogram, which the neural vocoder translates into a waveform.
  • While there are many different types of neural vocoders, most modern ones are GAN-based (see the sketch below).
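
To build intuition for the vocoder’s job, the sketch below inverts a mel-spectrogram back to audio with the classic, non-neural Griffin-Lim algorithm in librosa; neural vocoders such as HiFi-GAN replace this step with a learned model and much higher quality. Again, "sample.wav" and the settings are assumptions for illustration.

```python
# pip install librosa soundfile
import librosa
import soundfile as sf

# Compute a mel-spectrogram, then reconstruct a waveform from it using
# Griffin-Lim (a rough stand-in for what a neural vocoder does).
wav, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
reconstructed = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256
)
sf.write("reconstructed.wav", reconstructed, sr)  # expect audible artifacts
```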

Next Generation End-to-End Text to Wave Model:

The most recent papers in audio TTS are heading in this direction: a single model that does not output Mel-spectrograms to feed a neural vocoder, but generates the waveform directly.

Source: Image By Author (Next Generation End-to-End Architecture)

Overview:

  • Directly predicts the output waveform without generating intermediate representations or Mel-spectrograms, thereby removing the need for a neural vocoder.
  • This dramatically simplifies the model architecture and the training required, with the objective of fast waveform generation.
  • At the time of writing, ESPnet only supports the Conditional Variational Autoencoder with Adversarial Learning (VITS) for this approach.

How?

  • By training a DNN model to predict a sequence of waveform blocks (the 1-D target signal cut into non-overlapping segments) instead of the entire waveform, as the 2-stage process does.
  • The increased speed comes from block-autoregressive waveform generation: each step generates a new block in parallel, instead of the sample-by-sample autoregression of traditional neural vocoders like WaveRNN.

Enough Talk Just Demo!

We will review the single-speaker English Text To Speech example using the widely used LJSpeech (female US speaker) pretrained model.

There are other languages, such as Mandarin Chinese, Japanese, etc., that you can use. I’d recommend checking out Hugging Face to find more ESPnet pretrained models if you are looking for something fast and easy.

You can follow along through the Google Colab ESPnet TTS Demo or locally. If you want to run locally, ensure that you have a CUDA-compatible system.

Step 1: Installation

Install from the terminal, or through a Jupyter notebook by prefixing the commands with (!).
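
The original install gist isn’t embedded here; a minimal set of packages that matches this demo, assuming pip and a recent PyTorch are already available (as on Colab), looks roughly like this. Check the ESPnet documentation for currently pinned versions.

```
# In a notebook cell (drop the leading "!" when installing from a terminal):
!pip install -q espnet espnet_model_zoo
!pip install -q parallel_wavegan  # only needed for the 2-stage neural vocoders
```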

Step 2: Download a Pre-Trained Acoustic Model and Neural Vocoder

Experimentation! (This is the fun part)

Next Generation End-to-End (Text2wav) Model:

Remember this is the latest and greatest! You don't need a Neural Vocoder for this algorithm.

  • Conditional Variational Autoencoder with Adversarial Learning (VITS)

Mainstream 2-Stage (Text2mel) Models:

Recall that these are the sturdy acoustic models that output Mel-spectrograms.

  • Tacotron2
  • Transformer-TTS
  • (Conformer) FastSpeech
  • (Conformer) FastSpeech2

Neural Vocoder:

These take the Mel-spectrograms and decode them into waveforms (audio).

  • Parallel WaveGAN
  • Multi-band MelGAN
  • HiFiGAN
  • StyleMelGAN

The snippet below selects pretrained models by tag; replace the tag with the pretrained model you wish to run. In Part 2 of this series we’ll explain in depth what all of this acoustic-algorithm mumbo jumbo actually means.
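
A hedged example of what that tag selection can look like with espnet_model_zoo naming; the tags below are ones I believe are published for LJSpeech, but browse the model zoo / Hugging Face listings for the full, current set.

```python
# Acoustic model (or end-to-end text2wav) tag -- swap to experiment.
tag = "kan-bayashi/ljspeech_vits"            # text2wav: no vocoder needed
# tag = "kan-bayashi/ljspeech_tacotron2"     # text2mel: pair with a vocoder

# Neural vocoder tag (ignored for text2wav models like VITS).
vocoder_tag = "none"
# vocoder_tag = "parallel_wavegan/ljspeech_parallel_wavegan.v1"
```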

Step 3: Model Setup

Initialize your ESPnet model with the selected pretrained acoustic model and neural vocoder (if one is needed). There are hyperparameters to tune for some acoustic algorithms, but we’ll get more into that in the next post; for now, use the defaults.
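
A minimal setup sketch, continuing from the tag variables above and using ESPnet2’s Text2Speech inference interface; the device-selection logic is my own assumption.

```python
import torch
from espnet2.bin.tts_inference import Text2Speech

# Build the synthesizer from the chosen tags, with default hyperparameters.
text2speech = Text2Speech.from_pretrained(
    model_tag=tag,
    vocoder_tag=None if vocoder_tag == "none" else vocoder_tag,
    device="cuda" if torch.cuda.is_available() else "cpu",
)
```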

Important input notes and caveats:

  • The pretrained language/model you choose and the input text you provide determine the quality of the output audio.
  • If you use words the pretrained model hasn’t been trained on, you will get subpar results for those words.
  • Adding punctuation such as ? and ! gives the speech context, which increases realism.
  • You can add commas to introduce a “natural pause” in the speech.
  • While there isn’t a hard limit on input length, your RTF will degrade with a massive text block; it’s better to work in reasonably sized chunks.

Step 4: Speech Synthesis

Source: Giphy

Hopefully, this part speaks for itself: simply pass in whatever text you wish to transform into beautiful audio!

Finally, you’ve made it! The Real-Time Factor (RTF) reported during synthesis is a speed metric rather than a quality score: it is the time spent synthesizing divided by the duration of the audio produced, so lower is better, and values below 1 mean the model runs faster than real time.
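
A sketch of the synthesis step and the RTF calculation, continuing from the text2speech object above; the example sentence and output file name are my own.

```python
import time

import soundfile as sf
import torch

text = "Hello Towards Data Science, this sentence was synthesized by ESPnet!"

start = time.time()
with torch.no_grad():
    wav = text2speech(text)["wav"]
elapsed = time.time() - start

# Real-time factor: synthesis time / duration of generated audio (lower is better).
duration = len(wav) / text2speech.fs
print(f"RTF = {elapsed / duration:.3f}")

# Save the waveform to disk so you can listen and compare models.
sf.write("synthesized.wav", wav.view(-1).cpu().numpy(), text2speech.fs)
```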

Every domain of machine learning requires experimentation in some form or fashion, and audio is definitely no exception. I will say, though, that it’s more fun when the experimenting involves listening to and comparing what different generations of algorithms produce.

Again credit goes to the authors of ESPnet along with Shinji Watanabe.

ESPnet TTS Single Speaker LJSpeech Full Demo

The assumption is that you have already installed ESPnet and its required dependencies. Remember to use a machine that meets the system requirements!
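
The embedded full-demo gist isn’t reproduced here; a compact end-to-end sketch under the same assumptions as the step-by-step snippets above (LJSpeech VITS tag, packages from Step 1) would look roughly like this.

```python
import soundfile as sf
import torch
from espnet2.bin.tts_inference import Text2Speech

# Single-speaker English LJSpeech, end-to-end VITS (no separate vocoder).
text2speech = Text2Speech.from_pretrained(
    model_tag="kan-bayashi/ljspeech_vits",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

with torch.no_grad():
    wav = text2speech("Text to speech is easier than I expected.")["wav"]

sf.write("demo.wav", wav.view(-1).cpu().numpy(), text2speech.fs)
```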

Source: Gladiator from Giphy

What's Next?

While we were able to generate target audio output, my guess is you aren’t 100% satisfied. In future posts I’ll cover TTS foundational knowledge, Automatic Speech Recognition (ASR), speech enhancement, and training on our own custom voice, demonstrating the extraordinary capability of this architecture and how easily it can be deployed to support a host of awesome audio machine learning use cases.
