Speech Synthesis as a Service

MLaaS Part 2: Speaker on the wall, who’s got the best voice of them all?

Sebastian Kwiatkowski
Towards Data Science

Natural-sounding robotic voices

With the increasing performance of text-to-speech systems, the term “robotic voice” is likely to be redefined soon.

One improvement at a time, we will come to think of speech synthesis as a complement to, and occasionally a competitor of, human voice-over talents and announcers.

The publications describing WaveNet[1], Tacotron[2], DeepVoice[3] and other systems are important milestones on the way to passing acoustic forms of the Turing test.

Training a speech synthesizer, however, can still be a time-consuming, resource-intensive and, sometimes, outright frustrating task. The issues and demos published in GitHub repositories focused on replicating research results bear testimony to this fact.

In contrast, all of the cloud computing platforms covered in this series — Amazon Web Services, Google Cloud, Microsoft Azure and IBM Watson — make the conversion of text to speech available as a simple service call.

This opens up exciting opportunities to rapidly develop engaging conversational applications with increasingly flexible and natural-sounding voices.

This article provides a list of use cases, an introduction to the Speech Synthesis Markup Language (SSML) and a comparison of four services that includes sample output, sample code and the results of a small study.

What can I do with speech synthesis?

If your goal is to convert a small or moderately long text to speech on a one-off basis, there is no technology available at this point that competes with the work performed by a voice-over talent in a recording studio.

If, on the other hand, you are looking to repeatedly create recordings of large quantities of personalized text, speech synthesis is almost certainly the right choice.

Here are ten applications that text-to-speech services are uniquely suited for:

  • Contact centers: Unless your customers have truly unique and highly specific inquiries, providing automated support may be a viable option. Instant replies, enhanced by some degree of flexibility — starting with a correct pronunciation of the customer’s name — can help with acquisition and retention efforts.
  • Creating language learning material: A native speaker is not always available to demonstrate the correct pronunciation. Speech synthesis can be combined with methods that track a learner’s interests and progress to generate personalized audio content.
  • Multilingual content: Some organizations use speech synthesis to create and routinely update training material in multiple languages for their global workforce.
  • Content re-purposing: Text-to-speech expands the range of opportunities for content creators. With more natural-sounding speech, articles can reach those who enjoy listening to audio books and podcasts while commuting to work or exercising in the gym. Combined with visual content, it may also open the door to more cost-effective video marketing.
  • Agile video (re)production: Scripts for animated content, such as explainer videos, evolve over time as project members come up with new ideas and clients request last-minute changes. Text-to-speech services can generate speech that is always in line with the latest version of the script. At the end of the project, the final script can then be replaced by a professional human-made recording.
  • Reminders: A popular feature of virtual assistant products is their ability to set reminders. Computer-generated speech can wake you up, help with habit formation and keep those memories of to-do list items that are at risk of slipping away front and center. There is a lot of room for personalization here. Do you prefer to be woken up by a soft and calm voice? Or would you like to start your day with an energizing motivational quote?
  • Artificial announcers: FM Wakayama, a privately funded non-profit organization that produces a community broadcast, has developed an artificial announcer for weather forecasts, general news and emergency information. When a typhoon hit the city of Wakayama in September of 2017, this cloud-based system continued to reliably and cost-effectively provide disaster information throughout the day.
  • Smart home devices: Whether it’s the robot vacuum cleaner that lets you know where it got stuck, the refrigerator asking you to confirm an automatically generated grocery list or the security system notifying you of an intrusion: speech synthesis gives smart home devices a voice.
  • Synchronization: Timing information is a by-product of the speech synthesis process. Speech marks describe where the utterance of a word, or a related action, starts and ends within a given audio stream. In gaming, this information helps bring characters to life by synchronizing their facial expressions with the content of their speech.
  • Testing: Speech synthesis as a service brings efficient split-testing to the realm of voice-based applications. The range of possible experiments goes well beyond the spoken content. Using SSML, experimenters can try different voices, expressions and prosody settings.

SSML

The input to a speech synthesis service is provided either as raw text or in the form of a Speech Synthesis Markup Language (SSML) document.

An SSML document defines what to say and how to say it.

Since all four of the services that we’ll discuss use this markup language, I thought it would be helpful to provide an introduction focused on the most commonly used elements and attributes.

Root element

The name of the root element in an SSML document is speak.

A minimal “Hello, world!”-style example looks like this:

“Hello, world!” in SSML
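
A minimal sketch of such a document, spelling out the attributes that the specification requires:

    <speak version="1.0"
           xmlns="http://www.w3.org/2001/10/synthesis"
           xml:lang="en-US">
      Hello, world!
    </speak>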

Note that version and xml:lang are sometimes left out, even though the specification states that they are required attributes.

Interpretations

The say-as element defines how the enclosed text should be interpreted. A speech synthesizer supporting the SSML standard offers different settings for the interpret-as attribute of this element.

For example, to spell out strings that would otherwise be read as words, a service can support the spell-out interpretation.

Instructing the speech synthesizer to spell out Google’s ticker symbol
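
One way this might look, using GOOG as the ticker symbol:

    <speak>
      The stock <say-as interpret-as="spell-out">GOOG</say-as> closed higher today.
    </speak>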

Text structure

The optional elements p, s and w are used to define paragraphs, sentences and words, respectively.

The xml:lang attribute is inherited by child elements. Suppose, for example, that you wanted to create a resource for US-English speakers learning German. In that case, en-US is applied to the root element, while de-DE is set for individual elements within the document.

Text structuring and the use of multiple languages in SSML
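
A sketch along these lines; the sentence and the German vocabulary item are illustrative:

    <speak version="1.0" xml:lang="en-US">
      <p>
        <s>The German word for squirrel is <w xml:lang="de-DE">Eichhörnchen</w>.</s>
      </p>
    </speak>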

Break

Occasionally, silence can be the most engaging content.

In SSML, speech is paused with the break element. The length of the pause is determined by the time attribute.

Creating a moderately dramatic pause in SSML
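
A two-second pause, for instance, could be requested like this:

    <speak>
      And the winner is <break time="2s"/> Netta!
    </speak>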

Note that synthesizers may automatically add pauses after sentences or paragraphs.

Emphasis

The emphasis element delivers what the name promises. The optional level attribute can be set to one of three possible values: reduced, moderate or strong.

Emphasizing words in SSML
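
A short sketch:

    <speak>
      I <emphasis level="strong">really</emphasis> mean it.
    </speak>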

Substitutions

Words enclosed in the sub element are substituted with different words in the speech output.

Common use cases include abbreviations and symbols.

The chemical symbol Fr is substituted with the name “Francium”.
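
Using the alias attribute, this substitution might be written as follows:

    <speak>
      <sub alias="Francium">Fr</sub> is a highly radioactive alkali metal.
    </speak>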

Prosody

The prosody element controls the pitch, speaking rate and volume of the synthesized speech.

The volume can be set in terms of predefined levels or changes relative to the current volume:

It’s getting loud in here.
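
A sketch with one predefined level and one relative change in decibels:

    <speak>
      <prosody volume="loud">It's getting loud in here.</prosody>
      <prosody volume="+6dB">And louder still.</prosody>
    </speak>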

The speech rate can be specified through predefined levels or as a percentage of the default rate.

Setting the speech rate with pre-defined values and relative to the default rate.
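
For example:

    <speak>
      <prosody rate="slow">This sentence is read slowly.</prosody>
      <prosody rate="80%">This one runs at 80% of the default rate.</prosody>
    </speak>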

Analogously, there are presets and percentage-wise settings for the pitch:

Still waiting for the neural singing feature …
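
And a sketch for the pitch:

    <speak>
      <prosody pitch="x-high">Still waiting</prosody>
      <prosody pitch="-20%">for the neural singing feature.</prosody>
    </speak>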

Markers

The mark tag is a self-closing tag that requires a name attribute.

Its only purpose is to place a marker inside the SSML document. It does not affect the output of the speech synthesis process.

Markers can be used to retrieve timing information for specific positions in the document.

The self-closing mark tag does not affect the output.
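
For example, a marker could be placed right before the answer is spoken:

    <speak>
      The answer is <mark name="answer"/> forty-two.
    </speak>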

Who’s got the best voice of them all?

In this section, we take a look at four speech synthesis services: Amazon Polly, Google Cloud Text-to-Speech, Microsoft’s Cognitive Services Text to Speech and IBM Watson Text to Speech.

To obtain subjective ratings for the speech generated by these services, a small study was conducted. Three participants (two female, one male) listened to excerpts of Wikinews articles converted to MP3 files.

The five articles cover a car smashing into a dental office[4], the venues of the 2026 FIFA World Cup[5], Microsoft’s plan to acquire Github[6], a meeting between the Korean leaders[7] and Netta’s victory at the Eurovision Song Contest[8].

Blind to the names of the services and voices, the participants were presented with one speech output at a time in random order. They were instructed to rate how natural and how pleasant the speech sounded on a scale from 1 (worst) to 5 (best).

For each of the four services, one male and one female voice was used. This yielded a total of 120 ratings (three subjects, five articles, four services, two voices per service). The text was sent to the APIs as raw text without any SSML optimization.

Playlists of the generated speech samples were uploaded to SoundCloud after the study had been completed.

The following table provides an overview of the findings:

Amazon Polly

Amazon’s text-to-speech service Polly was announced at the end of 2016.

At the time of writing, it supported 25 languages, ranging from Danish and Australian English to Brazilian Portuguese and Turkish.

A larger number of voices has several benefits. The generated speech can be used for dialogues, represent different personas and achieve a higher degree of localization.

Polly offers a selection of eight different voices for American English. The names of the voices used in the study are Joanna and Matthew.

It should be noted that Amazon has promised not to retire any current or future voices made available through Polly.

In the experiment I ran, Polly achieved the second-highest overall ratings. In terms of the pleasantness of the speech, Amazon’s service edged out Google’s Cloud Text-to-Speech API.

Here is the speech output that was presented to the participants:

Samples generated with the Amazon Polly voices Joanna and Matthew

Speech marks can be obtained for positions specified with a mark element, as well as at the level of words and sentences. In addition, Polly allows you to retrieve visemes that represent the position of the speaker’s face and mouth when saying a word.

The console is a great way to experiment with SSML and get a first impression of the available feature set.

Polly allows 100 requests per second. The input can consist of up to 3,000 billed characters. SSML tags do not count as billed characters. The output is limited to 10 minutes of synthesized speech per request.

The pricing model is simple. During the first twelve months, the first five million characters are on Amazon. Above this tier, requests are billed on a pay-as-you-go basis at $4 per one million characters.

Speech synthesis with Amazon Polly
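
A minimal sketch of such a request with the AWS SDK for Java; it assumes that credentials and region are configured in the default provider chain, and the output file name is illustrative:

    import com.amazonaws.services.polly.AmazonPolly;
    import com.amazonaws.services.polly.AmazonPollyClientBuilder;
    import com.amazonaws.services.polly.model.OutputFormat;
    import com.amazonaws.services.polly.model.SynthesizeSpeechRequest;
    import com.amazonaws.services.polly.model.SynthesizeSpeechResult;

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class PollyExample {
        public static void main(String[] args) throws Exception {
            // Credentials and region come from the default provider chain.
            AmazonPolly polly = AmazonPollyClientBuilder.defaultClient();

            SynthesizeSpeechRequest request = new SynthesizeSpeechRequest()
                    .withText("Hello from Amazon Polly!")
                    .withVoiceId("Joanna")            // one of the US-English voices
                    .withOutputFormat(OutputFormat.Mp3);

            SynthesizeSpeechResult result = polly.synthesizeSpeech(request);

            // The audio is returned as a stream; write it to an MP3 file.
            try (InputStream audio = result.getAudioStream()) {
                Files.copy(audio, Paths.get("polly-output.mp3"));
            }
        }
    }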

Polly supports all SSML tags that we’ve mentioned as well as two extensions: breaths and voice effects.

The self-closing amazon:breath tag instructs the artificial speaker to take a (fairly life-like) breath of a specified length and volume.

Voice effects include whispering, speaking softly and changing the vocal tract length to make the speaker sound bigger or smaller.

Deep voice and heavy breathing: Amazon’s SSML extensions
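
A sketch combining the extensions; the attribute names and values follow Amazon’s documentation:

    <speak>
      <amazon:breath duration="long" volume="x-loud"/>
      <amazon:effect name="whispered">Can you hear me now?</amazon:effect>
      <amazon:effect vocal-tract-length="+15%">This voice sounds noticeably bigger.</amazon:effect>
    </speak>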

Google Cloud Text to Speech

Cited more than 250 times, the WaveNet paper[1] published by researchers from Google DeepMind is an important milestone in the recent history of speech synthesis.

The GitHub repositories that sprang up to replicate the results achieved by the DeepMind researchers have been starred and forked thousands of times.[9, 10, 11]

In a study described in that paper, subjects were asked to rate the naturalness of speech generated with WaveNet, actual human speech and output of two competing models. On the same scale from 1 to 5 that was used in the study reported in this article, the mean opinion score was 4.2 for the WaveNet samples, 4.5 for the human speech and less than 4 for the competing models.

Last November, Google finally released the alpha version of its long-awaited Cloud Text-to-Speech service. At the time of writing, the service is in the beta stage and “not intended for real-time usage in critical applications”.

The service offers WaveNet-based speech synthesis and what Google refers to as standard voices or non-WaveNet voices.

The six available WaveNet voices are in US English. According to the documentation, these are the same voices that are used in Google Assistant, Google Search, and Google Translate.

The 28 standard voices cover several European languages and include a few female voices for Asian markets.

In contrast to the other services, the voices have technical identifiers rather than memorable names. The two voices I’ve used, for example, are referred to as en-US-Wavenet-A and en-US-Wavenet-C.

This is the playlist for the output used in the experiment:

Samples generated with the Google Cloud Text-to-Speech voices en-US-Wavenet-A and en-US-Wavenet-C

My own results are comparable to those reported in the WaveNet paper. Among the four competitors, Google’s service achieved the highest naturalness score and the best overall ratings.

If natural sound is your primary concern, then this is most likely the right choice for you.

It should, however, be pointed out that both Amazon Web Services and IBM Watson offer more features. Neither timing information nor SSML extensions are supported by Google Cloud Text-to-Speech.

The premium price for the WaveNet functionality is set at $16 per one million characters for requests in excess of the first one million characters covered by the free tier.

Four million characters per month can be synthesized with the standard voices at no cost. Subsequent requests set you back $4 for every one million characters.

In addition to limits of 300 requests per minute and 5,000 characters per request, there is a quota of 150,000 characters per minute.

If you decide to use the Java SDK, make sure to import from the package v1beta1 in the namespace com.google.cloud.texttospeech (and not from the v1 package).

Speech synthesis with Google Cloud Text to Speech
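
A minimal sketch against the v1beta1 Java client mentioned above; it assumes that the GOOGLE_APPLICATION_CREDENTIALS environment variable points to a service account key:

    import com.google.cloud.texttospeech.v1beta1.AudioConfig;
    import com.google.cloud.texttospeech.v1beta1.AudioEncoding;
    import com.google.cloud.texttospeech.v1beta1.SynthesisInput;
    import com.google.cloud.texttospeech.v1beta1.SynthesizeSpeechResponse;
    import com.google.cloud.texttospeech.v1beta1.TextToSpeechClient;
    import com.google.cloud.texttospeech.v1beta1.VoiceSelectionParams;

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class GoogleTtsExample {
        public static void main(String[] args) throws Exception {
            try (TextToSpeechClient client = TextToSpeechClient.create()) {
                SynthesisInput input = SynthesisInput.newBuilder()
                        .setText("Hello from Google Cloud Text-to-Speech!")
                        .build();

                VoiceSelectionParams voice = VoiceSelectionParams.newBuilder()
                        .setLanguageCode("en-US")
                        .setName("en-US-Wavenet-A")   // one of the WaveNet voices
                        .build();

                AudioConfig audioConfig = AudioConfig.newBuilder()
                        .setAudioEncoding(AudioEncoding.MP3)
                        .build();

                SynthesizeSpeechResponse response =
                        client.synthesizeSpeech(input, voice, audioConfig);

                // The audio is returned as bytes; write them to an MP3 file.
                Files.write(Paths.get("google-output.mp3"),
                        response.getAudioContent().toByteArray());
            }
        }
    }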

Microsoft Cognitive Services Text to Speech

Microsoft’s Cognitive Services Text To Speech is currently available as a preview. The greatest strength of this service is the degree of localization that it offers.

The 80 voices that are available across 32 languages cover an unparalleled range of European and Asian locales.

At this point, however, there is a clear trade-off between quantity and quality. The output generated with the two voices ZiraRUS and BenjaminRUS received the worst ratings in the experiment: 3.2 for naturalness and 3.33 for pleasantness.

The samples that were generated for the experiment can be accessed through the following playlist:

Samples generated with the Microsoft Cognitive Text To Speech voices ZiraRUS and BenjaminRUS

Microsoft’s customization feature creates a unique voice model using studio recordings and associated scripts as training data. This feature is currently in private preview and limited to US-English and mainland Chinese.

The free tier covers five million characters per month. In the S1 tier, the price per one million characters synthesized with the default voices is $2. Synthesis with custom voice models is available at a price of $3 per one million characters, plus a $20 monthly fee per model.

A console appears to be available only for its precursor, the Bing text-to-speech API.

The service supports version 1.0 of SSML without extensions and limits the input to 1,024 characters per request, a fraction of the length of a news article.

The only official Java library that exists is used for Android development. Interacting with the REST API, however, is a straightforward two-step process. The client first obtains a token by providing the subscription key. This token — which is valid for 10 minutes — is then used to obtain the synthesized speech from the API. Note that voices are specified inside the SSML document using the voice tag.

Speech synthesis with Microsoft Cognitive Services Text to Speech
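
A sketch of the two-step process with plain HttpURLConnection; the endpoints shown are for the westus region and the long voice name follows the format the service used at the time of writing, so treat both as assumptions:

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Scanner;

    public class MicrosoftTtsExample {

        // Region-specific endpoints; westus is assumed here.
        private static final String TOKEN_URL =
                "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken";
        private static final String TTS_URL =
                "https://westus.tts.speech.microsoft.com/cognitiveservices/v1";

        public static void main(String[] args) throws Exception {
            String subscriptionKey = System.getenv("AZURE_TTS_KEY");

            // Step 1: exchange the subscription key for a token (valid for 10 minutes).
            HttpURLConnection tokenConn =
                    (HttpURLConnection) new URL(TOKEN_URL).openConnection();
            tokenConn.setRequestMethod("POST");
            tokenConn.setRequestProperty("Ocp-Apim-Subscription-Key", subscriptionKey);
            tokenConn.setDoOutput(true);
            tokenConn.getOutputStream().close();  // empty request body
            String token;
            try (Scanner s = new Scanner(tokenConn.getInputStream(), "UTF-8")) {
                token = s.useDelimiter("\\A").next();
            }

            // Step 2: send the SSML document; the voice is chosen via the voice tag.
            String ssml = "<speak version='1.0' xml:lang='en-US'>"
                    + "<voice name='Microsoft Server Speech Text to Speech Voice"
                    + " (en-US, ZiraRUS)'>Hello from Microsoft!</voice></speak>";

            HttpURLConnection ttsConn =
                    (HttpURLConnection) new URL(TTS_URL).openConnection();
            ttsConn.setRequestMethod("POST");
            ttsConn.setRequestProperty("Authorization", "Bearer " + token);
            ttsConn.setRequestProperty("Content-Type", "application/ssml+xml");
            ttsConn.setRequestProperty("X-Microsoft-OutputFormat",
                    "audio-16khz-128kbitrate-mono-mp3");
            ttsConn.setDoOutput(true);
            try (OutputStream out = ttsConn.getOutputStream()) {
                out.write(ssml.getBytes(StandardCharsets.UTF_8));
            }
            try (InputStream audio = ttsConn.getInputStream()) {
                Files.copy(audio, Paths.get("microsoft-output.mp3"));
            }
        }
    }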

Watson Text to Speech

IBM has introduced two interesting SSML extensions for its Watson Text to Speech service: Expressive SSML and Voice Transformation SSML.

The first extension is available for the US English voice Allison and implemented through the express-as element. The tag has a type attribute with three possible self-descriptive settings: GoodNews, Apology and Uncertainty.

Expressive SSML in Watson Text To Speech
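
A sketch of the element in use:

    <speak version="1.0">
      <express-as type="GoodNews">
        Your package has shipped and will arrive tomorrow!
      </express-as>
    </speak>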

One can easily see how Expressive SSML enhances customer support solutions and other applications aiming at life-like conversations.

While Watson Text to Speech comes only with support for 13 voices across 7 languages out of the box, the second SSML extension enables the creation of new voices.

Going beyond the benefits of a broad range of default voices that are in general use, unique voices can enhance branding efforts through a memorable and differentiated user experience.

Using the voice transformation element, customers can apply built-in transformations or define their own changes to create new voices based on the three existing US English alternatives.

Using the values Young and Soft for the type attribute, the sound of the three existing voices can be made more youthful and softer.

To apply custom transformations, the type attribute must be set to Custom. This provides a fine-grained control over different aspects of the voice through optional attributes. Adjustable voice characteristics include the pitch, rate, timbre, breathiness and glottal tension.
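
A sketch showing one built-in and one custom transformation; the attribute names and percentage values follow IBM’s documentation as best I can tell and should be double-checked against the current reference:

    <speak version="1.0">
      <voice-transformation type="Young" strength="80%">
        A more youthful version of the voice.
      </voice-transformation>
      <voice-transformation type="Custom" pitch="-20%" breathiness="35%" glottal_tension="-40%">
        A custom variation of the same voice.
      </voice-transformation>
    </speak>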

In the experiments I’ve conducted, Watson Text to Speech performed slightly better than Microsoft’s service, but did not achieve the level of naturalness and pleasantness that Amazon and Google provide.

The names of the voices that have been used in the experiment are Allison and Michael. The generated samples rated by the participants are available through the following playlist:

Samples with the IBM Watson voices Allison and Michael

With the exception of the w tag, all of the SSML elements we’ve mentioned are supported. For languages other than US English, however, the say-as instruction is limited to only two types of interpretations: digits and letters.

Timing information can be obtained for words and markers.

The Lite plan is restricted to 10,000 characters. Under the Standard tier, the first one million characters are synthesized free of charge. Subsequent requests are charged at a rate of $0.02 per 1,000 characters, making Watson Text to Speech the most expensive among the four services.

A web demo showcases the basic functionality and the SSML extensions.

While the body of a single request can have, at most, 5,000 characters, there is no limit on the number of requests sent per minute.

The Java SDK works seamlessly and intuitively:

Speech synthesis with IBM Watson
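
A minimal sketch with the Watson Java SDK as it looked around version 6; the credentials are placeholders:

    import com.ibm.watson.developer_cloud.text_to_speech.v1.TextToSpeech;
    import com.ibm.watson.developer_cloud.text_to_speech.v1.model.SynthesizeOptions;

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class WatsonTtsExample {
        public static void main(String[] args) throws Exception {
            // Credentials from your service instance (placeholders here).
            TextToSpeech service = new TextToSpeech();
            service.setUsernameAndPassword("{username}", "{password}");

            SynthesizeOptions options = new SynthesizeOptions.Builder()
                    .text("Hello from Watson Text to Speech!")
                    .voice("en-US_AllisonVoice")
                    .accept("audio/mp3")
                    .build();

            // The SDK returns the audio as a stream; write it to an MP3 file.
            try (InputStream audio = service.synthesize(options).execute()) {
                Files.copy(audio, Paths.get("watson-output.mp3"));
            }
        }
    }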

Conclusion

  • A series of papers have described new machine learning approaches that significantly reduce the gap between machine-generated and human speech.
  • Speech synthesis services make use of these methods and offer an alternative to the resource-intensive process of training customized models.
  • Speech Synthesis as a Service speeds up the development of flexible voice-based applications and makes it easier and more cost-efficient to create, test and re-purpose multilingual content.
  • The input to speech synthesis services is provided as raw text or in the Speech Synthesis Markup Language (SSML) format. SSML documents define what to say and how to say it.
  • Google Cloud Text-to-Speech has a limited feature set, but achieved the highest naturalness ratings and the best overall subjective ratings.
  • Amazon Polly outperformed its competitors with regard to the pleasantness of its speech and received the second-best overall ratings.
  • Watson Text to Speech and Amazon Polly provide rich feature sets, including useful SSML extensions and timing information.
  • Microsoft Cognitive Services Text to Speech offers the broadest range of voices, but received the worst subjective ratings.

Thank you for reading! If you’ve enjoyed this article, hit the clap button and follow me to learn more about machine learning services in the cloud.

Also, let me know if you have a project in this space that you would like to discuss.

References

[1] Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K., 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

[2] Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S. and Le, Q., 2017. Tacotron: Towards End-to-End Speech Synthesis. arXiv preprint arXiv:1703.10135.

[3] Ping, W., Peng, K., Gibiansky, A., Arik, S., Kannan, A., Narang, S., Raiman, J. and Miller, J., 2018. Deep voice 3: Scaling text-to-speech with convolutional sequence learning. In Proc. 6th International Conference on Learning Representations.

[4] Wikinews contributors. Airborne sedan smashes into dental office in Santa Ana, California. In Wikinews, The free news source you can write.

[5] Wikinews contributors. Football: Canada, Mexico and US wins joint bid to host 2026 FIFA World Cup. In Wikinews, The free news source you can write.

[6] Wikinews contributors. Microsoft announces plan to acquire GitHub for US$7.5 billion. In Wikinews, The free news source you can write.

[7] Wikinews contributors. Korean leaders Moon and Kim meet days after NK-US summit cancellation. In Wikinews, The free news source you can write.

[8] Wikinews contributors. Netta wins Eurovision Song Contest for Israel. In Wikinews, The free news source you can write.

[9] https://github.com/ibab/tensorflow-wavenet

[10] https://github.com/tomlepaine/fast-wavenet

[11] https://github.com/basveeling/wavenet
