Can you hear me now? Far-field voice

In my previous post, I made the case that successful AI companies will develop a moat within the “computational speech” value chain by either creating a network effect around data or developing proprietary algorithms. In this post, I will examine the first step in that value chain, the initial pickup of voice signals, and identify opportunities where startups can succeed.

Jerry Lu · Towards Data Science · 9 min read · Aug 1, 2017

Zooming in on voice

Voice control is an intuitive method of interacting with hardware and associated services. It’s far more natural than connecting up a keyboard and mouse, tapping an on-screen keyboard, or clicking a remote control.

As we see more voice-activated devices, it is important to understand how a device is capable of listening to and understanding voice commands. There are many components in this process, but the two we will examine are the microphone array (the hardware) and the deep learning architecture (the software) that make such a complex system work.

Hardware | Microphone arrays

Speech recognition systems often use multiple microphones to reduce the impact of reverberation and noise. With each generation of the iPhone, the number of microphones has grown, from one in the original iPhone to three in the iPhone 5 and four in the iPhone 6s.

Even smart speaker devices like the Amazon Echo use up to seven microphones. The Echo's mics are arranged in a hexagonal layout, with one microphone at each vertex and one in the center. The delay between each microphone receiving the signal enables the device to identify the source of the voice and cancel out noise coming from other directions, a technique known as beamforming.
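To make beamforming concrete, here is a minimal delay-and-sum beamformer sketched in NumPy. It assumes a simple linear array and far-field (plane-wave) propagation; the array geometry, sample rate, and steering angle are illustrative and are not the Echo's actual design.

```python
# Minimal delay-and-sum beamformer sketch (NumPy only).
# Assumes a uniform linear array and a far-field source; the geometry,
# sample rate, and steering angle below are illustrative.
import numpy as np

def delay_and_sum(signals, mic_positions, angle_rad, fs, c=343.0):
    """Steer a microphone array toward `angle_rad` by delaying and averaging.

    signals: (num_mics, num_samples) array of simultaneous recordings
    mic_positions: (num_mics,) mic x-coordinates in meters (linear array)
    angle_rad: look direction relative to the array axis
    fs: sample rate in Hz; c: speed of sound in m/s
    """
    num_mics, num_samples = signals.shape
    # Per-mic delay (in samples) for a plane wave arriving from angle_rad.
    delays = mic_positions * np.cos(angle_rad) / c * fs
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    # Apply the fractional delays as phase shifts in the frequency domain.
    phase = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None] / fs)
    aligned = np.fft.irfft(spectra * phase, n=num_samples, axis=1)
    return aligned.mean(axis=0)  # constructive sum in the look direction

# Example: a 4-mic linear array with 5 cm spacing, steered to 60 degrees.
fs = 16000
mics = np.arange(4) * 0.05
noisy = np.random.randn(4, fs)          # stand-in for real recordings
enhanced = delay_and_sum(noisy, mics, np.deg2rad(60), fs)
```

Signals aligned to the look direction add up constructively, while sound arriving from other directions partially cancels, which is what lets the array "focus" on the talker.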

Software | Deep learning

Deep learning has played a fundamental role in voice pickup. Automatic speech recognition has been around for decades, but learning-based techniques like deep neural networks (DNNs) have recently allowed it to equal or surpass human performance in many test cases.

AI Progress Measurement | Electronic Frontier Foundation: https://www.eff.org/ai/metrics

Only a strong combination of hardware and voice recognition algorithms can lead to product success. With bad microphones, recognition accuracy degrades no matter how intelligent the deep learning model employed. On the flip side, excellent microphones paired with sub-optimal machine learning do not produce the necessary accuracy either.

The problem of far-field voice pickup

While state-of-the-art speech recognition systems perform reasonably well in close-talking microphone conditions, performance degrades in conditions where the microphone is far from the user.

Imagine a common scenario in which one person is indoors, speaking to an Amazon Echo.

The audio captured by the Echo will be influenced by 1) the speaker’s voice reflecting off the walls of the room, 2) background noise coming in from outside, 3) the acoustic echo from the device’s own loudspeaker, and 4) that output audio also reflecting off the walls of the room.

These factors all contribute to low signal-to-noise ratio (SNR), room reverberation, and unknown directions of speech and noise, all important challenges that need to be addressed. As the user moves farther away from the product’s microphones, the speech level decreases while the background noise level remains the same. Beyond noise and reverberation, other challenges include a lack of large-scale far-field training data and a shortage of efficient deep learning architectures designed for this setting. The bottom line is that there is still a huge gap between machine and human performance in these far-field scenarios.
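To put a rough number on the SNR penalty: assuming simple free-field (inverse-square) propagation and a constant noise floor, each doubling of the talker’s distance costs about 6 dB of speech level, and therefore of SNR. The levels and distances below are made up for illustration.

```python
# Rough SNR-vs-distance estimate under a free-field (inverse-square) assumption.
import math

def snr_db(speech_db_at_ref, noise_db, distance_m, ref_m=0.5):
    # Speech level falls ~6 dB per doubling of distance; the noise floor is fixed.
    speech_db = speech_db_at_ref - 20 * math.log10(distance_m / ref_m)
    return speech_db - noise_db

for d in (0.5, 1, 2, 4):
    print(f"{d:>4} m: SNR ~ {snr_db(65, 45, d):.1f} dB")
# 0.5 m: ~20 dB, 1 m: ~14 dB, 2 m: ~8 dB, 4 m: ~2 dB
```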

How does far-field voice pickup work?

Far-field speech recognition systems typically chain together several separate modules:

  1. The audio input is captured by an acoustic sensor.
  2. The sensor converts the acoustic signal into an electrical signal and then into a digital one.
  3. The digital signal goes to a digital signal processing (DSP) chip, where speech enhancement is applied with fixed embedded algorithms. These embedded algorithms perform traditional signal processing techniques: source localization (locating the direction of the sound) and beamforming (suppressing background noise); a minimal sketch of one such technique follows this list.
  4. The resulting enhanced signal goes to a conventional acoustic model for speech recognition.
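As a sketch of what the fixed DSP stage in step 3 does, the snippet below implements GCC-PHAT, the classic cross-correlation method for estimating the time delay of arrival of a sound between two microphones, which is the building block of source localization. The sample rate and the synthetic test signal are purely illustrative.

```python
# Sketch of the "fixed DSP" stage: GCC-PHAT time-delay estimation between two
# microphones, the classic building block for source localization (step 3).
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time delay of `sig` relative to `ref` using GCC-PHAT."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    # Phase transform: keep only phase information to sharpen the peak.
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)  # delay in seconds

# Example: the same noise burst arriving 10 samples later at mic 2.
fs = 16000
burst = np.random.randn(fs)
mic1 = burst
mic2 = np.concatenate((np.zeros(10), burst[:-10]))
tau = gcc_phat(mic2, mic1, fs)          # ~ 10 / 16000 s
print(f"estimated delay: {tau * 1e6:.0f} microseconds")
```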

Microphone technology aims to replace fixed embedded algorithms with a deep-learning-based trainable algorithm.

The drawback of fixed algorithms and components is that they cannot adapt to the trainable machine learning systems built on top of them. When you put a trainable deep learning system on top of a set of fixed chips, the deep learning model has to learn what the embedded algorithm is doing, undo it, and then perform its own computation. This complicates things for far-field speech recognition, because the audio input is already distorted, and information is lost at every step: first when the signal is converted from acoustic to electrical to digital, and again during the fixed pre-processing.

Google has been at the forefront of this research, demonstrating the use of raw waveforms taken directly from the acoustic sensors, thereby avoiding the pre-processing (i.e., localization, beamforming) done by the built-in chips of today’s systems.

Google’s neural network adaptive beamforming model architecture

In essence, Google is looking to combine steps 3 and 4 of the process above. The idea is to give the microphone array more degrees of freedom, letting the algorithms be optimized on the data itself. The acoustic sensor only needs to capture the signal without adding too much noise or distortion; the rest of the system can then retrieve the relevant information using trainable deep learning architectures.
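The snippet below is a minimal PyTorch sketch of that idea: multi-channel raw waveforms go straight into a trainable front-end (1-D convolutions acting as a learned spatial filter and filterbank), which feeds a small acoustic model. This illustrates the general approach only; it is not Google’s published architecture, and every layer size is arbitrary.

```python
# Minimal sketch: multi-channel raw waveform in, per-frame class scores out.
# Not Google's architecture; all sizes are illustrative.
import torch
import torch.nn as nn

class RawWaveformASRFrontEnd(nn.Module):
    def __init__(self, num_mics=7, num_filters=64, num_targets=42):
        super().__init__()
        # Learned "spatial filter + filterbank": mixes microphone channels with
        # trainable weights while extracting short-time features from raw audio.
        self.spatial_filter = nn.Conv1d(num_mics, num_filters,
                                        kernel_size=400, stride=160)  # ~25 ms window, 10 ms hop @ 16 kHz
        self.encoder = nn.Sequential(
            nn.ReLU(),
            nn.Conv1d(num_filters, 128, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.acoustic_model = nn.Linear(128, num_targets)  # e.g., phoneme posteriors

    def forward(self, waveforms):
        # waveforms: (batch, num_mics, num_samples) raw audio
        feats = self.spatial_filter(waveforms)       # (batch, filters, frames)
        feats = self.encoder(feats)
        feats = feats.transpose(1, 2)                # (batch, frames, 128)
        return self.acoustic_model(feats)            # per-frame class scores

# One second of 7-channel audio at 16 kHz, batch of 2.
model = RawWaveformASRFrontEnd()
scores = model(torch.randn(2, 7, 16000))
print(scores.shape)  # torch.Size([2, 98, 42])
```

Because the whole stack is differentiable, the front-end’s "beamforming" weights are learned from data jointly with the recognizer, rather than being fixed at the factory.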

Opportunities for startups

Typically, fixed algorithms are developed for general cases based on heuristics, while deep-learning-based algorithms are trained with the actual data itself for that specific task. Microphones that consist of fixed chips with embedded algorithms are probably not what you want when you have a deep-learning-based system.

Given that scope, an ideal hardware setup for any speech recognition application should do minimal digital signal processing on those fixed embedded chips. It should take in raw waveforms and build complicated deep learning algorithms on top that are both trainable and flexible.

The question then becomes whether or not you can collect a lot of training data to train a deep learning algorithm for that specific hardware, or use other deep learning techniques to compensate for the differences in acoustic sensors. The problem today is that deep learning speech recognition models are trained on hundreds of thousands of hours of data that were collected on different hardware.

In order for startups to compete against the big tech giants who own the full design capability, they need to look for opportunities to disrupt the value chain through data, software, or automation.

The hardware space has huge potential, as sophisticated deep learning techniques must be deployed on highly customized, inexpensive hardware.

The battle of voice-enabled hardware (@ziwang’s toys)

All of the major existing data sets have already been exploited, and access to the models trained on them is widespread. Companies that build the infrastructure to gather, annotate, and train new models on data sets that don’t exist yet will succeed.

Take a company like Vesper. They are changing the way microphones are designed by exploiting the physical properties of piezoelectric material (I’ll explain the underlying technology in a future post). As a result, their microphones don’t suffer from dust and environmental degradation the way traditional mics do, so they deliver a much higher-fidelity signal for further processing.

The key to production-level recognition is the extensive use of data, advanced techniques in data augmentation, and model architecture.
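As one hedged example of what far-field data augmentation can look like, the snippet below simulates distant, noisy recordings from clean close-talk speech by convolving it with a room impulse response and mixing in noise at a target SNR. The impulse response here is a crude synthetic decay and the signals are random placeholders, purely for illustration.

```python
# Simulate far-field training data from clean close-talk speech:
# convolve with a room impulse response (RIR) and add noise at a target SNR.
import numpy as np

def simulate_far_field(clean, noise, rir, target_snr_db):
    reverberant = np.convolve(clean, rir)[: len(clean)]   # add room reverberation
    noise = noise[: len(reverberant)]
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so the mixture hits the requested SNR.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (target_snr_db / 10)))
    return reverberant + scale * noise

# Toy example: a synthetic exponentially decaying RIR and white noise.
fs = 16000
clean = np.random.randn(fs)                     # stand-in for clean speech
rir = np.exp(-np.linspace(0, 8, int(0.3 * fs))) * np.random.randn(int(0.3 * fs))
rir /= np.max(np.abs(rir))
noisy_far = simulate_far_field(clean, np.random.randn(fs), rir, target_snr_db=5)
```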

The weakness of fixed algorithms and components is that they cannot adapt to the trainable machine learning systems built on top of them. But if your end-to-end model is like those discussed above, the deep learning model can actually learn which features should be extracted by relating them to the overall goal: the decoded characters or words.

Automation will comprise personalized devices and continuous learning.

For example, with data collected from a particular user, the whole pipeline can be optimized to understand that user better.

But personalization still poses many problems. After a good initial system has been deployed, the deep learning algorithm needs continuous training based on usage, as behavior changes.
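One way to picture that continuous-learning loop: periodically fine-tune only the top layer of a deployed acoustic model on the user’s own recent utterances, keeping the shared front-end frozen. The sketch below could be applied, for instance, to the toy raw-waveform model sketched earlier; the model, head, and user data loader are placeholders, not any particular product’s pipeline.

```python
# Sketch of per-user continuous learning: freeze a deployed model's shared
# front-end and fine-tune only its final classification head on that user's
# recent utterances. `model`, `head`, and `user_batches` are placeholders.
import torch
import torch.nn as nn

def personalize(model, head, user_batches, lr=1e-4, epochs=2):
    """Fine-tune `head` (a sub-module of `model`) on user-specific batches."""
    for p in model.parameters():          # freeze everything...
        p.requires_grad = False
    for p in head.parameters():           # ...except the classification head
        p.requires_grad = True

    optimizer = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for waveforms, frame_labels in user_batches:   # user's recent utterances
            scores = model(waveforms)                   # (batch, frames, classes)
            loss = loss_fn(scores.flatten(0, 1), frame_labels.flatten())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```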

Both excellent hardware and intelligent software are needed, whether improving performance for a user (e.g., using their language, speech, etc.) or providing user-specific solutions (e.g., an assistant with a personalized voice, personality, etc.). Personalized hardware and deep learning systems can bring a competitive advantage in both of these areas.

Baidu is at the forefront of this research with the recent announcement of their Deep Voice 2 system. Constructed entirely from deep neural networks, the system can learn the nuances of a person’s voice from just half an hour of audio and can learn to imitate hundreds of different speakers. This gives machines a new diversity of speech and will go a long way toward bringing personalization and familiarity to voice interfaces.

One interesting personalization opportunity for startups is to be the first to adapt and market their speech recognition systems to other languages. Although a company like Baidu may adapt Deep Voice 2 to a language like Chinese in the near future, such companies are largely ignoring the problem today.

Another interesting domain and verticalized niche for personalization might be accessibility. For example, Ava is using voice recognition software to translate conversations into text for people with hearing impairments. Similarly, an interesting startup in Turkey called WeWalk is building a smart cane for the visually impaired, leveraging Nuance Communications’ speech technology.

Personalization and continuous learning will expedite the voice-first revolution, allowing machines to replicate human-like interaction, smooth out awkward conversations, and enable customization.

Fluent.AI, a company based in Montreal, is tackling improvements in both software and personalization. Their solution tries to bridge the gap directly from audio input to NLP, skipping the intermediate speech-to-text step and extracting intent directly from spoken commands and context. In doing so, their intent recognition system becomes truly personalized, learning user-dependent accents, varying contexts, and acoustic behaviors so it can recognize frequent phrases across different environments.
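To make the "audio in, intent out" idea concrete, here is a toy end-to-end intent classifier that maps raw audio straight to an intent label with no transcript in between. This is a generic illustration of the approach, not Fluent.AI’s system; the intent set, layer sizes, and feature settings are invented.

```python
# Toy "audio in, intent out" classifier: raw waveform -> intent label, with no
# speech-to-text step. Intents and sizes are invented for illustration.
import torch
import torch.nn as nn

INTENTS = ["lights_on", "lights_off", "play_music", "set_timer"]  # hypothetical

class AudioToIntent(nn.Module):
    def __init__(self, num_intents=len(INTENTS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=400, stride=160),  # learned filterbank
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(128, num_intents)

    def forward(self, waveform):
        # waveform: (batch, 1, num_samples)
        feats = self.features(waveform)          # (batch, 128, frames)
        pooled = feats.mean(dim=-1)              # utterance-level embedding
        return self.classifier(pooled)           # intent logits

model = AudioToIntent()
logits = model(torch.randn(2, 1, 16000))              # two 1-second utterances
print(INTENTS[logits.argmax(dim=-1)[0].item()])       # predicted intent (untrained)
```

A real system would of course be trained on labeled spoken commands; the point is simply that the label comes straight from audio, with no transcript in the middle.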

By further collaborating with OEMs and suppliers, the Fluent.AI solution is able to have more control over the data that is collected. Through this partnership, Fluent.AI can offer OEMs a faster, more cost-efficient road to market as well as broader reach than conventional technology.

With all that said, there is plenty of space for a smaller, nimble player to get traction if they focus on one or more of those three key areas — data, software, and automation. Ultimately, startups who are able to own the entire voice stack — both the data collection via the hardware and the trainable deep learning algorithm in the software — will be at a competitive advantage, providing better performance in speech recognition and language understanding.
