A Modest Proposal to Prevent False Triggers on Voice Assistants

Just use internal noise cancellation to prevent voice assistants from triggering themselves with feedback

Nathaniel Watkins
Towards Data Science


That’s it. The title says it all.

Image by TechCrunch

Just kidding, I’ll explain the idea further, why I think it’s a necessary solution to the problem, and why I feel like I needed to write this article in the first place.

Two Sentence Takeaway

Voice assistants like Alexa, Cortana, and the Google Assistant can currently trigger themselves when the audio they output contains their wake word, or at least something close enough, due to audio feedback. I propose solving this by feeding the microphone input through a noise control algorithm that filters out the assistant’s own audio output.

The Problem

Have you ever been playing a podcast, the news, or even music through your assistant device (Google Home, Amazon Echo, etc.) only to have it say those dreaded words “Ok Google”/“Alexa”? Then you have no doubt experienced the problem I’m trying to solve. Essentially, these assistants are able to tickle themselves by saying their own wake word, then responding to it. Worse yet is when this happens and the wake word/phrase wasn’t even said.

Amazon’s Step in the Right Direction

Recently, Amazon’s Alexa team came up with a clever approach that could partly solve this problem, but it’s not foolproof. Essentially, they created a database of fingerprints of known false-positive “Alexa” triggers (such as Amazon commercials), then check new triggers against this database whenever a device thinks someone said the wake word. If you have an Echo device and watched the recent Superb Owl handegg game, you may have noticed that the Alexa commercial surprisingly didn’t drive your Echo bonkers. Additionally, Amazon created a way to scale this database automatically to third-party false positives: wake events from multiple customers with similar fingerprints are flagged and recorded as a “media event”.

An illustration of how fingerprints are used to match audio. Different instances of Alexa’s name result in a bit error rate of about 50% (random bit differences). A bit error rate significantly lower than 50% indicates two recordings of the same instance of Alexa’s name. Image from Amazon’s developer blog

While this should provide a noticeable improvement, this approach will not be sufficient to solve the problem entirely, especially since most of the media we consume nowadays is streamed on demand and thus unsynchronized across households, so many false triggers won’t happen at the same time. Furthermore, this approach involves comparing fingerprints in the cloud for all but the most predictable triggers, which increases the processing time and bandwidth used by the assistant. I applaud Amazon for implementing such a novel approach (though it’s not entirely novel, as I’m pretty sure Google’s individualized voice recognition model follows similar design patterns), even if it’s not all-encompassing.

But what about the noise-canceling from the multi-mic setups?

True, most (if not all) voice assistant devices nowadays come with multiple microphones to detect background noise and filter it out, focusing on your voice using a noise control technique called beamforming. However, beamforming addresses a problem that is only tangential to the one at hand. The multi-mic setup is great for hearing your commands (more or less) clearly while tuning out the loud TV in the room, but when it comes to the device reacting to commands that no human nearby ever said, it’s not the right tool for the task.

The ‘Big’ Idea

My proposal is simple really, so simple that I used to assume that all the major players in voice assistants were utilizing this approach. However, as I’ve spent more time yelling at robots in my home to tell me the news, I’ve come to realize that they either aren’t doing this or if they are, they aren’t doing it well. And after hearing anecdotes from plenty of other people, I know I’m not the only one who makes sure to listen to my favorite tech podcasts using headphones to avoid giving my overeager AI assistant the wrong idea.

Okay, so perhaps they’re not doing this with the current generation of voice assistants, but this method has to have been proposed elsewhere by now, right? Maybe, but I don’t think so. I’ve run several searches using a variety of relevant terms across a variety of platforms, from regular Google search to Google Scholar to Semantic Scholar. If there’s a research paper, patent, blog post, or just about any other publicly accessible resource out there, I feel like I should have found it; I usually find what I’m looking for. If I missed something key, I’d love to hear about it. I found plenty of resources on Active Noise Control (ANC), and plenty more on AI voice assistants, but the closest I could find for this application was this paper on noise cancellation for voice assistants. Alas, its scope was limited to external noise, as discussed above for multi-mic setups: https://pdfs.semanticscholar.org/e448/5abce7703bace8fdd3ba3e65688e1c60d827.pdf?_ga=2.71900052.669585245.1549069009-41674264.1549069009

Diagram by Silentium

A quick overview of terms:

Active Noise Control (ANC), sometimes referred to as Active Noise Cancellation, filters out unwanted noise by recording it and playing back its inverted waveform (the same signal flipped in polarity, i.e., shifted 180° in phase), thus canceling out the unwanted noise while keeping the wanted sound.
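To make the principle concrete, here’s a minimal NumPy sketch of destructive interference. The tone frequency and sample rate are arbitrary illustrations, not values from any product:

```python
import numpy as np

# A hypothetical "unwanted noise" signal: one second of a 440 Hz tone
# sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
noise = np.sin(2 * np.pi * 440 * t)

# The ANC principle: emit the polarity-inverted copy of the noise.
anti_noise = -noise

# Where the two waveforms overlap, they cancel (destructive interference).
residual = noise + anti_noise

print(np.max(np.abs(noise)))     # the noise alone has full amplitude
print(np.max(np.abs(residual)))  # the sum is silence
```

In the real, acoustic version of ANC, the hard part is getting the anti-noise to arrive at the listener’s ear at exactly the right time and level; in the all-digital case proposed below, that alignment problem largely disappears.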

Our problem effectively boils down to audio feedback, which is when a system picks up its own audio output in its audio input. In a live sound system, like in a theater, feedback quickly escalates into an ear-splitting screech because a feedback loop amplifies the high frequencies. Fortunately, our voice assistants don’t succumb to feedback loops because they don’t immediately output their input audio, but whenever your assistant triggers itself, it is reacting to feedback. My background in A/V systems is likely why I previously assumed that this method was already widely used.

How it’d work:

This system could be implemented entirely in software on most voice assistant systems, and thus deployed via an update to current-generation hardware such as Google Homes and Amazon Echos. Instead of using a separate external microphone as the input to the ANC filter, I propose using the audio feed that is being sent to the speaker. Since the device is generating that audio feed, we know with 100% certainty that we don’t need to treat any of that audio as input. This should not only prevent feedback from accidentally triggering the assistant, but also help the assistant focus on the human’s commands while audio is playing, making it easier to shout over your music when you want to talk to your assistant. Best of all, this approach never requires pinging the cloud for verification, unlike Amazon’s approach, and shouldn’t add undue processing overhead to the device, as ANC can be a relatively simple calculation. Just as most people are unable to tickle themselves because our brains know which inputs are generated by our own actions, this method would make AI assistants unable to trigger themselves.

The orange elements represent the software bits that I propose adding to the system.
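One way to sketch this in code is with a single-tap adaptive (LMS) filter that learns how loudly the speaker output couples back into the microphone and subtracts the predicted echo from the mic feed. This is an illustrative toy, not any vendor’s implementation; a real room would need a multi-tap filter to model delay and reverberation, and the signals, gains, and step size here are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

playback = rng.standard_normal(n)      # audio the assistant is outputting
voice = 0.3 * rng.standard_normal(n)   # the human command we want to keep

# Hypothetical room: the speaker output reaches the mic attenuated
# by an unknown gain.
room_gain = 0.7
mic = room_gain * playback + voice

# Single-tap LMS filter: adapt an estimate of the room gain and
# subtract the predicted echo from the mic feed, sample by sample.
g = 0.0      # estimated room gain
mu = 0.01    # LMS step size
out = np.empty(n)
for i in range(n):
    echo_est = g * playback[i]
    e = mic[i] - echo_est        # residual after cancellation
    g += mu * e * playback[i]    # LMS weight update
    out[i] = e

# After convergence, the residual is dominated by the voice signal,
# which is what the wake-word detector should actually hear.
tail_err = np.mean((out[-5000:] - voice[-5000:]) ** 2)
print(g, tail_err)
```

Because the reference signal is the exact digital feed going to the speaker, the filter only has to learn the room’s coupling, never to guess what the assistant said.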

Similar implementations of this idea:

While I wasn’t able to find any evidence of this technique used for voice assistant devices, I did find this method utilized in other applications, such as feedback suppression in pro audio sound systems or speakerphones like this expired patent: https://patents.google.com/patent/US6212273B1/en

So if this is already part of an existing AI assistant’s design, I apologize for restating the obvious, but keep reading, because there are a few ways to build on this idea which I don’t believe have been developed yet. To the best of my knowledge, though, neither this method nor anything similar is currently used in any AI assistant.

Potential roadblocks:

One possible issue is digitally introducing feedback when no significant acoustic feedback is actually being picked up by the device. Many voice assistant devices have fairly sophisticated microphone arrays and speakers designed to minimize feedback, so a simplistic implementation of this method could theoretically add new audio to the input if that same audio isn’t being picked up loudly enough by the microphone. This is due to how ANC works: simply playing back the same audio out of phase. One way to avoid this would be to scale the gain/volume of the output audio sent to the ANC stage by the level of that audio actually arriving at the microphone input. This scaling could happen dynamically as a function of the ANC algorithm, and a baseline could be set every time the device powers on, calibrating it for its specific acoustic environment.
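The gain-scaling idea might look something like the sketch below: estimate, via least squares, how loudly the playback actually shows up in the mic signal, and scale the ANC reference down to that level so that weak acoustic coupling means weak cancellation. The function name and all the signal levels are my own illustrations:

```python
import numpy as np

def scaled_reference(playback, mic, eps=1e-8):
    """Scale the output-audio reference down to the level at which it
    actually reaches the microphone, before handing it to the ANC stage.
    (Illustrative least-squares gain estimate, not a product algorithm.)
    """
    # Least-squares gain that best maps the playback onto the mic signal.
    gain = np.dot(mic, playback) / (np.dot(playback, playback) + eps)
    # Never boost: if almost no feedback is picked up, cancel almost nothing.
    gain = float(np.clip(gain, 0.0, 1.0))
    return gain * playback

rng = np.random.default_rng(1)
playback = rng.standard_normal(8000)

# Case 1: the speaker barely couples into the mic (coupling gain ~0.05),
# so the reference is scaled way down rather than injected at full level.
mic_quiet = 0.05 * playback + 0.01 * rng.standard_normal(8000)
ref_quiet = scaled_reference(playback, mic_quiet)

# Case 2: strong coupling (~0.8), so most of the reference is used.
mic_loud = 0.8 * playback + 0.01 * rng.standard_normal(8000)
ref_loud = scaled_reference(playback, mic_loud)

print(np.std(ref_quiet), np.std(ref_loud))
```

In both cases, subtracting the scaled reference shrinks the echo rather than adding new audio, which is exactly the failure mode this roadblock is about.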

Another potential downside might be the need to retrain trigger word detection algorithms to account for the filtering of the ANC algorithm. However, from what I understand of these algorithms, they would likely be robust enough to function well under those different parameters. Plus, if they were trained on very clean audio, this filtered audio would likely be much closer to the training data.


Further capabilities:

The holy grail of AI voice interfaces is Full Duplex Communication with the AI, where the AI talking and the human talking can overlap and both parties can keep up. Current voice assistants are like walkie-talkies: if you interrupt one, it has to stop talking; this is referred to as Half Duplex. While making a voice AI that is convincingly human involves much more than avoiding feedback, feedback could definitely be a major stumbling block for the AI, and this method should help get us there by preventing the AI from listening to its own words.

Additionally, while this method describes the output from one device being digitally fed through an ANC filter, there’s no reason to limit our thinking to feedback from the same device. The assistant could potentially receive an audio feed from any device within earshot. Say, for example, we didn’t want audio from a Chromecast or a smart TV confusing our assistant: those devices could digitally send their audio feeds over to the assistant, which would filter those signals out as irrelevant. As with the lack-of-acoustic-feedback issue mentioned above, it would be crucial for the devices to calibrate themselves in their environment. This could be accomplished by having each device play a quick calibration tone whenever it boots up, which the assistants within earshot then use to set the appropriate levels on the ANC filter. This would solve the problem that Amazon’s fingerprint database attempts to address, by simply having the TV tell the assistant to ignore all audio coming from it, including any potential triggers.
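The calibration step could work roughly like this sketch: the other device plays a known chirp at boot, and the assistant cross-correlates its mic signal against that chirp to recover the acoustic delay and coupling gain, which it would then apply to the digitally received feed before subtracting it. The chirp parameters, delay, and gain here are invented for illustration:

```python
import numpy as np

sr = 16000
t = np.arange(sr // 4) / sr
# Hypothetical calibration chirp the TV plays at boot (rising tone sweep).
chirp = np.sin(2 * np.pi * (500 + 2000 * t) * t)

# What the assistant's mic hears: the chirp delayed by the acoustic path
# and attenuated by the room, plus a little background noise.
true_delay, true_gain = 37, 0.4
mic = np.zeros(len(chirp) + 100)
mic[true_delay:true_delay + len(chirp)] += true_gain * chirp
mic += 0.005 * np.random.default_rng(2).standard_normal(len(mic))

# Cross-correlate to recover the delay, then least-squares for the gain.
corr = np.correlate(mic, chirp, mode="valid")
delay = int(np.argmax(corr))
seg = mic[delay:delay + len(chirp)]
gain = float(np.dot(seg, chirp) / np.dot(chirp, chirp))

print(delay, gain)
```

With the delay and gain in hand, the assistant can time-align and scale the TV’s digital feed so that subtracting it removes the TV’s audio as heard at the mic, rather than injecting a mismatched copy.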

Why Present This Proposal

With the level of detail that I’ve developed for this idea and my research into similar applications, I could probably successfully file for a patent. But I believe in sharing good ideas and as far as I could tell, this idea has never been shared, even though it seemed incredibly obvious to me. Plus, this should count as prior art to preempt anyone else from filing a patent for the application of internal ANC to voice assistants. I don’t have the resources to execute on every idea I have, so why not put them out there and see if anyone else can benefit? I’d love to hear from you if you end up using this idea or if you’d like to work together on something.

If any experts in ANC, HCI (voice), Conversational AI, or any other related field are reading who would like to explore this deeper, please let me know. I’d be open to researching this further or testing it out. Perhaps I missed some key factor or work already proposing this method. If so, I’d really like to find out about it.

Also, I have a tangential idea for preventing accidental AI triggers using ultrasonic frequencies. Let me know if anyone would be interested in hearing more about it, and I’d be happy to develop the idea further.

I look forward to hearing any feedback or questions anyone has on this article or the topics discussed, either in the responses here or on social media. Feel free to connect with me (just let me know that you saw this article) →

twitter.com/theNathanielW

linkedin.com/in/theNathanielWatkins


A dedicated manager turned Data Scientist, enthusiastic for the innovative ways technology can impact people.