How does Amazon Alexa work? Your guide to Natural Language Processing (AI)

Alexandre Gonfalonieri
Towards Data Science
8 min read · Nov 21, 2018


We can talk to almost all of our smart devices now, but how does that work? When you ask “What song is this?”, what technologies are being used?

How does Alexa work?

According to Adi Agashe, Program Manager at Microsoft, Alexa is built on natural language processing (NLP), a process for converting speech into words, sounds, and ideas.

  • Amazon records your words. Because interpreting sounds takes a lot of computational power, the recording of your speech is sent to Amazon’s servers, where it can be analyzed more efficiently.

Computational power: the speed at which instructions are carried out, normally expressed in terms of kiloflops, megaflops, etc.

  • Amazon breaks down your “orders” into individual sounds. It then consults a database containing various words’ pronunciations to find which words most closely correspond to the combination of individual sounds.
  • It then identifies important words to make sense of the task and carry out the corresponding functions. For instance, if Alexa notices words like “sport” or “basketball”, it would open the sports app (a toy sketch of this keyword matching follows below).
  • Amazon’s servers send the information back to your device and Alexa may speak. If Alexa needs to say anything back, it goes through the same process described above, but in reverse order.
    (source)
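
To make that keyword-matching step concrete, here is a toy sketch in Python; the keyword lists and app names are invented for illustration and have nothing to do with Amazon’s actual implementation:

```python
# A toy keyword router: maps "important words" in a transcribed request
# to the app that should handle it. The keyword lists and app names
# below are invented for illustration only.
KEYWORD_TO_APP = {
    "sport": "sports app",
    "basketball": "sports app",
    "song": "music app",
    "weather": "weather app",
}

def route_request(transcript: str) -> str:
    """Return the app that should handle the transcribed request."""
    words = transcript.lower().replace("?", "").split()
    for word in words:
        if word in KEYWORD_TO_APP:
            return KEYWORD_TO_APP[word]
    return "fallback: ask the user to rephrase"

print(route_request("What basketball games are on tonight?"))  # -> sports app
print(route_request("What song is this?"))                      # -> music app
```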

In-depth explanation

According to Trupti Behera, “It starts with signal processing, which gives Alexa as many chances as possible to make sense of the audio by cleaning the signal. Signal processing is one of the most important challenges in far-field audio.

The idea is to improve the target signal, which means being able to identify ambient noise like the TV and minimize it. To resolve these issues, seven microphones are used to identify roughly where the signal is coming from so the device can focus on it. Acoustic echo cancellation can then subtract the unwanted signal so that only the important signal remains.

The next task is wake word detection: the device determines whether the user has said one of the words it is programmed to turn on for, such as “Alexa”. Accuracy matters here, because false positives and false negatives could lead to accidental purchases and angry customers. This is genuinely hard, as the system needs to handle differences in pronunciation, and it needs to do so on the device, which has limited CPU power.

If the wake word is detected, the signal is then sent to the speech recognition software in the cloud, which takes the audio and converts it to text. The output space here is huge, since it covers every word in the English language, and the cloud is the only technology capable of scaling sufficiently. This is further complicated by the number of people who use the Echo for music: many artists spell their names differently from the ordinary words they sound like.

Image source: Amazon.com
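
Alexa’s cloud ASR is proprietary, but the general pattern of sending recorded audio to a cloud recognizer and getting text back can be sketched with the open-source SpeechRecognition package for Python, which here calls Google’s free web speech API purely as a stand-in (the file name is a placeholder):

```python
# Sketch: send recorded audio to a cloud speech recognizer and get text back.
# This uses the open-source SpeechRecognition package and Google's web API
# as a stand-in for Alexa's proprietary ASR; "request.wav" is a placeholder file.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("request.wav") as source:
    audio = recognizer.record(source)  # read the whole file into memory

try:
    text = recognizer.recognize_google(audio)  # network call to the cloud ASR
    print("Transcript:", text)
except sr.UnknownValueError:
    print("The recognizer could not understand the audio")
except sr.RequestError as err:
    print("Could not reach the recognition service:", err)
```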

To convert the audio into text, Alexa will analyze characteristics of the user’s speech such as frequency and pitch to give you feature values.

A decoder will determine the most likely sequence of words, given the input features and the model, which is split into two pieces. The first piece is the prior, which gives you the most likely word sequence based on a huge amount of existing text, without looking at the features; the other is the acoustic model, which is trained with deep learning on pairings of audio and transcripts. These are combined and dynamic decoding is applied, which has to happen in real time.” (source)
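
As a rough sketch of how those two pieces can be combined, the snippet below scores a few made-up candidate transcriptions by adding an acoustic-model log-probability to a language-model prior and keeps the best one; the numbers are invented, and real decoders search over word lattices rather than a handful of full sentences:

```python
import math

# Made-up candidate transcriptions for the same audio, with illustrative
# log-probabilities: the acoustic model scores how well each matches the
# audio, the language-model "prior" scores how plausible the word sequence is.
candidates = {
    "play hey jude": {"acoustic": math.log(0.20), "prior": math.log(0.10)},
    "play hey dude": {"acoustic": math.log(0.22), "prior": math.log(0.01)},
    "pray hey jude": {"acoustic": math.log(0.18), "prior": math.log(0.001)},
}

def decode(cands, lm_weight=1.0):
    """Return the candidate with the highest combined log-score."""
    return max(
        cands,
        key=lambda c: cands[c]["acoustic"] + lm_weight * cands[c]["prior"],
    )

print(decode(candidates))  # -> "play hey jude"
```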

Analysis of an “order”

[Image: a sample voice command broken into its wake word, invocation name, and utterance (Source)]

The above command has three main parts: the wake word, the invocation name, and the utterance (this part is adapted from Kiran Krishnan’s article); a small parsing sketch follows the list below.

  • Wake word
    The word that wakes up the device when users say ‘Alexa’. The wake word puts Alexa into listening mode, ready to take instructions from the user.
  • Invocation name
    The invocation name is the keyword used to trigger a specific “skill”. Users can combine the invocation name with an action, command, or question. Every custom skill must have an invocation name that starts it.

Alexa “skills”: voice-driven Alexa capabilities.

  • Utterance
    ‘Taurus’ is an utterance. Utterances are the phrases users say when making a request to Alexa. Alexa identifies the user’s intent from the given utterance and responds accordingly. So, basically, the utterance decides what the user wants Alexa to do.
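
A heavily simplified parse of such a command might look like the sketch below; it assumes the common “Alexa, ask <invocation name> <utterance>” phrasing, and the example command is illustrative:

```python
import re

# A simplified parse of a spoken command of the form
# "Alexa, ask <invocation name> <utterance>". Real skill requests support
# many more launch phrasings; this pattern and the example are illustrative.
COMMAND_PATTERN = re.compile(
    r"^(?P<wake_word>alexa)[, ]+ask\s+(?P<invocation>\w+(?:\s+\w+)?)\s+(?P<utterance>.+)$",
    re.IGNORECASE,
)

def parse_command(command: str):
    match = COMMAND_PATTERN.match(command.strip())
    if not match:
        return None
    return {
        "wake_word": match.group("wake_word"),
        "invocation_name": match.group("invocation"),
        "utterance": match.group("utterance"),
    }

print(parse_command("Alexa, ask daily horoscope about Taurus"))
# {'wake_word': 'Alexa', 'invocation_name': 'daily horoscope',
#  'utterance': 'about Taurus'}
```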

Afterwards, the Alexa-enabled device sends the user’s instruction to a cloud-based service called Alexa Voice Service (AVS).

Think of the Alexa Voice Service as the brain of Alexa-enabled devices: it performs all the complex operations such as Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU).

The Alexa Voice Service processes the request, identifies the user’s intent, and then makes a web service request to a third-party server if needed.
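
On the skill developer’s side, the resolved intent eventually reaches a backend such as an AWS Lambda function. The sketch below trims the request and response JSON to their essentials; the “HoroscopeIntent” and its “Sign” slot are invented for this example:

```python
# Minimal sketch of an Alexa skill backend (e.g. an AWS Lambda handler).
# The request/response JSON is trimmed to its essentials, and the
# "HoroscopeIntent" and its "Sign" slot are invented for this example.
def lambda_handler(event, context):
    request = event["request"]

    if request["type"] == "IntentRequest" and request["intent"]["name"] == "HoroscopeIntent":
        sign = request["intent"]["slots"]["Sign"]["value"]  # e.g. "Taurus"
        speech = f"Here is today's horoscope for {sign}."
    else:
        speech = "Sorry, I didn't understand that."

    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }
```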

What is NLP?

It’s a convergence of artificial intelligence and computational linguistics that handles interactions between machines and the natural languages of humans, in which computers are required to analyze, understand, alter, or generate natural language.

NLP helps computers engage in communication using natural human language in many forms, including but not limited to speech and writing.

“Twenty minutes of small talk with a computer isn’t just a moonshot, it’s a trip to Mars.”

In this article, I found an interesting part that says “Understanding human language is considered a difficult task due to its complexity. For instance, there is an infinite number of different ways to arrange words in a sentence. Also, words can have several meanings and contextual information is necessary to correctly interpret sentences.”

At the start, the system gets an input of natural language.

Natural language: any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages can take different forms, such as speech or signing

Next, the system converts this input into an artificial representation, for example through speech recognition. At this point the data is in textual form, which NLU (Natural Language Understanding) processes to extract the meaning.

A good rule is to use the term NLU if you’re just talking about a machine’s ability to understand what we say. NLU is actually a subset of the wider world of NLP.

Hidden Markov Model (speech recognition example):

In voice recognition, this model compares each part of the waveform against what comes before and what comes after, and against a dictionary of waveforms to figure out what’s being said.

Waveform: the shape of the speech signal over time; in voiced speech it results from the periodic vibration of the vocal folds

A hidden Markov model (HMM) is one in which you observe a sequence of emissions, but do not know the sequence of states the model went through to generate the emissions. Analyses of hidden Markov models seek to recover the sequence of states from the observed data.
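
As a toy illustration of recovering hidden states from observations, here is a small Viterbi decoder in plain Python; the states, observations, and probabilities are invented, whereas a real recognizer would work with phoneme states and acoustic feature vectors:

```python
# Toy Viterbi decoding for a hidden Markov model: recover the most likely
# hidden state sequence from observed emissions. States, observations and
# probabilities are invented for illustration.
states = ["silence", "speech"]
observations = ["quiet", "loud", "loud"]

start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech":  {"silence": 0.2, "speech": 0.8},
}
emit_p = {
    "silence": {"quiet": 0.9, "loud": 0.1},
    "speech":  {"quiet": 0.2, "loud": 0.8},
}

def viterbi(obs):
    # V[t][s] holds (best probability of reaching state s at time t, best previous state)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = (prob, prev)
    # Backtrack from the most probable final state.
    last = max(V[-1], key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, V[t][path[0]][1])
    return path

print(viterbi(observations))  # -> ['silence', 'speech', 'speech']
```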


For Trevor Jackins, Marketing Specialist at NeoSpeech, “It tries to understand what you said by taking the voice data and breaking it down into small samples of a particular time duration, mostly 10–20 ms. These samples are then compared to pre-fed speech to decode what you said in each unit of your speech. The purpose here is to find the phonemes (the smallest units of speech). Then, the machine looks at the series of such phonemes and statistically determines the most likely words and sentences that were spoken.” (Source)
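
That framing step is easy to sketch with NumPy: below, a signal is cut into 20 ms frames, the unit on which feature extraction and phoneme matching would then operate (the “audio” is a synthetic sine wave standing in for real recorded speech):

```python
import numpy as np

# Split an audio signal into short frames (here 20 ms), the units on which
# feature extraction and phoneme matching operate. The "audio" below is a
# synthetic sine wave standing in for real recorded speech.
sample_rate = 16000                      # 16 kHz, a common rate for speech
duration_s = 1.0
t = np.linspace(0, duration_s, int(sample_rate * duration_s), endpoint=False)
audio = np.sin(2 * np.pi * 220 * t)      # 220 Hz tone as a stand-in signal

frame_ms = 20
frame_len = int(sample_rate * frame_ms / 1000)           # 320 samples per frame
n_frames = len(audio) // frame_len
frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)

# A crude per-frame feature: energy (real systems use MFCCs and similar).
energy = (frames ** 2).mean(axis=1)
print(frames.shape, energy[:3])          # (50, 320) and the first few frame energies
```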

Then NLU digs into each word, trying to work out whether it is a noun or a verb, which tense is being used, and so on. This process is known as POS (Part-of-Speech) tagging.
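
Part-of-speech tagging is readily available in open-source toolkits. Here is a quick sketch with NLTK, one option among many and of course not Alexa’s internal tagger:

```python
import nltk

# Part-of-speech tagging with NLTK, one of several open-source options
# (not Alexa's internal tagger). The downloads fetch the tokenizer and
# tagger models on first run; on newer NLTK releases the resources are
# named "punkt_tab" and "averaged_perceptron_tagger_eng" instead.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Alexa, play some relaxing jazz in the kitchen")
print(nltk.pos_tag(tokens))
# e.g. [('Alexa', 'NNP'), (',', ','), ('play', 'VB'), ('some', 'DT'), ...]
```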

According to Pramod Chandrayan, CPO at EasyGov, “NLP systems also have a lexicon (a vocabulary) and a set of grammar rules coded into the system. Modern NLP algorithms use statistical machine learning to apply these rules to the natural language and determine the most likely meaning behind what you said.” (source)

According to Lola.com, “To build machines that understand natural language, it is necessary to distill speech using a combination of rules and statistical modeling. Entities must be extracted, identified, and resolved, and semantic meaning must be derived within context and used for identifying intents. For example, a simple phrase such as “I need a flight and hotel in Paris from December 5 to 10” must be parsed and given structure:

need:flight {intent} / need:hotel {intent} / Paris {city} / DEC 5 {date} / DEC 10 {date} / sentiment: 0.5723 (neutral)”

(source)
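
A heavily simplified, rule-based version of that parse (with hard-coded vocabularies, and nothing like a production NLU pipeline, which would use trained models) could look like this:

```python
import re

# A heavily simplified, rule-based parse of the example phrase above.
# The vocabularies and patterns are hard-coded for illustration; real NLU
# systems use trained statistical models for intents and entities.
INTENT_WORDS = {"flight": "need:flight", "hotel": "need:hotel"}
KNOWN_CITIES = {"paris", "london", "new york"}
MONTHS = "january|february|march|april|may|june|july|august|september|october|november|december"

def parse(utterance: str):
    text = utterance.lower()
    intents = [intent for word, intent in INTENT_WORDS.items() if word in text]
    cities = [c.title() for c in KNOWN_CITIES if c in text]
    # Matches "december 5 to 10" style date ranges.
    dates = []
    m = re.search(rf"({MONTHS})\s+(\d{{1,2}})\s+to\s+(\d{{1,2}})", text)
    if m:
        month = m.group(1).title()
        dates = [f"{month} {m.group(2)}", f"{month} {m.group(3)}"]
    return {"intents": intents, "cities": cities, "dates": dates}

print(parse("I need a flight and hotel in Paris from December 5 to 10"))
# {'intents': ['need:flight', 'need:hotel'], 'cities': ['Paris'],
#  'dates': ['December 5', 'December 10']}
```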

For Bernard Marr, Author, Keynote Speaker and Advisor, “When Alexa makes a mistake in interpreting your request, that data is used to make the system better the next time. Machine learning is the reason for the rapid improvement in the capabilities of a voice-activated user interface.” (source)

On the Amazon website, we can read that “With natural language understanding (NLU), computers can deduce what a speaker actually means, and not just the words they say. Basically, it is what enables voice technology like Alexa to infer that you’re probably asking for a local weather forecast when you ask, “Alexa, what’s it like outside?”

Today’s voice-first technologies are built with NLU, which is artificial intelligence centered on recognizing patterns and meaning within human language. Natural Language Processing with voice assistants as its proxy has already redefined how we interact with technology, in the home and otherwise.” (source)
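
To illustrate the “recognizing patterns” idea, here is a tiny statistical intent classifier built with scikit-learn; the intent names and training utterances are made up, but the principle is the one production systems apply at vastly larger scale:

```python
# A tiny statistical intent classifier in scikit-learn, trained on a few
# invented example utterances. Real assistants train on vastly more data,
# but the pattern-recognition idea is the same.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

training_utterances = [
    "what's it like outside", "will it rain today", "how hot is it outside",
    "play some jazz", "put on my workout playlist", "play the latest hits",
    "set a timer for ten minutes", "start a five minute timer",
]
intents = [
    "GetWeather", "GetWeather", "GetWeather",
    "PlayMusic", "PlayMusic", "PlayMusic",
    "SetTimer", "SetTimer",
]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(training_utterances, intents)

print(model.predict(["alexa what is it like outside"]))  # -> ['GetWeather']
print(model.predict(["play something relaxing"]))        # -> ['PlayMusic']
```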

You can have a look into the code behind an Alexa device here:

For more information:
