A Very Nice Way To Build an “Ok Google” Vocal Application… Without Google

Published in Towards Data Science · 5 min read · Jul 14, 2021

Voice-powered applications are growing in number and effectiveness, thanks to better NLP models and consumer home hardware like Alexa or Google Home.

According to Google, 27% of the online global population is using voice search on mobile. The trend is rising and, once you start using it, it quickly feels fast and natural.

So, for a personal project I’m developing (more on this in a few weeks), I decided to build a voice interface to perform some actions. My requirements were:

Have a wake-up word, exactly like “Hey Google”, to start the voice input phase
Have a structured data model, such as an “intent” linked to command execution, instead of just catching words with a speech-to-text library and writing tons of “if word == ‘this’ do that” checks. In other words, a chatbot-style approach like Dialogflow or Lex.
Work offline for privacy reasons, but without building and deploying a TensorFlow or PyTorch solution locally, which is certainly not a trivial task
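
To make the second requirement concrete, here is a minimal sketch of what "intents linked to command execution" can look like in code. The `Intent` class, the handler registry and the intent names (`changeColor`, `turnOff`) are all hypothetical, invented for illustration; they are not part of any library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Intent:
    """Hypothetical structured result: an intent name plus named parameters."""
    name: str
    slots: Dict[str, str] = field(default_factory=dict)


# Registry mapping an intent name to the function that executes it.
HANDLERS: Dict[str, Callable[[Intent], str]] = {}


def handles(intent_name: str):
    """Decorator that registers a handler for one intent name."""
    def register(func):
        HANDLERS[intent_name] = func
        return func
    return register


@handles("changeColor")
def change_color(intent: Intent) -> str:
    # Fall back to white when the color slot was not captured.
    return f"lights set to {intent.slots.get('color', 'white')}"


@handles("turnOff")
def turn_off(intent: Intent) -> str:
    return "lights off"


def dispatch(intent: Intent) -> str:
    """Run the handler for an intent, or report that none is registered."""
    handler = HANDLERS.get(intent.name)
    if handler is None:
        return f"no handler for intent '{intent.name}'"
    return handler(intent)
```

Adding a new command then means registering one more handler, rather than growing a chain of string comparisons.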

So, just when I was resigned to trading privacy and control for ease of use (the same old story, after all), I discovered Picovoice!

Enter Picovoice

Picovoice is, quoting from their website, “the end-to-end platform for building voice products on your terms”.

Its most relevant feature is that it can be deployed on a specific device and run locally, so the solution works without an Internet connection.

This is possible because a model can be trained online, using a dedicated web application (the Picovoice Console), and then simply downloaded and used by the application with no further connection.

But let’s see how it works in more detail.

First of all, it has four different engines:

Porcupine Wake Word Engine
Rhino Speech to Intent Engine
Cheetah and Leopard Speech to Text Engine

As their names imply, the first one handles a wake-up word and triggers an action when it is detected; the second understands intents, based on the detection of specific phrases; the other two provide generic speech-to-text.

I’m only using the first two, but I will probably check out one of the others, because it would be great to have a complete, well-working offline solution for speech-to-text too.

Let’s dig into it!

Porcupine

The Console is very straightforward. You input a wake-up word, choose a language and train the model.

Once it’s ready, you can immediately test it in the Console (or test some of the default ones already available, like “Hey Google”, “Alexa”, “Jarvis” and more).

We’ll see in a moment how to use it inside an application.
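
As a first taste, here is a minimal sketch of the loop a wake-word application runs: audio is cut into fixed-size frames and each frame is handed to the engine, which answers with the index of the detected keyword or -1. Porcupine's Python binding follows the same general shape (a frame length and a `process` call), but the constructor arguments depend on the SDK version, so check the documentation before wiring in the real engine. The `LoudFrameEngine` below is a made-up stand-in so the example runs without any model or microphone.

```python
import struct
from typing import Iterator, List, Sequence


def frames(raw: bytes, frame_length: int) -> Iterator[List[int]]:
    """Split raw 16-bit little-endian mono PCM into fixed-size frames.

    A trailing partial frame is dropped.
    """
    bytes_per_frame = frame_length * 2
    for start in range(0, len(raw) - bytes_per_frame + 1, bytes_per_frame):
        chunk = raw[start:start + bytes_per_frame]
        yield list(struct.unpack(f"<{frame_length}h", chunk))


def detect_wake_words(engine, raw: bytes, keywords: Sequence[str]) -> List[str]:
    """Feed every frame to the engine and collect the keywords it reports.

    `engine` is assumed to expose `frame_length` and `process(frame)`,
    where `process` returns a keyword index or -1 when nothing was heard.
    """
    detected = []
    for frame in frames(raw, engine.frame_length):
        index = engine.process(frame)
        if index >= 0:
            detected.append(keywords[index])
    return detected


class LoudFrameEngine:
    """Stand-in for illustration only: 'detects' keyword 0 on any loud frame."""
    frame_length = 4

    def process(self, pcm: Sequence[int]) -> int:
        return 0 if max(abs(sample) for sample in pcm) > 1000 else -1
```

In a real application the bytes would come from a microphone stream rather than a buffer, and the stand-in would be replaced by the engine loaded with the model downloaded from the Console.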

Rhino

The Console is very intuitive here too, but things are a bit more complicated because we need to declare the intents, possibly with variables inside them.

Let’s see an example, pretending to build a home automation solution to control lights.

In this example there are three intents, but let’s focus on the first one: it’s associated with several phrases that trigger it, each composed of alternative words, optional words and possible parameters, called slots.

So, to trigger the first intent I can just say “Turn light green”, “Make lights green”, “Switch all lights to green” and so on.
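
To show how alternatives, optional words and slots combine into many accepted phrases, here is a toy expander. The notation (`[a|b]` for alternatives, `(word)` for an optional word, `$name` for a slot) is invented for this sketch and is not the Console's actual syntax; the expression and the color list are assumptions too.

```python
import itertools
from typing import Dict, List, Optional, Tuple

# Hypothetical slot values and expression, loosely mirroring the lights example.
SLOTS = {"color": ["green", "red", "blue"]}
EXPRESSION = "[turn|make|switch] (all) [light|lights] (to) $color"


def token_choices(token: str, slots: Dict[str, List[str]]) -> List[Tuple[str, dict]]:
    """Return (spoken text, captured slots) pairs for one token."""
    if token.startswith("[") and token.endswith("]"):   # alternatives
        return [(word, {}) for word in token[1:-1].split("|")]
    if token.startswith("(") and token.endswith(")"):   # optional word
        return [("", {}), (token[1:-1], {})]
    if token.startswith("$"):                           # slot
        name = token[1:]
        return [(value, {name: value}) for value in slots[name]]
    return [(token, {})]                                # literal word


def expand(expression: str, slots: Dict[str, List[str]]) -> Dict[str, dict]:
    """Map every phrase the expression accepts to the slots it captures."""
    per_token = [token_choices(token, slots) for token in expression.split()]
    phrases = {}
    for combo in itertools.product(*per_token):
        text = " ".join(word for word, _ in combo if word)
        captured = {}
        for _, update in combo:
            captured.update(update)
        phrases[text] = captured
    return phrases


def understand(utterance: str, phrases: Dict[str, dict]) -> Optional[dict]:
    """Return the captured slots, or None when the phrase is not accepted."""
    return phrases.get(utterance.lower().strip())
```

With these assumptions, one short expression already covers 72 distinct phrases, including the three quoted above. The real engine works on audio rather than on text, so this only illustrates the combinatorics, not how recognition is done.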

Since the model is trained in advance and then used locally, the slots also have to be decided in advance. This, of course, can be a drawback if things change, because a new model needs to be trained with the new elements (in this example, if new colors become available).

In that case, a speech-to-text solution can work better, because no retraining is needed.

Once everything is set, it’s possible to test the model almost immediately. Just click on the mic and talk!

Finally, it’s possible to download the model and use it locally on a target device.

Note that it can be used, with some restrictions, for personal use, but there is an Enterprise license too.

But then what? How can these models be used? Simply by building an application that uses the Picovoice API and SDK.

There is exhaustive documentation and several options are available, so getting started should be very fast.

Screenshot taken from the documentation website
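
Whatever SDK flavor you pick, the application logic boils down to a two-stage loop: wait for the wake word, then listen until an intent is inferred, then go back to waiting. The sketch below mimics that flow with plain callbacks. The `VoicePipeline` class and its parameters are hypothetical stand-ins written for this article, not the SDK's real classes, so treat it as an illustration of the control flow only.

```python
from typing import Any, Callable, Optional


class VoicePipeline:
    """Two-stage loop: wait for the wake word, then listen for one command."""

    def __init__(
        self,
        detect_wake: Callable[[Any], bool],
        infer_intent: Callable[[Any], Optional[dict]],
        on_wake: Callable[[], None],
        on_inference: Callable[[dict], None],
    ) -> None:
        self._detect_wake = detect_wake      # frame -> True when the wake word is heard
        self._infer_intent = infer_intent    # frame -> result, or None while still listening
        self._on_wake = on_wake
        self._on_inference = on_inference
        self._awake = False

    def process(self, frame: Any) -> None:
        """Route one audio frame to the stage that is currently active."""
        if not self._awake:
            if self._detect_wake(frame):
                self._awake = True
                self._on_wake()
            return
        result = self._infer_intent(frame)
        if result is not None:
            # The command is complete: report it and go back to sleep.
            self._awake = False
            self._on_inference(result)


# Usage with fake "frames" (plain strings) standing in for audio.
events = []
pipeline = VoicePipeline(
    detect_wake=lambda frame: frame == "wake",
    infer_intent=lambda frame: (
        {"intent": "changeColor", "slots": {"color": "green"}}
        if frame == "done" else None
    ),
    on_wake=lambda: events.append("wake"),
    on_inference=lambda result: events.append(result["intent"]),
)
for frame in ["noise", "done", "wake", "speech", "done", "noise"]:
    pipeline.process(frame)
```

Note that the first "done" frame is ignored because the pipeline is still asleep; commands only count after the wake word.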

Conclusions

OK, but should I use it? Yes, definitely.

First and most important, it works very well at word detection, on par with Google and AWS services, maybe even better.

Then, running it locally brings advantages not just in terms of privacy but also of performance, as there is no network latency.

Also, being able to try the model online, without downloading and testing it locally, is a real time saver.

Finally, it can be deployed on several different platforms (Linux, Windows, macOS, Android, iOS, Raspberry Pi and other, more exotic ones), making it very flexible.

So, if you need to build a voice-based application or just add some voice features to your app, Picovoice is a great choice: it may be a little more complicated than using a managed service, but the performance and privacy are worth the extra effort.

Written by Antonello Calamea