
How Does an ASR System Handle Never-Before-Seen Words?

Lessons from language: The first in an ongoing series from Dialpad's Data Science Research team

First off, if you’re new to automatic speech recognition (ASR), be sure to check out my colleague’s ASR 101 post. That info will come in handy because today, we’re going to go a step further and look at how to train an ASR system, specifically to learn new words that it has never seen before.

There are three main components of a traditional ASR (automatic speech recognition) system:

  • the acoustic model,
  • the language model, and
  • the pronunciation dictionary.

The pronunciation dictionary contains words and their associated pronunciation (or multiple pronunciations in some cases). In this type of system, any word that isn’t included in this dictionary will never be output into a transcript.
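To make that concrete, here is a minimal sketch of what a pronunciation dictionary looks like as a data structure. The entries, the ARPAbet-style phoneme strings, and the helper name `can_be_transcribed` are illustrative assumptions, not Dialpad's actual dictionary or code.

```python
# Toy pronunciation dictionary: each word maps to one or more pronunciations.
# Phoneme strings are ARPAbet-style and purely illustrative.
PRONUNCIATIONS = {
    "dial": ["D AY AH L"],
    "pad": ["P AE D"],
    "either": ["IY DH ER", "AY DH ER"],  # a word with two pronunciations
}


def can_be_transcribed(word: str) -> bool:
    """Return True if the word has an entry, so the system could output it."""
    return word.lower() in PRONUNCIATIONS


print(can_be_transcribed("either"))   # has an entry
print(can_be_transcribed("Dialpad"))  # no entry, so it can never appear in a transcript
```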

And even with a dictionary of thousands upon thousands of words, we need a way to keep updating it so that we have the most accurate transcripts for our customers. (Our vocabulary and slang are always evolving!)

So, if our transcription system comes across a word in the audio that is not in the dictionary, it will essentially take a guess at what was said using words which are in the dictionary and which sound similar to what was said.

Our company name, for example, would likely come out as "dial pad," while when COVID first appeared, it was transcribed as "covet."

One of the ways in which we train our models is by sending audio out to a team of transcribers who transcribe it, which gives us a highly accurate representation of what was said in the audio. We can compare the data we get from this team to our own dictionary to identify words that aren’t currently in the dictionary.

On average, about 6 out of every 1,000 words aren’t already present in our dictionary, and we call these "out of vocabulary" words, or OOV.
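For scale, 6 out of every 1,000 words is an OOV rate of 0.6%. The comparison itself can be sketched roughly as below, assuming the human transcripts arrive as plain strings and the dictionary's words are available as a set. The function name `find_oovs` and the simple tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter


def find_oovs(transcripts, vocabulary):
    """Count words in human transcripts that are missing from the vocabulary.

    Returns (Counter of OOV words, OOV rate as a fraction of all tokens).
    """
    oov_counts = Counter()
    total_tokens = 0
    for text in transcripts:
        # Naive tokenizer: lowercase runs of letters and apostrophes.
        for token in re.findall(r"[a-z']+", text.lower()):
            total_tokens += 1
            if token not in vocabulary:
                oov_counts[token] += 1
    oov_rate = sum(oov_counts.values()) / total_tokens if total_tokens else 0.0
    return oov_counts, oov_rate


vocabulary = {"we", "need", "to", "call", "about", "the", "invoice"}
oovs, rate = find_oovs(["We need to call Kyra about the invoice"], vocabulary)
print(oovs, rate)
```

In this toy input, one of the eight words ("kyra") is missing, so the rate is 0.125. On real data, the same ratio is what produces the "6 in 1,000" figure.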

What’s in an OOV? (There are 6 types)

You may be wondering, what words could possibly be missing from a dictionary that already contains thousands upon thousands of entries? Well, we did an analysis on a small subset of them to find out. They mainly fit into six different categories:

1. Person’s name – 46%

The largest share of OOVs, nearly half, are the names of people. People’s names present quite a challenge for ASR systems for a variety of reasons.

Many first and last names have multiple spellings and pronunciations. There’s also simply a vast number of different names, especially as we strive to include names from many different origins, not just those common in Western countries.

As an example, here is a list of all the accepted spellings of the name "Kira": Kaira, Keera, Keira, Kiera, Kyra, Kyrah, Kyrha, Kyria, Kyrra, Kirra, and Ciara.

2. True OOVs – 24%

About one quarter of the words are what we determine to be "true" OOVs. These are English words that haven’t made it into our system, likely due to their rare usage or being more recently accepted into our verbiage.

A few examples of "rare-usage" words that you might’ve come across include "subtropical" and "embeddings," while a good example of a word that has recently exploded in usage might be "gaslighting."

3. Initialisms – 20%

About 20% of the words we reviewed turned out to be initialisms, which are similar to acronyms but are pronounced as the letters that make them up rather than pronounced as a word.

For example, NASA is an acronym, whereas CRM is an initialism – and ASAP can be both! As anyone who has started a new job recently can tell you, businesses like to use a lot of initialisms!

We want to recognize them so that we can format them correctly and tell them apart from times when a speaker is actually spelling a word out.

We need to get this to the CEO by EOD.

versus

My email address is spelled J-O-N-E-S at Dialpad dot com.
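One way to picture the difference is as a formatting step over the recognized tokens. The sketch below is a simplified assumption about how that could work, not a description of Dialpad's pipeline: it treats a run of two or more single letters as an initialism if it appears in a known list, and as a spelled-out word otherwise. `KNOWN_INITIALISMS` and `format_letter_runs` are hypothetical names.

```python
# Hypothetical list of initialisms the system already knows about.
KNOWN_INITIALISMS = {"CEO", "EOD", "CRM"}


def format_letter_runs(tokens):
    """Join runs of single letters: 'c e o' -> 'CEO', 'j o n e s' -> 'J-O-N-E-S'."""
    out = []
    run = []

    def flush():
        if len(run) >= 2:
            joined = "".join(run).upper()
            if joined in KNOWN_INITIALISMS:
                out.append(joined)  # known initialism
            else:
                out.append("-".join(letter.upper() for letter in run))  # spelled out
        else:
            out.extend(run)  # a lone letter such as "a" is left untouched
        run.clear()

    for token in tokens:
        if len(token) == 1 and token.isalpha():
            run.append(token)
        else:
            flush()
            out.append(token)
    flush()
    return out


print(" ".join(format_letter_runs("we need to get this to the c e o by e o d".split())))
print(" ".join(format_letter_runs("my email address is spelled j o n e s at dialpad dot com".split())))
```

A rule this simple will stumble on cases like the words "a" and "I" sitting next to other letters, which is part of why having initialisms in the dictionary matters.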

4. Company names – 8%

Company names generally fall into two categories that we see:

  • Those which are pronounced like a common word or words – think LinkedIn = linked in and Zoom = zoom, and
  • Those that are not, like Deliveroo and Pinterest.

No matter the pronunciation, we need to be able to recognize these names to transcribe and format them properly. An example of a company name we came across was Spiceworks.

Without knowing there is a company with that name, our transcripts would simply transcribe it as "spice works."
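As a rough illustration of why knowing the name matters, here is a sketch that rewrites known word sequences into their company-name form after transcription. This post-processing approach is an assumption made for the example (the article's actual fix is adding the name to the dictionary), and `COMPANY_NAMES` and `apply_company_names` are hypothetical.

```python
# Hypothetical mapping from spoken word sequences to formatted company names.
COMPANY_NAMES = {
    ("spice", "works"): "Spiceworks",
    ("dial", "pad"): "Dialpad",
    ("pinterest",): "Pinterest",
}


def apply_company_names(tokens, names=COMPANY_NAMES):
    """Replace known token sequences with their company-name spelling."""
    max_len = max((len(key) for key in names), default=0)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest possible match first.
        for n in range(max_len, 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in names:
                out.append(names[key])
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


print(" ".join(apply_company_names("i posted it on spice works".split())))
```

Names that sound like ordinary words (LinkedIn, Zoom) are deliberately left out of the toy mapping: a blind replacement would also rewrite sentences where "linked in" or "zoom" are just regular words, so those need context to get right.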

5. Linguistic innovations – 2%

The last category of English OOVs, and admittedly my personal favourite, is the linguistic innovations of our speakers, otherwise known as "words that don’t actually exist but were spoken by someone."

Humans are quite adept at molding words to convey meaning without disrupting the understanding of the listener! A good example of this that I came across recently would be the word "informationable," which does not appear in Merriam-Webster but is still able to convey the definition of "the ability to provide information."

Another example we often see in this category is the phenomenon of generic trademarks, or when a proprietary trademark starts being used as a descriptor for the product or service as a whole.

It shows up most often as an OOV when people are modifying a brand name into a verb such as, "I Craigslisted my old phone."

6. Non-English words – 1%

A small selection of the words were non-English words. For the majority of cases, we don’t include other languages in our dictionary, with the exception of those that are commonly used by English speakers.

An example of this would be the word "en" from the phrase "en route," which originates from the French language, but is used commonly in English conversation.

This has been our policy up until now, but after reading a compelling article about the issues associated with this approach, we’ll be looking to more equitably support loanwords from different languages and cultures in our products.

How else can we identify OOVs?

Language is awesome – and tricky. The method described above isn’t even close to being enough to identify all the jargon, initialisms and names being constantly used in conversations.

We’re processing millions of minutes of conversations, while only a minuscule fraction are sent out for transcription.

To try and mitigate this, we offer our customers the option to submit these words directly to us through their company dictionary so we can quickly add them to our pronunciation dictionary and start recognizing them in our transcripts within two weeks of submission.

This is by far the most effective way to help improve the accuracy of these words in your transcripts. The full process is shown below:

Image provided by Dialpad

Not only does adding submissions help improve the accuracy of your own transcripts, but it also helps other customers who use those same terms!

There’s one more major way that we’re able to identify OOVs, and that is through a process called web scraping.

Essentially, we pull data from our customers’ websites and perform analyses on the texts to separate important terms like product names or jargon from common words such as "the."

This has been a hugely successful approach as it provides a more holistic view of terms being used by a team. While using the company dictionary is great, the person making the entries might use different words than someone on a different team or in a different department.

Pairing the dictionary with webscraped data allows us to have a more complete list of terms likely to show up in our users’ transcripts!
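The scrape-then-pair idea can be sketched in a few lines. This is a simplified assumption about the analysis, not the actual method: it takes page text that has already been downloaded, drops common words and words already in the vocabulary, keeps terms that repeat, and merges the result with company dictionary entries. The stop-word list, the `min_count` threshold, and both function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
COMMON_WORDS = {"the", "a", "an", "and", "of", "to", "for", "with", "our", "your", "is", "in", "on"}


def candidate_terms(page_text, vocabulary, min_count=2):
    """Return repeated words on a page that are neither common nor already known."""
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", page_text)
    counts = Counter(token.lower() for token in tokens)
    return {
        term
        for term, n in counts.items()
        if n >= min_count and term not in COMMON_WORDS and term not in vocabulary
    }


def merged_term_list(company_dictionary, scraped_terms):
    """Pair customer-submitted entries with scraped terms, without duplicates."""
    return sorted({entry.lower() for entry in company_dictionary} | set(scraped_terms))


page = "Spiceworks helps IT teams. Try Spiceworks with the helpdesk and the helpdesk app."
known = {"helps", "it", "teams", "try", "app"}
scraped = candidate_terms(page, known)
print(merged_term_list(["Deliveroo"], scraped))
```

Raw frequency is a blunt filter. It is used here only to show the shape of the idea: scraped text surfaces terms that the person filling in the company dictionary may not have thought to submit.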

Welcome to the biggest language class of all time

We’re always on the lookout for new sources of keyphrases to include in our models and additional ways to provide more accurate transcripts.

And our journey to develop the most accurate transcripts for our customers’ data has been hugely helped by our customers who took the time to provide thoughtful entries in their company dictionaries.

