A tale of two models

Lessons in Language: How we built AI models that determine a call’s purpose

Pooja Hiranandani
Towards Data Science

Courtesy of Dialpad.

The ASR and NLP teams at Dialpad are constantly innovating. In fact, we recently introduced a new feature that leverages the power of AI to detect the purpose of calls so customers can better understand common interaction patterns and extract actionable insights from these patterns.

However, as we’ve discovered, phone calls are not as simple as they seem. Check out our other articles on why we built this and the data we looked at for additional context. In this piece, we’ll delve deeper into the two AI models that we developed for this specific feature, including the success we saw and challenges we overcame.

The simpler rules-based model

First, equipped with our insights from the data (see the blog mentioned above), we built a simple rules-based or heuristics model that took just a few weeks to put into production. Because Dialpad serves many businesses in many industries, we couldn’t rely on a call topic-based model — a model that is trained to recognize all the aspects of servicing vacuum cleaners would be nearly useless for scheduling interviews or gathering feedback on educational products.

We had to find something more universal.

Some of the signals the model looks at:

  • The presence or absence of certain lexical patterns that typically occur in Purpose of Call statements — patterns such as “calling because” or “reason for calling”
  • Surrounding textual context — look at what was said before, questions that were asked
  • Timing cues — pay more attention to certain portions of the call, like the beginning of the call or the section of the call immediately following customer account verification
  • Speaker category — determine whether the speaker is the person who made the call or is receiving the call

Dealing with fuzzy conversational data, the model had to allow for lexical and structural variability, be forgiving of speech recognition errors, and handle the peculiarities of spoken language, like incomplete sentences and extensive use of filler words.
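As a rough illustration, this kind of signal-combining heuristic can be sketched as follows. The patterns, weights, and threshold below are illustrative stand-ins, not Dialpad's production rules:

```python
import re

# Hypothetical lexical cues for Purpose of Call statements
# (illustrative only, not the actual rule set).
PURPOSE_PATTERNS = [
    re.compile(r"\bcalling (?:because|about|to)\b", re.IGNORECASE),
    re.compile(r"\breason for (?:my|the|this) call\b", re.IGNORECASE),
    re.compile(r"\breaching out (?:about|because)\b", re.IGNORECASE),
]

def is_purpose_candidate(utterance, speaker_is_caller, position_ratio):
    """Combine lexical, timing, and speaker signals into a yes/no decision.

    position_ratio: 0.0 = start of the call, 1.0 = end of the call.
    """
    score = 0.0
    if any(p.search(utterance) for p in PURPOSE_PATTERNS):
        score += 1.0   # lexical pattern match
    if position_ratio < 0.25:
        score += 0.5   # timing cue: early in the call
    if speaker_is_caller:
        score += 0.25  # the caller is likelier to state the purpose
    return score >= 1.0  # fire only on strong combined evidence

print(is_purpose_candidate("I'm calling because my order never arrived",
                           speaker_is_caller=True, position_ratio=0.05))
```

Note that no single signal is decisive on its own; a lexical match early in the call from the caller fires, while a stray cue word late in the call does not.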

We decided to build a simple model at first for two reasons — to gauge customer interest in the feature before investing more time in it and to cheaply assemble a training dataset for a supervised machine learning model.

As every seasoned Machine Learning Scientist can attest, the key to building a good AI model is good data. But the cost and labor of assembling a high-quality dataset can be daunting. For this model, it would’ve involved going through entire call transcripts, many of them long and some containing transcription errors, and determining the point at which the purpose of the call was stated.

It would’ve taken months and been quite tedious to collect.

But our customers were interested! They wanted this ability to detect the purposes of calls.

So, we started building a more powerful model (the next stage in feature development).

The deep learning model

The heuristics model had high precision but low recall, which means it didn’t fire as often as we would’ve liked — but when it did, it was very accurate. This is a common issue with heuristics models: rules are rather blunt, inflexible tools that can’t possibly account for the dizzying number of variables that go into making a prediction.

For instance, a rule that focuses only on, say, the start of a call to detect a call purpose will miss the times when the purpose is stated much later in the call, and a rule that matches statements will miss the cases where the purpose appears as a question.

Heuristics models can also be painful to maintain since the rules that make up the model need to be updated regularly, like when the data that the model consumes changes in content or structure. But because of their limited scope, rule-based models can be fairly accurate for the most obvious cases and are cheap to run.

This setup proved very capable of generating a rich training dataset of sections of call transcripts labeled as Purpose of Call statements, with a human in the loop validating subsets of the data to ensure quality.

The dataset was then augmented with linguistic patterns that were missed by the rules-based model. Now, we could use it to train a deep learning model that scores utterances for how likely they are to express a call purpose.

While the previous model made a binary decision (hit or miss), the new probabilistic scoring model can find the most informative call purpose statement, updating its choice in real time whenever an utterance that better represents the conversation enters the system.

Thus, the following utterance appearing at the start of a conversation is a valid call purpose:

Hello. Yes, I would like to speak to the manager about a recent delivery issue.

But there is a more informative call purpose later in the call, which the new model is able to capture:

I will explain why I am calling. Last week I got a notification that a bike I ordered, the X-5 model, will be delivered on Tuesday and I had to stay home to accept the delivery. I had to take a day off work and waited all day, but it didn’t happen. The bike only arrived on Thursday this week and in broken packaging, there are scratches everywhere. I want help with this issue.
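The real-time update described above amounts to keeping a running best candidate. Here is a minimal sketch, with a toy `score_utterance` standing in for the actual neural scorer (its length-and-keyword logic is purely illustrative):

```python
class PurposeTracker:
    """Keep the most informative call-purpose statement seen so far."""

    def __init__(self, scorer):
        self.scorer = scorer
        self.best_text = None
        self.best_score = 0.0

    def update(self, utterance):
        score = self.scorer(utterance)
        if score > self.best_score:  # replace only with a better candidate
            self.best_score = score
            self.best_text = utterance
        return self.best_text

def score_utterance(text):
    # Stand-in for the deep learning scorer: longer statements that
    # mention a concrete issue score higher (purely illustrative).
    score = min(len(text) / 200.0, 1.0)
    if "issue" in text.lower():
        score += 0.2
    return score

tracker = PurposeTracker(score_utterance)
tracker.update("Hello. Yes, I would like to speak to the manager "
               "about a recent delivery issue.")
tracker.update("I will explain why I am calling. Last week I got a "
               "notification that a bike I ordered, the X-5 model, would be "
               "delivered on Tuesday, but it only arrived on Thursday, in "
               "broken packaging, and I want help with this issue.")
```

After the second update, the tracker's best candidate is the longer, more detailed statement, mirroring the behavior in the example above.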

Another challenge the previous model was facing was detecting an implicitly stated call purpose. For example, can you guess if “Seems like I am having trouble logging in” is the reason the customer is calling — or a step in solving a different, bigger problem?

A rule to determine this would be overly complex and inflexible, but the deep learning model was able to learn these nuances from the data and score utterances accordingly.

This setup, however, had its own weakness: a deep learning model is very resource-hungry and time-consuming to run. To reduce the load, we developed simple rules that filter the data that is fed to the deep learning model so it doesn’t have to run on every section of the conversation. This filtering mechanism made the model fast and efficient.
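The filtering mechanism can be sketched as a cheap gate in front of the expensive scorer. The cue words and scores below are illustrative stand-ins, not the production filter:

```python
# Cheap, rule-based prefilter: only utterances containing a purpose-like
# cue are passed on to the (expensive) deep model.
CUE_WORDS = ("calling", "reason", "reaching out", "help with")

def prefilter(utterance):
    """Cheap gate that runs on every utterance."""
    text = utterance.lower()
    return any(cue in text for cue in CUE_WORDS)

def deep_model_score(utterance):
    # Stand-in for the neural scorer, the expensive step we want
    # to avoid running on every section of the conversation.
    return 0.9 if "calling because" in utterance.lower() else 0.1

def score_call(utterances):
    return [(u, deep_model_score(u)) for u in utterances if prefilter(u)]

calls = [
    "Hi, how are you today?",                      # filtered out cheaply
    "I'm calling because my invoice looks wrong",  # passes to the deep model
]
print(score_call(calls))
```

The deep model runs on one utterance instead of two here; on a real call, the savings compound across hundreds of utterances.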

This hybrid model, which is currently in production, has both high precision and recall, is able to capture a broader range of call purpose statements, is less dependent on the quality of a transcript, and is far easier to maintain than the rules-based model as it adapts seamlessly to changing data.

Simple and concise

Call purpose statements, especially when uttered early in the call, often serve other goals as well: greeting, establishing contact, exchanging pleasantries, and making sure the technology is working properly (Can you hear me?!), all of which can make the call purpose hard to spot. So, we crafted some rules to cut the fluff out of these statements. A statement that used to look like this:

Doing great, thank you. Good afternoon. This is Benjamin with Dialpad. Yes, I can hear you. I’m reaching out about our new product offer. Is this a good time to call?

now reads simply as “I’m reaching out about our new product offer.”
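A minimal sketch of this kind of fluff trimming, assuming illustrative regex patterns rather than our production rules:

```python
import re

# Hypothetical fluff patterns covering greetings, pleasantries, and
# channel checks (illustrative only).
FLUFF_PATTERNS = [
    r"^(doing great|good (morning|afternoon|evening)|hello|hi)\b",
    r"^this is \w+ with \w+",
    r"can you hear me",
    r"is this a good time",
    r"^(yes|yeah|thank you)\b",
]

def trim_fluff(statement):
    """Drop sentences that match a fluff pattern; keep the rest."""
    sentences = re.split(r"(?<=[.?!])\s+", statement)
    kept = [
        s for s in sentences
        if not any(re.search(p, s.strip(), re.IGNORECASE)
                   for p in FLUFF_PATTERNS)
    ]
    return " ".join(kept)

print(trim_fluff(
    "Doing great, thank you. Good afternoon. This is Benjamin with Dialpad. "
    "Yes, I can hear you. I'm reaching out about our new product offer. "
    "Is this a good time to call?"
))
```

Splitting on sentence boundaries first keeps the rules simple: each rule only has to recognize one kind of fluff sentence, and everything that survives is rejoined in its original order.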

Special thank you to my colleague, Elena Khasanova.
