Building an Effective FAQ with Knowledge Bases, BERT, and Sentence Clustering

How to identify and expose the knowledge that matters

Manu Cohen-Yashar
Towards Data Science


A successful business is almost always driven by high-quality knowledge and expertise. Modern organizations expose their knowledge through conversational interfaces such as bots and expert systems, so that customers, partners, and employees have immediate access to the knowledge that drives success. We, data scientists and engineers, are responsible for making that happen. We need to answer a simple question: how do you represent business knowledge so that it is easy and simple to consume? There are many approaches and possible strategies for exposing knowledge. In this article I want to dig into the good old Frequently Asked Questions system and discuss how to implement it with the latest AI technologies.

In the prehistoric era, websites used to have one FAQ page with a long and tedious list of useless questions. Only real optimists would ever search this list for a possible remedy to an issue they faced. Those years are gone. Today, modern websites have an integrated bot that users can ask questions. At first those bots were pretty lousy, so users were reluctant to use them, but as the technology (and the AI) behind conversational interfaces improves, users now perceive bots and search as the primary interfaces for asking questions.

Knowledge Bases

Azure QnA Maker and Google Dialogflow Knowledge Base are two examples of AI services you can use to build a bot that answers questions from a list of question-answer pairs. These services match a user question to the most appropriate entry in the list and respond with the right answer. You can also organize questions in a hierarchy and build a conversation flow that guides the system to the right QnA pair. This is great, but at the core there is still a list of questions and answers, a list that someone needs to build and maintain. Such a list cannot be too long, because long lists are impossible to maintain, especially as modern business is dynamic and the right answer changes all the time. Obviously, the list cannot be too short either if we want customers to find it useful and get effective answers. So how do we find the balance and maintain the questions that matter?
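To make this concrete, here is a minimal sketch of querying a published Azure QnA Maker knowledge base over its REST runtime. The host, knowledge base id, and endpoint key are placeholders you receive when you publish the KB, not real values.

import requests

RUNTIME_HOST = "https://my-qna-resource.azurewebsites.net"  # placeholder
KB_ID = "my-knowledge-base-id"                              # placeholder
ENDPOINT_KEY = "my-endpoint-key"                            # placeholder

def ask_knowledge_base(question, top=3):
    # generateAnswer returns the best-matching QnA pairs with confidence scores
    url = f"{RUNTIME_HOST}/qnamaker/knowledgebases/{KB_ID}/generateAnswer"
    headers = {
        "Authorization": f"EndpointKey {ENDPOINT_KEY}",
        "Content-Type": "application/json",
    }
    response = requests.post(url, headers=headers,
                             json={"question": question, "top": top})
    response.raise_for_status()
    return response.json()["answers"]  # each entry has an 'answer' text and a 'score'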

NLU Question Answering

Question answering is one of the core tasks NLU networks are designed to solve. In the last few years, the NLU space was revolutionized by the introduction of the transformer and BERT-like architectures. For more details, read my article about BERT and its implementations. Today, with projects like the huggingface/transformers question-answering pipeline (see the sketch after the list below), we can build QnA systems that find a sentence in an article that can be considered an answer to a question, with some level of confidence. This is great, but there are two basic challenges.
1) BERT-like networks have a fixed (and limited) input size, so they are not designed to handle very long text (i.e. context) such as books or even long articles.
2) The answer is always a sentence from the original text. We all know from day-to-day experience that often a good answer is a combination of sentences from multiple places in the original text, or even new text that does not appear in it at all. Such an answer should be constructed from the ideas expressed in the original text. As of today, NLU question answering networks cannot deliver that, though this is an active area of research. For more details, search for the ELI5 research.
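As a quick illustration of the huggingface/transformers pipeline mentioned above, here is a minimal sketch that extracts an answer span from a short context; the question and context are made-up examples, and a default extractive QA model is downloaded on first use.

from transformers import pipeline

qa = pipeline("question-answering")  # downloads a default extractive QA model

result = qa(
    question="What do knowledge base services respond with?",
    context="Knowledge bases such as Azure QnA Maker and Google Dialogflow "
            "match user questions to the most appropriate entry in a list of "
            "question-answer pairs and respond with the right answer.",
)
print(result["answer"], result["score"])  # the extracted span and its confidence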
There are partial solutions to the first challenge. At a high level, to answer a question from a large corpus of text, we place a search index in front of the NLU network. In other words, we first index the documents with a search engine, and when a user submits a question we use that engine to retrieve some (short) candidate pieces of text, on which we then run the BERT network. Finally, we return the answer with the best confidence score. There are many such implementations out there; I would like to recommend the one created by train.
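Here is a minimal sketch of that retrieve-then-read pattern, with a simple TF-IDF index standing in for the search engine; a production system would use a proper search service, but the flow is the same.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

def answer_from_corpus(question, documents, top_k=3):
    # stage 1: rank the documents against the question with a TF-IDF index
    vectorizer = TfidfVectorizer().fit(documents)
    doc_vectors = vectorizer.transform(documents)
    scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    candidates = scores.argsort()[::-1][:top_k]

    # stage 2: run extractive QA on each candidate and keep the best answer
    qa = pipeline("question-answering")
    answers = [qa(question=question, context=documents[i]) for i in candidates]
    return max(answers, key=lambda a: a["score"])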

FAQ System Architecture

The immediate conclusion is that no single approach is perfect, so you will probably need both to develop your FAQ application. First, you work with your domain experts to develop a list of question-answer pairs, assuming those are the “frequently” asked questions and thus cover “most” of your customers’ needs. The answers will be deep and comprehensive because human experts developed them. Using one of the knowledge base technologies mentioned above, you implement the first layer of the system: any question the user submits is first served by the knowledge base. Hopefully your assumption is right and most questions are successfully handled here.

Obviously, there will be questions that have no answer in the list. Those continue to the next layer, which is based on the NLU approach. You collect and index all the relevant documents, manuals, and articles that might contain answers to your customers’ questions. When a question arrives, the system finds candidate articles and executes BERT-like question answering to come up with an answer. The second layer’s execution time will be much longer than the first layer’s, and its answers will be less accurate, but chances are that many of them will be good enough to satisfy the end user. Hopefully, that leaves the percentage of unanswered questions pretty small.
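A minimal sketch of that routing logic could look like the following; the two helper functions are the hypothetical ones sketched earlier, and the confidence threshold is an assumption you would tune on your own traffic (note that different services report scores on different scales).

CONFIDENCE_THRESHOLD = 0.5  # assumed cut-off; tune per service and traffic

def answer_question(question, documents):
    # layer 1: the curated knowledge base (fast, accurate, limited coverage)
    kb_answers = ask_knowledge_base(question)  # hypothetical helper from above
    if kb_answers and kb_answers[0]["score"] >= CONFIDENCE_THRESHOLD:
        return {"answer": kb_answers[0]["answer"], "layer": "knowledge-base"}

    # layer 2: NLU question answering over indexed documents (slower fallback)
    nlu_answer = answer_from_corpus(question, documents)  # hypothetical helper
    return {"answer": nlu_answer["answer"], "layer": "nlu"}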

Logging is the key

As we have seen, the system is built on assumptions, and you want to validate those assumptions constantly. To do this, you must log every question that comes in. These logs will give you precious insights into what your customers are really looking for and, equally important, will validate your assumptions. You will see how questions were answered and by which layer, and maybe even collect some sentiment from the end user.
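A minimal sketch of such a log, assuming a simple JSON-lines file; the fields are illustrative, not a required schema.

import json
from datetime import datetime, timezone

def log_question(question, layer, answer, score, log_path="questions.log"):
    # one JSON object per line keeps the log trivial to parse later
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "layer": layer,    # which layer answered: knowledge-base / nlu / none
        "answer": answer,
        "score": score,    # the answering layer's confidence, for later analysis
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")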

How to find the real FAQ

Assuming you logged it all, you now want to identify the questions that require further attention: the questions that are not present in your knowledge base and are frequent enough that you will want to invest the time to develop answers for them and add them to the knowledge base. At first you might think it is a trivial task of counting text appearances in your logs, but a second look will tell you that life is not that simple. It is amazing how many variations users can come up with just to ask the same simple question. To process large volumes you need to find all these variations and cluster them into groups that make sense. Once clustered, it is important to save the variations so you can later feed them to the knowledge base, as modern knowledge bases know how to group different variations into a single question. Still, you need to solve the problem of question clustering.

Sentence clustering has two stages:

  1. Convert the text into a numeric vector (i.e. sentence embedding) using a BERT-like language model.
  2. Cluster the vectors using the clustering algorithm of your choice.

To compute the sentence embedding, you insert your sentence into a BERT-like network and pool its output, for example by taking the [CLS] token or the mean of all token vectors. Fortunately, the SentenceTransformer project does all the heavy lifting, so you can accomplish this complex task in two lines of code.
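Here are those two lines (plus the import), assuming a list of raw question strings:

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('bert-base-nli-mean-tokens')
embeddings = embedder.encode(["How do I reset my password?", "I forgot my password"])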

The second stage should be straightforward to anyone with a little ML experience. I am using KMeans from sklearn.cluster.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def RunClustering(userQuestions, num_clusters):
    # extract the text line from the userQuestions object
    corpus = [item['QuestionText'] for item in userQuestions]

    # convert the text into a numeric vector (i.e. sentence embedding)
    # using a BERT-like language model
    embedder = SentenceTransformer('bert-base-nli-mean-tokens')
    corpus_embeddings = embedder.encode(corpus)

    # cluster the vectors
    clustering_model = KMeans(n_clusters=num_clusters)
    clustering_model.fit(corpus_embeddings)
    cluster_assignment = clustering_model.labels_

    # organize the questions by cluster
    clustered_sentences = [[] for _ in range(num_clusters)]
    for sentence_id, cluster_id in enumerate(cluster_assignment):
        clustered_sentences[cluster_id].append(userQuestions[sentence_id])

    return [{"cluster": i, "questions": cluster}
            for i, cluster in enumerate(clustered_sentences)]
Results from question clustering

Now you can manually analyze each cluster and identify the questions to work on. Optionally, you can apply further clustering and similarity search to the vectors in each cluster using libraries like Faiss. I found that the first level of clustering was enough, but maybe your scenario will require further granularity.
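If your scenario does require that granularity, a minimal Faiss sketch over one cluster's embeddings could look like this; it assumes the embeddings from the clustering step are available as a NumPy array.

import numpy as np
import faiss

def similar_questions(cluster_embeddings, query_embedding, k=5):
    # exact L2 index; fine for the few thousand vectors a cluster typically holds
    embeddings = np.asarray(cluster_embeddings, dtype="float32")
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    query = np.asarray([query_embedding], dtype="float32")
    distances, ids = index.search(query, k)
    return ids[0], distances[0]  # the k nearest questions and their distances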

Conclusion

Knowledge bases are a great tool to support your FAQ bot, but the list of question-answer pairs behind them can never cover all questions, and it must be constantly maintained. NLU question answering can fill the gap, and with sentence clustering you can identify the questions that matter.
I hope this article helps you build your next successful FAQ system.

