Conversational AI with BERT Made Easy

Plug and play with transformers using Rasa and Huggingface

Brandon Janes
Towards Data Science

--

For over a year I have been trying to automate appointment scheduling using NLP and Python, and I finally got it to work thanks to an amazing free, open-source dialogue tool called Rasa.

The truth is that a year ago, when I started this project, the tools I am using now were hardly available, and certainly not in the easy-to-use form we find them in today. As a testament to how quickly the science of artificial intelligence and NLP is moving, this time last year (June 2019) transformers like BERT had barely left the domain of academic research and were just beginning to appear in production at tech giants like Google and Facebook. The BERT paper itself was only published in October 2018 (links to papers below).

Today, thanks to open-source platforms like Rasa and HuggingFace, BERT and other transformer architectures are available in an easy, plug-and-play manner. Moreover, the data scientists at Rasa have developed a special transformer-based classifier called the Dual Intent and Entity Transformer, or DIET classifier, that is tailor-made for extracting entities and classifying intents simultaneously, which is exactly what we do with our product MyTurn, an appointment-scheduling virtual assistant developed with my team at Kunan S.A. in Argentina.

DIET architecture diagram (image|Rasa)

The proof is in the p̶u̶d̶d̶i̶n̶g̶ F1 score

For the quantitative folks out there: after I replaced ye ole Sklearn-based classifier with DIET, my F1 scores for both entity extraction and intent classification surged by more than 30 percent! Compare SklearnIntentClassifier with diet_BERT_combined in the figures below. If you have ever designed machine learning models, you know that 30 percent is huge. It's like that moment you realize you had been parking your car on the sprinkler hose. What a wonderful surprise when things work the way they should!

Model evaluations of four machine learning models (image|bubjanes w/ streamlit)
Intent classification raw numbers (image|bubjanes)

The blueprint for intelligence: config.yml

Is it science or art? It’s neither. It’s trying every possible combination of hyperparameters and choosing the configuration that gives you the highest metrics. It’s called grid search.
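To give a concrete sketch of what that looks like with Rasa's tooling: the NLU test command can train several candidate configurations on the same data and report precision, recall and F1 for each, which is the kind of comparison behind the figures above. The config file names below are placeholders for whatever pipelines you want to pit against each other, and the exact flags may vary slightly between Rasa versions (check rasa test nlu --help).

# compare candidate pipelines on the same NLU training data (3 runs each)
rasa test nlu --nlu data/nlu.md \
  --config configs/config_sklearn.yml configs/config_diet_BERT_combined.yml \
  --runs 3

# or evaluate a single pipeline with 5-fold cross-validation
rasa test nlu --nlu data/nlu.md --config config.yml --cross-validation --folds 5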

It’s important to note that BERT is not a magic pill. In fact, for my specific task and training data, the BERT pretrained word embeddings alone did not produce good results (see diet_BERT_only above); they were significantly worse than the old Sklearn-based classifier of 2019. Perhaps this can be explained by the regional jargon and colloquialisms found in the informal Spanish chats from Córdoba, Argentina, where our training data comes from. The multilingual pretrained BERT embeddings we used were “trained on cased text in the top 104 languages with the largest Wikipedias,” according to HuggingFace documentation.

However, the highest-performing model we obtained came from training custom features on our own Córdoba data with DIET and then combining those supervised embeddings with the BERT pretrained embeddings in a feed-forward layer (see results for diet_BERT_combined above). The diagram below shows how “sparse features,” trained on the Córdoba data, are combined with the BERT “pretrained embeddings” in a feed-forward layer. This option is ideal for Spanish-language projects with little training data. That said, the combined model performed only slightly better than the model that used DIET with no BERT pretrained embeddings at all (see results for diet_without_BERT above), which means that for non-English chatbots with a moderate amount of training data, the DIET architecture alone is probably all you need.

Diagram of custom supervised word embeddings combined with BERT pretrained word embeddings in a feed-forward layer (image|Rasa)
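In config.yml terms, that combination is just a matter of stacking featurizers in the NLU pipeline: HFTransformersNLP and LanguageModelFeaturizer contribute the dense pretrained BERT embeddings, CountVectorsFeaturizer contributes the sparse features learned from your own data, and DIETClassifier consumes both. Here is a minimal sketch of that part of the pipeline, assuming a Rasa 1.10-style setup; the n-gram settings are illustrative, not our production values.

pipeline:
  - name: HFTransformersNLP              # loads the pretrained BERT model
    model_name: "bert"
    model_weights: "bert-base-multilingual-cased"
  - name: LanguageModelTokenizer         # tokenizes with the BERT tokenizer
  - name: LanguageModelFeaturizer        # dense pretrained BERT embeddings
  - name: CountVectorsFeaturizer         # sparse features learned from our own data
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"                  # character n-grams help with typos and slang
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier                 # combines sparse and dense features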

Plug and play, for realz

After installing Rasa and building an assistant to suit your needs (I suggest watching Rasa’s YouTube tutorial before doing this), implementing the BERT embeddings is so easy it is almost disappointing.

Below is an example of the configuration file we used. The long list of hyperparameters may seem overwhelming, but trust me, this is much easier than it was a year ago.

Available hyperparameters from Rasa (image|bubjanes)
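Since a screenshot does not copy-paste well, here is a hand-typed sketch of the same kind of file. It is not a reproduction of our exact production values, and the DIETClassifier hyperparameter names follow Rasa 1.10's documentation, so double-check them against your own version.

language: es

pipeline:
  - name: HFTransformersNLP
    model_name: "bert"
    model_weights: "bert-base-multilingual-cased"
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
    intent_classification: True
    entity_recognition: True
    use_masked_language_model: True
    number_of_transformer_layers: 2
    transformer_size: 256
    epochs: 200
    batch_size: [64, 256]
    random_seed: 42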

First, you must install the BERT dependencies:

pip install "rasa[transformers]"

To integrate BERT or any of the other pretrained models available on the HuggingFace website, just replace the model_weights hyperparameter in the lines below with whatever pretrained embeddings you want to use.

- name: HFTransformersNLP
  model_weights: "bert-base-multilingual-cased"
  model_name: "bert"

We used bert-base-multilingual-cased because it was the best model available for Spanish.
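If you wanted to try a lighter model, you would change only those two lines. For example, a hypothetical swap to DistilBERT's multilingual weights (distilbert-base-multilingual-cased on the HuggingFace hub) might look like this, assuming your Rasa version lists distilbert among the supported model_name values:

- name: HFTransformersNLP
  model_weights: "distilbert-base-multilingual-cased"
  model_name: "distilbert"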

See our GitHub for full examples of the configuration files mentioned in this article and additional links.

Conclusion

The beauty of Rasa is that it streamlines model training for Natural Language Understanding (NLU), Named Entity Recognition (NER) and Dialogue Management (DM), the three essential components of a task-oriented dialogue system. Although we did a lot of good programming to make our system work as well as it does, you could probably get through about 80 percent of building a Rasa virtual assistant without any real Python skills.

With exciting advancements in NLP, such as transformers and pretrained word embeddings, conversational AI has leaped forward in recent years: from bots that say, “Sorry, I don’t understand,” to virtual assistants that are genuinely feasible solutions for daily tasks that once required tedious human work.

Philosophically, the goal of this technology is not to replace humans with robots, but rather to assign the repetitive and “robotic” daily tasks, such as data entry or appointment scheduling, to virtual assistants, and reserve the brainspace of humans for the types of work that require skills that only humans have, such as creativity and critical thinking. MyTurn is a simple but prescient example of how conversational AI is not a tool reserved for Big Tech companies, but is in fact accessible to everybody through free and open-source technologies like Rasa and HuggingFace.

Suggested readings:

Tom Bocklisch, Joey Faulkner, Nick Pawlowski, Alan Nichol, "Rasa: Open Source Language Understanding and Dialogue Management," 15 December 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, et al., "Attention Is All You Need," 6 December 2017

Jianfeng Gao (Microsoft), Michael Galley (Microsoft), Lihong Li (Google), "Neural Approaches to Conversational AI: Question Answering, Task-Oriented Dialogues and Social Chatbots," 10 September 2019

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (Google AI Language), "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," 11 October 2018


Freedom-of-information journalist turned data scientist, interested in bringing data to life through machine learning and Python.