Building a Game of Thrones chatbot for Slack: Part 1, Understanding Language

Lessons learned applying deep learning to natural language understanding and incorporating question answering

Isaac Godfried
Towards Data Science


This past summer I decided to test my NLP skills and undertake building a chatbot. As a big Game of Thrones fan, I settled on creating a Game of Thrones chatbot. Initially, my goal was just to provide a way to get various types of GOT news easily from places like Reddit, Watchers on the Wall, Los Siete Reinos, and Twitter. However, I quickly decided to expand into other tasks, and I became particularly intrigued by how to integrate modern NLP approaches to enrich my chatbot. This first article covers the natural language understanding and question answering components; the second article will cover the platform and architecture.

If you want to use the chatbot, you can join the Citadel community on Slack (though, as I will describe, I haven't yet added all of the features discussed here to the production version). In the future I plan on adding support for installing it on your own workspace. Also be warned that some of the examples in this article and the content in the bot itself contain information through season 7.

Natural Language Understanding (NLU)

Deep learning for NLU: An Introduction

Where and how to integrate deep learning into a chatbot is actually a somewhat tricky question. On a fundamental level, with chatbots you can use rule based methods, machine learning, or some combination of the two. With rule based methods you generally have a guarantee that the chatbot will respond properly to user queries, as long as users write in a formulaic and limited way. With machine learning (or even statistical NLP methods) you can break out of the rigid formulas and allow users to type more naturally. However, by doing this you also introduce uncertainty (even with the best models). Moreover, even SOTA models generally only work for limited types of dialogue. For instance, a model trained for goal oriented dialogue will typically break down if the user begins to engage in chit-chat. Deep learning of course also requires data, and in this situation we often have a cold-start problem. Therefore, you often need to write sample data of what you think users will ask. This process is both time consuming and often inaccurate.

The formulaic way

Before looking at why we might need machine learning based models, let's look at some of the limitations of using rule based methods. With the formulaic approach we might write the chatbot code something like this:
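A minimal sketch of what that rule based dispatch could look like (the helper functions get_random_quote and get_latest_news are hypothetical placeholders standing in for database lookups, not the bot's actual code):

def handle_message(text):
    # Purely rule based: the user must type the command exactly.
    if text.startswith("Quote "):
        character = text[len("Quote "):]      # e.g. "Quote Jon Snow"
        return get_random_quote(character)    # hypothetical DB lookup
    elif text.startswith("News "):
        source = text[len("News "):]          # e.g. "News Reddit"
        return get_latest_news(source)        # hypothetical DB lookup
    else:
        return "Sorry, I don't understand that command."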

Note this is purposefully simplified. In my actual bot, for my rule based methods I usually use a dict to map words to actions to avoid long if statements like these. Also, you would obviously have to handle the DB operations, etc.
Example user requests and responses using the formulaic approach

Now users have to write in a very formulaic way to use the chatbot. They have to write exactly Quote Jon Snow or News Reddit. This is fine if you only want to include simple queries. But what if we want to support phrases like Quote about Jon Snow or even Quote from Jon Snow when he is talking about Stannis? Yes, we could force users to phrase these formulaically too, but that quickly becomes complicated and burdensome for users. Similarly with news, supporting complex queries like Get news from the past 24 hours on Reddit or even News about Jon Snow in Season 8 becomes arduous at best and impossible at worst.

Slot filling and intent detection

This brings us to machine learning for slot filling and intent detection. A key area for using deep learning in chatbots is to automatically take a user's input string and map the relevant tokens to slots for an API (this is known as slot filling or SLU). A related area is intent detection, which focuses on mapping the utterance to an intent. Here are two examples of SLU-annotated utterances:

Quote  about  Jon                Snow
O      O      B-focus_character  I-focus_character
Intent: Quote

Quote  from  Jon        Snow       when  he  is  talking  about  Stannis
O      O     B-speaker  I-speaker  O     O   O   O        O      B-focus_character
Intent: Quote

Slot filling is, in a sense, a more fine-grained version of named entity recognition (NER). For instance, in a pure NER setting Jon Snow might always have the label character, whereas for slot filling the label changes based on the slot he should occupy. The annotation format is called IOB, which stands for inside-outside-beginning; it marks which tokens chunk together.
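To make the format concrete, here is a rough sketch of how IOB tags can be collapsed back into slot values (the function name is mine, used only for illustration; it mirrors the combine/normalize step that appears in the API code below):

def iob_to_entities(tokens, tags):
    """Collapse parallel token/IOB-tag lists into {slot_name: "value"} pairs."""
    entities, current_slot, current_tokens = {}, None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                      # a new chunk begins
            if current_slot:
                entities[current_slot] = " ".join(current_tokens)
            current_slot, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_slot:   # the chunk continues
            current_tokens.append(token)
        else:                                         # "O" closes any open chunk
            if current_slot:
                entities[current_slot] = " ".join(current_tokens)
            current_slot, current_tokens = None, []
    if current_slot:
        entities[current_slot] = " ".join(current_tokens)
    return entities

# iob_to_entities(["Quote", "about", "Jon", "Snow"],
#                 ["O", "O", "B-focus_character", "I-focus_character"])
# -> {"focus_character": "Jon Snow"}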

As the bot's response will depend on both the slots and the user's goal, many papers focus on joint slot filling and intent detection. Additionally, many NLU libraries, such as the Rasa NLU framework, provide joint slot filling and intent detection.
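To make the "joint" part concrete, here is a bare-bones sketch of such a model: a shared BiLSTM encoder with one head tagging each token with a slot label and another head classifying the utterance's intent. This is a generic illustration of the idea, not Rasa's or any particular paper's architecture:

import torch
import torch.nn as nn

class JointSlotIntentModel(nn.Module):
    def __init__(self, vocab_size, num_slots, num_intents,
                 emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.slot_head = nn.Linear(2 * hidden_dim, num_slots)      # per-token IOB tags
        self.intent_head = nn.Linear(2 * hidden_dim, num_intents)  # one label per utterance

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)                   # (batch, seq, emb)
        outputs, _ = self.encoder(embedded)                    # (batch, seq, 2*hidden)
        slot_logits = self.slot_head(outputs)                  # (batch, seq, num_slots)
        intent_logits = self.intent_head(outputs.mean(dim=1))  # pooled utterance vector
        return slot_logits, intent_logits

Training minimizes the sum of a token-level cross-entropy loss (slots) and an utterance-level cross-entropy loss (intent), so both tasks share the same encoder.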

Once we have the slots filled we still need to construct the actual query. Query construction will depend on how your database is set up, so in most cases you write this code manually yourself. There are some models that learn a direct mapping from the utterance to a SQL query, but the vast majority of the time you will have an existing API or want to construct one. So let's look at how you might turn this into a simple API request:

import requests

def process_user_text(user_text, model):
    # Let's assume model.predict() returns intent: str and ents: dict
    # (e.g. {"focus_character": "Jon Snow"})
    intent, ents = model.predict(user_text)
    # Assume this function combines multi-token ents and
    # normalizes them to how they appear in the database
    ents = combine_normalize_ents(ents)
    url = "https://random_site.com/" + intent
    return requests.post(url, data=ents)
Note: although this code resembles the code in the GOT-Bot APIs, I have not personally tested it. I plan on doing so in the next few days, but if you run into errors in the interim let me know.

Now we could use a simple Flask API to handle these requests.
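For instance, something along these lines (a minimal sketch; the routes and the find_quote/find_news database helpers are placeholders rather than the GOT-Bot code, and I'm assuming lower-cased intent names are used in the URL):

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/quote", methods=["POST"])
def quote():
    # The filled slots arrive as form data from process_user_text()
    character = request.form.get("focus_character")
    return jsonify({"quote": find_quote(character)})          # hypothetical DB lookup

@app.route("/news", methods=["POST"])
def news():
    source = request.form.get("source")
    character = request.form.get("character")
    stories = find_news(source=source, character=character)   # hypothetical DB lookup
    return jsonify({"stories": stories})

if __name__ == "__main__":
    app.run()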

Returning to our previous news example, we would label data in the following format to work with the API:

News  about  Jon          Snow         in  Season    8
O     O      B-character  I-character  O   B-season  I-season
Intent: News

As you can see, this format makes it much easier to construct API calls and SQL queries. Now we could define a function to serve the request on the API side (assuming we had already run NER over the news stories and tagged them).
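A sketch of what that function might look like, assuming the stories live in a SQLite table with character, season, and source tags (the table and column names are invented for illustration):

import sqlite3

db = sqlite3.connect("got_bot.db")  # hypothetical database

def find_news(character=None, season=None, source=None, limit=20):
    """Return tagged news stories matching whichever slots were filled."""
    query = "SELECT title, url, published FROM news_stories WHERE 1=1"
    params = []
    if character:
        query += " AND tagged_characters LIKE ?"
        params.append(f"%{character}%")
    if season:
        query += " AND season = ?"
        params.append(season)
    if source:
        query += " AND source = ?"
        params.append(source)
    query += " ORDER BY published DESC LIMIT ?"
    params.append(limit)
    return db.execute(query, params).fetchall()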

Limited data scenarios

The problem with this approach, and with deep learning in general, is the need for large amounts of labeled training data. One approach that I'm currently researching is the use of meta-learning on many annotated dialogue datasets in order to enable the model to rapidly adapt from just a few examples.

Slot alignment is another interesting (although somewhat limited) approach. Towards Zero-Shot Frame Semantic Parsing for Domain Scaling, a 2017 article by Google researchers, described using the names of the slots and/or documentation of the slots in the API to effectively perform zero-shot slot filling. The idea is that if the model was already trained on booking an airline ticket then it should also be able to book a bus, as the slots should generally overlap (i.e., both would have a start_city and destination_city). Taking this idea a step further, a restaurant based dialogue system might have a restaurant_city (i.e., book me a restaurant in Chicago) and a hotel system might have a hotel_city. By exploiting the similar semantics between phrases, a model could learn to fill restaurant_city effectively even though it was only trained on airline booking data. Of course this approach also has limitations: (1) it cannot work on drastically different domains with little to no overlap; (2) in some cases there can actually be negative transfer (e.g., it performed worse on taxi booking; it confused drop_off and pickup_spot because these are context dependent; even though these could align with start_city and destination_city, their representations are not similar). For my use case this approach would likely not work, as there are few overlapping semantic slots between the large public slot filling datasets and my GOT chatbot.
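As a toy illustration of the underlying intuition only (this shows the idea of matching slots by the similarity of their names, not the architecture from the paper), assume we have some embed() function that maps a short string to a vector, e.g. averaged word embeddings or a sentence encoder:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def align_slot(new_slot, known_slots, embed):
    """Map an unseen slot name to the most semantically similar known slot."""
    new_vec = embed(new_slot.replace("_", " "))
    scored = [(cosine(new_vec, embed(slot.replace("_", " "))), slot)
              for slot in known_slots]
    return max(scored)[1]

# align_slot("restaurant_city", ["start_city", "destination_city"], embed)
# returns whichever known city slot has the closest name embedding, so a model
# trained only on flight booking could reuse that slot for restaurants.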

Beyond joint slot filling and intent detection models

But even joint NLU models have their limitations, as they do not use context. For instance, suppose the user stated Quote from Robert Baratheon and then said Get me another quote from him. In this scenario the NLU models previously described won't know what to do, as they do not use conversation history. Similarly, a user might ask Who is Jon Snow's mother? and the bot would (hopefully) return Lyanna Stark; but if the user then asked When did she run off with Rhaegar?, the model would likely not even cast her to a slot. There might also be times when we need to update or ask for additional information about certain slots. For instance, if the user asked for News from the past 24 hours about Season 8 but the API required the news source to be specified, the bot might reply From what source? Or alternatively, if the user stated Get the scene from episode 2, the bot might reply From what season?

End-to-end dialogue models should be able to handle these tasks. One challenge created to measure progress at this task is the Dialogue State Tracking Challenge. In particular DSTC2, the second version of the challenge, measured how well models could issue and update API calls and request additional information from the user when needed. One of the first models to do well on this challenge was the Memory Network adapted for goal oriented dialogue by researchers from Facebook in the paper Learning End-to-End Goal-Oriented Dialog. They showed that Memory Networks outperformed other machine learning methods by large margins.

More recently there have been papers like Mem2Seq that actively incorporate the dialogue history along with the knowledge base and use both in response generation. Specifically, Mem2Seq has two parts: a memory encoder, which encodes the dialogue history, and a decoder, which uses the encoded dialogue/KB to generate a response to the user. Mem2Seq achieved SOTA results on the DSTC2 challenge, bAbI, and the Stanford in-car dataset.

The architecture of Mem2Seq. Notice how both the dialogue history and the knowledge base are encoded and utilized at each turn.

Actually training Mem2Seq for GOT-Bot requires three things: a knowledge base, annotated intents, and slot-annotated dialogue histories. This makes it harder to adapt to GOT-Bot, as the KB needs to be converted into triplets such as (person, person2, relation).
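Converting existing relational character data into that triplet form could look something like this (a sketch; the row fields are made up, and I follow the (person, person2, relation) ordering mentioned above):

def character_rows_to_triplets(rows):
    """Turn relational character rows into (person, person2, relation) triplets."""
    triplets = []
    for row in rows:
        name = row["name"]
        for relation in ("father", "mother", "spouse", "house"):
            if row.get(relation):
                triplets.append((name, row[relation], relation))
    return triplets

# character_rows_to_triplets([{"name": "Jon Snow", "mother": "Lyanna Stark"}])
# -> [("Jon Snow", "Lyanna Stark", "mother")]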

Question Answering

The line between where question answering begins and slot filling ends is often quite blurry. In research terms we usually see QA as referring to answering a question based on unstructured textual data. (It can also be based on a structured knowledge base, but in that circumstance it is particularly confusing where exactly slot filling ends and QA begins.) In the former case, this usually means searching and extracting the answer from textual data rather than figuring out what slots to fill in order to query a database. In the context of the Game of Thrones bot it means taking a user question, searching the proper indices on Elasticsearch, and then extracting the correct answer from the returned results. Before going into exactly how, let's look at the different types of questions a user might ask.

Essentially there are three categories of questions:

(1) Questions that can be answered by querying the knowledge graph.

Who has the Hound killed?

Who is Jon Snow's father?

What is the motto of house Glover?

Who was Margaery married to?

What region is Harrenhal in?

These questions all have known answers that can be found in a structured knowledge graph. The problem is that we need to turn the user query into SQL or an API request. This is similar to what we need to do with slot filling. In many cases we can actually cast this as a slot filling problem by phrasing questions as another intent. For instance:

Who  has  the  Hound              killed
O    O    O    B-focus_character  B-attribute
Intent: kb_question

Or in the case of the following question:

What  region             is  Harrenhal       in?
O     B-location_region  O   B-focus_castle  O
Intent: kb_question

We could then construct an API request in a similar fashion. However, there is an abundance of datasets that map questions to SQL, so in this instance it might make sense to use one of those datasets.
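Staying with the slot filling framing, mapping the filled slots to a knowledge-graph query might look roughly like this (the table and column names are invented purely for illustration):

def build_kb_query(intent, ents):
    """Construct a parameterized SQL query for a kb_question intent."""
    if intent != "kb_question":
        raise ValueError("Not a knowledge base question")
    if ents.get("attribute") == "killed" and "focus_character" in ents:
        return ("SELECT victim FROM kills WHERE killer = ?",
                [ents["focus_character"]])
    if "focus_castle" in ents and "location_region" in ents:
        return ("SELECT region FROM castles WHERE name = ?",
                [ents["focus_castle"]])
    raise ValueError("No query template for these slots")

# build_kb_query("kb_question", {"focus_character": "The Hound", "attribute": "killed"})
# -> ("SELECT victim FROM kills WHERE killer = ?", ["The Hound"])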

(2) Questions not in the knowledge graph but that still have known answers and can be extracted from the MediaWiki pages or other GOT sites.

How did the war of the five kings start?

What happened during the Harrenhal tourney?

What was the war of the five kings?

How did Robert's rebellion end?

Who got the Tyrells to support the Lannisters?

The most relevant datasets/models for this task are datasets like MS MARCO and TriviaQA. Although many researchers evaluate on SQuAD, in reality you will almost never be handed the exact context paragraph containing the answer. This makes models that perform well on MS MARCO ideal, as they are given a whole list of ranked results and have to extract the correct answer from them.

The QuAC dataset, or Question Answering in Context, is similar in spirit to the previously mentioned "end-to-end" dialogue models, but for question answering: it contains questions and follow-up questions that span multiple dialogue turns. Models like FlowQA can work well on this conversational QA task, as they add dialogue history to the base model.

(3) Questions where the answer is subjective or speculative and that require finding similar questions or alternatively performing multi-hop inference.

Why did Sansa trust Joffrey?

Who will survive season 8 and why?

If Robb Stark hadn't broken his marriage pact, would the Freys have betrayed him?

Who will kill Cersei?

Is Jon the prince that was promised?

These questions have no definitive answers and require either analysis or speculation. Therefore the best solution is to find similar questions that have already been answered. This can be done through the scraped Quora index. However, here we will not use a QA model but a question similarity model. Question similarity can be computed with a variety of methods. My current model in production runs a basic Elasticsearch query and then reranks the results using the Universal Sentence Encoder plus cosine similarity. In order to gather more data to improve ranking, the bot currently shows the user all of the top ten results; we can then retrain the model based on the user's choices. However, there are several problems with this approach. First, in many cases the initial Elasticsearch query does not return good questions. Second, users might pick another interesting answer that does not directly answer their question. Still, this "weak supervision" means that one can manually annotate the examples much more quickly later.
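A condensed sketch of that retrieve-then-rerank step, assuming an Elasticsearch index of the scraped Quora questions and the Universal Sentence Encoder from TensorFlow Hub (the index and field names are placeholders, and the exact Elasticsearch client call varies by client version):

import numpy as np
import tensorflow_hub as hub
from elasticsearch import Elasticsearch

es = Elasticsearch()
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def similar_questions(user_question, top_k=10):
    # Step 1: cheap lexical retrieval from the Quora index
    hits = es.search(index="quora_questions",
                     body={"query": {"match": {"question": user_question}},
                           "size": 50})["hits"]["hits"]
    candidates = [hit["_source"]["question"] for hit in hits]
    if not candidates:
        return []
    # Step 2: rerank the candidates by cosine similarity of USE embeddings
    vectors = encoder([user_question] + candidates).numpy()
    query_vec, cand_vecs = vectors[0], vectors[1:]
    sims = cand_vecs @ query_vec / (
        np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(query_vec))
    ranked = sorted(zip(sims, candidates), reverse=True)
    return [question for _, question in ranked[:top_k]]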

Example questions and bot-returned answers. In the first panel (from the left) the correct answer is returned as (1) when it should be returned as (0) (likely some type of bug, as that question seems very out of place). In panels two and three the correct answers are not even found by Elasticsearch, only related results.

Creating a good on-boarding process

Creating a good on-boarding process is also essential to getting users. Your bot needs to make an immediate positive impression or else people will move on. For this reason I decided to write a rule based conversation for on-boarding. The bot first introduces itself with a Direct Message welcoming the user to the Citadel. Throughout the on-boarding process the user's state is tracked in Redis and updated at the end of each response. Here I decided to use simple dictionaries to map the user's state to actions in order to avoid lengthy if statements.
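The state tracking and dictionary dispatch look roughly like this (a simplified sketch: the real flow has more states, and send_message stands in for posting a Slack DM):

import redis

r = redis.Redis()

def send_message(user_id, text):
    # Placeholder: in the real bot this posts a Slack direct message
    print(f"to {user_id}: {text}")

def welcome_step(user_id, text):
    send_message(user_id, "Welcome to the Citadel! Would you like me to show you around?")
    return "awaiting_tour_answer"

def tour_answer_step(user_id, text):
    if "yes" in text.lower():   # also catches "yes please", "yes thanks", ...
        send_message(user_id, "Great! Try asking me for a quote or the latest news.")
    else:
        send_message(user_id, "No problem, just type help whenever you need me.")
    return "onboarded"

# Map each on-boarding state to a handler instead of a long if/elif chain
STATE_HANDLERS = {"new_user": welcome_step,
                  "awaiting_tour_answer": tour_answer_step}

def handle_onboarding(user_id, text):
    state = (r.get(f"state:{user_id}") or b"new_user").decode()
    next_state = STATE_HANDLERS.get(state, welcome_step)(user_id, text)
    r.set(f"state:{user_id}", next_state)   # persist the user's new state in Redis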

The on-boarding process aims to get users acquainted with the basic features of the bot in a fun and friendly manner.

Maester bot message to new users

One problem with manually defined rules is that if the user says something unexpected, or even slightly different from what you hard-coded, the bot will fail. I found this out the hard way when I accidentally let a bug slip past my unit tests and my manual tests. I was expecting users to respond yes to the question Would you like me to show you around the Citadel? However, users often responded with things like yes thanks or yes please, a really simple case that I did not catch. That is why I recommend having a variety of people beta-test your chatbot, because you may inadvertently miss some things.

Response generation

I didn't talk much about the actual response generation in this article. For the most part it is done by recombining the outputs of the previously described components with basic phrases. There are of course many more sophisticated methods for generating responses that are unique and change over time. However, right now I'm still using simple phrases to combine the results of the NLU calls from the API.
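Concretely, those "basic phrases" are just templates that wrap whatever the API returns, along these lines (the wording is illustrative, not the bot's exact phrasing):

import random

QUOTE_TEMPLATES = [
    '{character} once said: "{quote}"',
    'Here is a {character} quote for you: "{quote}"',
]

def render_quote_response(character, quote):
    # Pick a template at random so repeated requests feel a little less robotic
    return random.choice(QUOTE_TEMPLATES).format(character=character, quote=quote)

# render_quote_response("Jon Snow", quote_text_from_the_api)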

What about chit-chat and non-goal oriented interactions?

This is an area I have not researched that much, but I hope to dive into it in subsequent additions. Essentially, this is when the user doesn't want to accomplish a specific task but just wants to chat about elements of Game of Thrones in general and hear witty/interesting responses from the bot.

The Current State of the Bot and future improvements

Currently, the chatbot is still in the formulaic state. I haven't been able to annotate enough training data or incorporate meta-/unsupervised learning effectively enough to make slot filling perform consistently. However, my trained models are getting better and I'm hoping to roll out an update soon that incorporates them. I'm also looking at training Mem2Seq to handle the whole dialogue process via meta-learning; however, this is in the more distant future.

In terms of question answering, search over the Quora index is still very poor and there is no support for querying the knowledge base. I'm hoping to improve question ranking over the Quora index using the BERT re-ranker that was pre-trained on MS MARCO. I'm also hoping to rewrite the news system so you can ask for things like "latest about Season 8" or "new Jon Snow memes from Reddit." Finally, I'm adding some rule based dialogue flows for more realistic chat sequences. In part two of this series I will go into the more practical aspects of the chatbot, such as the platforms and tools used.
