How We Created an Open-Source COVID-19 Chatbot

A deep dive into AI and natural language processing that takes you along on our journey of developing this corona chatbot

Emma Schreurs
Towards Data Science


Picture by Anne Heijkoop from Triple

The coronavirus outbreak has major consequences for society worldwide. People are rightly concerned and have many urgent questions. The World Health Organization provides answers to frequently asked questions regarding the coronavirus on their website (link). However, you may have to search for a while before you find the right answer to your question. It is vital that people are well informed about current measures, because this helps limit the spread of the virus. A chatbot is well suited to help with this!

In this blog post, we are going to create a chatbot for questions regarding the coronavirus, based on artificial intelligence (AI) and natural language processing (NLP). First, we will learn more about the AI and NLP used for this chatbot. After that, the implementation of the chatbot is demonstrated based on the discussed techniques. The source code and dataset can also be downloaded from the corresponding GitHub page (link).

This blog post is for companies in the technology sector who want to help stop the coronavirus and learn a bit about AI and NLP in the process. In recent years, there has been a lot of innovation in the field of AI and NLP. So first it is good to take a deep dive and learn a bit more about how to make a state-of-the-art chatbot.

For readers who are more interested in implementing the chatbot and want to skip the theoretical part, I recommend continuing from the section “Implementation USE and chatbot on data COVID-19”.

What is a chatbot?

A chatbot can no longer be called an exceptional luxury in the year 2020. I can say with some certainty that almost everyone, whether consciously or unconsciously, has been in contact with a chatbot. For example, you have probably asked questions of a digital customer service or another online chat service. During such a conversation, the chatbot knows how to answer your questions, ask questions of its own, or refer you to a website where your questions can be answered. Chatbots can also be found on websites and in instant messaging programs (including Skype, Facebook Messenger, and Slack).

In the past, chatbots were based entirely on pre-programmed rules, so they had nothing to do with artificial intelligence (AI). Today, however, AI is increasingly used in implementing chatbots, thanks to the enormous developments in natural language processing (NLP). This development is essential, because for a chatbot to have some degree of “intelligence” it needs a certain understanding of language. But how do we teach a computer to understand language, and how does a chatbot apply this language understanding?

With this blog, you will gain insight into both the theoretical and technical background of a chatbot based on AI. I will explain how we teach a computer language using the Universal Sentence Encoder (USE), a state-of-the-art “deep learning model”. I will also demonstrate how you can build a chatbot yourself in Python, based on the USE, on a dataset with frequently asked questions about the coronavirus (from the WHO). In short, if implementing a chatbot is on your project list, this blog will provide a strong foundation.

Benefits of a chatbot

Using a chatbot offers many benefits for both the consumer and the producer.

First of all, the benefits for the consumer.

  • Proactive: the chatbot can start a conversation with a consumer based on, for example, his or her time spent on a website. A consumer may have been on a website for a long time without finding what he or she is looking for. At that moment, a chatbot that asks whether you could use some help can come in handy.
  • Direct answer: the chatbot generates an answer within seconds, while telephone customer service or mail often has a longer waiting time.
  • Available 24/7: the chatbot does not have to sleep or take a break, so the service is available at any time of the day.
  • Relevant information: the chatbot can directly guide a consumer and help find relevant information. For example, you can avoid a long search on an extensive website by asking the chatbot for guidance.
  • Involvement: the chatbot can show attractive pictures, videos, and GIFs during a conversation, which stimulates the consumer to get more involved in the conversation.

Now the benefits for the producer:

  • Cost reduction: one of the most powerful benefits of a chatbot is that it saves costs (provided the bot works well). It offers cheap and fast customer service and prevents extreme waiting times. This means that a company can free up time for more complex questions and may require less manpower for customer service.
  • Satisfied consumers: based on the aforementioned benefits for consumers, you can assume that consumer satisfaction improves (provided the bot works well). This, in turn, offers advantages for organizations that implement the chatbot.
  • Feedback: the chatbot can ask for feedback at the end of each conversation, either in the form of an open question or a multiple-choice question. This provides a platform that can process additional feedback and analyze it directly.
  • In the workplace: employees can ask the chatbot questions about business matters within a company, such as “How do I change my password?” or “Who do I contact for a certain problem?”. This improves efficiency within companies.

How does a chatbot work?

Now we know what a chatbot is, what it can be used for and what the benefits are for both the consumer and the producer. But how exactly does a chatbot work and how does a chatbot understand language? We will now go through this step by step, starting with the subject of language understanding.

Natural Language Processing (NLP)

As I mentioned earlier, NLP is one of the most important building blocks of a chatbot. NLP is the technology used to teach computers to “understand” human language. It is considered a difficult area of AI, since the logic of human language is hard to translate into code. Take a sarcastic comment or a joke: how do you teach a computer to understand this?

For starters, we need a representation (encoding) of text that a computer can interpret. To obtain such a representation, a strategy is used that transforms “strings” into numbers, also called the vectorization of text. It is important that these vectors capture the semantic relationships between sentences, as this is essential for language understanding. For example, take the sentences “I have a bit of a cold, could this be the new coronavirus?” and “I’m afraid I have COVID-19, since I have a cold.” These two sentences mean roughly the same thing, but are formulated differently. It is important that, despite the differences, the computer understands that these sentences are semantically similar.

The computer can learn to understand this by means of “sentence encodings”, a commonly used method within NLP. The general idea is that we give a similar (but not identical) encoding to sentences that have the same semantics (such as the previous example sentences). So, these encodings are a kind of “barcode” for the meaning of a sentence. To see if two sentences are equal in meaning, we just need to check whether their barcodes are about the same. This enables the computer to learn to recognize patterns within language and thus to interpret the similarity between phrases such as “I have a bit of a cold, could this be the new coronavirus?” and “I’m afraid I have COVID-19, since I have a cold.”

Later in the blog, we will take a deeper dive into how a computer eventually learns these encodings. For now, it is important to remember that this is done on the basis of a lot of available text data (such as Wikipedia, news feeds, etc.).

We know: if the encodings are about the same, the sentences mean about the same. But how do we actually measure when something is “about equal”? I will explain this using the (left) image below. If the sentence encodings had only three dimensions, we could picture them as dots on a sphere (for readers wondering “Why a sphere?”: the encodings are always vectors of length 1). We can then measure how close two dots are to each other. But what measure do we use for that? A common choice from linear algebra is the “cosine similarity”: the cosine of the angle between two dots as seen from the center of the sphere. Seen from the center, the blue dot makes the smallest angle with the red dot, so we could say that the sentence associated with the blue dot most closely resembles the sentence associated with the red dot. If you want to learn more about cosine similarity, I can recommend this blog (link).

Left figure by Yang et al., (2019). Right figure by Triple

As I just mentioned, we calculate the cosine of the angle between these dots (encodings) to compare how semantically similar the sentences are. Since each encoding already has length 1, we only need to calculate the inner product. The inner product of the red and the blue dot equals the cosine of the angle between them. This value represents the degree of agreement between the two dots (encodings). The higher the cosine, the smaller the angle, and so the higher the semantic similarity. The (right) figure above may provide even better insight. The darker blue the box, the smaller the angle (so the higher the cosine).
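To make the inner-product idea concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up purely for illustration (real USE encodings have many more dimensions), and the helper names `normalize` and `inner_product` are my own, not part of any library.

```python
import math

def normalize(vector):
    """Scale a vector so that it has length 1, like a dot on the sphere."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]

def inner_product(a, b):
    """For unit-length vectors this equals the cosine of the angle between them."""
    return sum(x * y for x, y in zip(a, b))

# Made-up encodings, for illustration only.
red = normalize([1.0, 2.0, 2.0])
blue = normalize([1.0, 2.0, 1.5])    # points in almost the same direction as red
green = normalize([-2.0, 0.5, 1.0])  # points in a very different direction

similarity_blue = inner_product(red, blue)
similarity_green = inner_product(red, green)

# The angle itself can be recovered from the cosine if you want to inspect it.
angle_blue = math.degrees(math.acos(similarity_blue))
angle_green = math.degrees(math.acos(similarity_green))
```

Here `similarity_blue` is larger than `similarity_green`, and correspondingly `angle_blue` is smaller than `angle_green`: the blue sentence would be considered the closer match to the red one.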

Just a quick review. We now understand that sentence encodings let us determine semantic relationships, which greatly improves the computer's understanding of natural language. We also know how to determine those relationships between sentences using “cosine similarity”. But how exactly do we get those sentence encodings?

Sentence encodings are generated by a neural network. There are different types of neural networks for this purpose, of which the Universal Sentence Encoder (USE) is currently one of the very best. In order to understand the USE and its associated Transformer architecture, it is first important to have a basic knowledge of neural networks. For readers who are already familiar with neural networks, I recommend skipping the next section, “Neural Networks”.

Neural Networks

A neural network is sometimes compared to the human brain, but whether this is entirely justified is debatable. Nevertheless, the comparison provides a slightly simplified first indication of how a neural network works. A human neuron receives electrical signals from other neurons. When a certain threshold is reached, the neuron starts firing: a pulse of electricity passes through the axon to one or more other neurons. A neuron in a neural network also receives input signals (for example, the black and white pixels of an image) and activates when a certain threshold value is reached. Yet there is also a big and important difference between a human neuron and an artificial neuron. To this day we do not know exactly how a human neuron works and learns, whereas we do know this for a neuron in a neural network. I will now briefly explain how a simple neural network works and learns.

A neural network is basically made up of three types of “layers” (see image). The first layer always contains the input variables: the initial data for the neural network. The second type consists of one or more hidden layers, in which calculations are performed. In the third layer, an activation function is applied, which introduces non-linearity.

Figure by Emma Schreurs (2020)

As mentioned earlier, it is not known exactly how a human neuron learns. Luckily, the learning process of a neural network can be explained. A neural network “learns” by minimizing a cost function. This cost function measures model performance: how well the model estimates the relationship between the input and output variables. As shown on the right-hand side of the image above, the cost function calculates the difference between the predicted outcome variable and the true outcome variable. We want to minimize this difference, because we want the model to predict the true outcome accurately.

This is where gradient descent comes into play. Gradient descent is an optimization algorithm that searches for a local or global minimum of the cost function. You could compare this to finding a valley in a mountainous landscape. In this valley, the “cost” of the difference between the predicted and true outcome variables is minimal. In other words, here our model best predicts the true outcome. But how do we determine which direction to walk to reach this valley? We do this through backpropagation, which calculates the so-called “gradient” of the cost function. The gradient points in the direction of steepest ascent, so walking in the opposite direction takes us down toward the valley. Backpropagation calculates this gradient using the “chain rule”, which gives the derivative of the cost function with respect to the weights. Gradient descent then updates the weights and biases (the parameters) of the neural network so that the cost decreases (the predicted outcome moves closer to the true outcome).

This is a short walk through the learning process of a neural network (see image). It introduces a number of new terms, so it makes sense if you don’t get it all at once based on this brief explanation. If you want to learn a bit more about this, I recommend watching 3blue1brown (link) on YouTube.
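The walk into the valley can be illustrated with a deliberately tiny example. The sketch below is not a real neural network: it assumes a “network” with a single weight `w` and bias `b`, toy data that follows y = 2x, a mean squared error cost, and a learning rate of 0.05. All of these are illustrative choices of mine.

```python
# Toy data following y = 2 * x (made up for illustration).
inputs = [1.0, 2.0, 3.0, 4.0]
targets = [2.0, 4.0, 6.0, 8.0]

def cost(w, b):
    """Mean squared difference between the predicted and the true outcomes."""
    errors = [(w * x + b - y) ** 2 for x, y in zip(inputs, targets)]
    return sum(errors) / len(errors)

def gradient(w, b):
    """Derivatives of the cost with respect to w and b, obtained via the chain rule."""
    n = len(inputs)
    d_w = sum(2 * (w * x + b - y) * x for x, y in zip(inputs, targets)) / n
    d_b = sum(2 * (w * x + b - y) for x, y in zip(inputs, targets)) / n
    return d_w, d_b

w, b = 0.0, 0.0          # start somewhere on the mountain
learning_rate = 0.05     # size of each step
initial_cost = cost(w, b)

for step in range(2000):
    d_w, d_b = gradient(w, b)
    # Step against the gradient, i.e. downhill.
    w -= learning_rate * d_w
    b -= learning_rate * d_b

final_cost = cost(w, b)
```

After the loop, `w` ends up close to 2 and `b` close to 0, and `final_cost` is far below `initial_cost`. A real network repeats the same idea for many weights at once, with backpropagation supplying the gradients layer by layer.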

By now you have a basic understanding of sentence encodings and neural networks. Now the connection between the two remains to be established. So, it is about time to start specifying the neural network model applied to learn the sentence encodings used for our chatbot.

Universal Sentence Encoder (USE)

As I mentioned earlier, the USE is one of the best sentence encoders available at the moment. There are multiple versions of the USE, one of which is specifically suitable for creating a chatbot. This one is based on the transformer architecture and uses attention. What these new terms mean will be explained in the following paragraphs. In the article “Attention Is All You Need” you can find an extra detailed explanation (link).

First of all, the transformer architecture. The transformer architecture offers an innovative method to translate one sequence into another using an encoder and a decoder. Unlike other commonly used sentence encoders, the transformer does not use recurrent networks. Until now, recurrent networks have been the best way to capture dependencies within sequences, but that is about to change. Google AI presents an architecture without recurrent networks that relies only on attention. This model achieves state-of-the-art results on various NLP tasks (translation, question answering, and sentiment analysis). See the model architecture in the image below.

Figure by Vaswani et al., (2017)

That was a lot of new terms in one fell swoop, which is difficult to take in right away for both experienced readers and newcomers to the field. Hence, I will explain each of these terms in more detail, starting with sequences and the encoder/decoder architecture.

Sequence-to-sequence learning (seq2seq) uses, as the name implies, a neural network that transforms a sequence of elements into another sequence of elements. Seq2seq performs very well on translation tasks, in which a sequence of words is translated into a sequence of words in another language. Seq2seq models consist of a so-called encoder-decoder structure. The encoder processes the input sequence and transforms it into an n-dimensional vector (sentence encoding). This abstract vector is fed into the decoder, which transforms it into an output sequence. This output sequence can, for example, be in a different language. You can interpret the n-dimensional vector as a kind of common language. Suppose you feed the encoder a Dutch sentence that the decoder has to translate into French. The n-dimensional vector then serves as an intermediate language that is understandable for both the Dutch encoder and the French decoder. Initially, however, neither the encoder nor the decoder is fluent in this common language, so both need to be trained on a large number of examples. More on this later.

Now we understand seq2seq and the encoder-decoder structure. However, in the description of the transformer architecture I also mentioned recurrent networks. To understand these, it helps to first understand feedforward networks (which the transformer does use). I will now explain both.

A feedforward network is the same as a neural network as previously described under the heading “Neural Networks”. An example of a feedforward network is the classification of pictures into cat versus non-cat based on previously manually classified data. The network minimizes the cost function by means of gradient descent and backpropagation. Nevertheless, feedforward networks have a serious shortcoming: they have no sense of order in time, because a feedforward network only considers the input it is directly exposed to. Recurrent networks, on the other hand, are able to remember input from the past in addition to the current input they receive. A recurrent network is sometimes compared to a human, in that it has a memory that plays a role in decision making. Recurrent networks are therefore extremely useful and often used. However, the transformer architecture does not use recurrent networks. As a result, it has no built-in notion of the order in which the input arrived. That order matters, since every word in a sentence has a position relative to the other words in the sentence. Hence the transformer uses positional encoding: the position of each word is encoded and added to the n-dimensional vector of that word.
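As an illustration of positional encoding, the sketch below uses the sinusoidal scheme from “Attention Is All You Need” (Vaswani et al., 2017). Whether the USE uses exactly this variant is an assumption on my part. The four-dimensional word vectors are made up, and `positional_encoding` and `add_positions` are hypothetical helper names.

```python
import math

def positional_encoding(position, dimensions):
    """Sinusoidal encoding of a single position (Vaswani et al., 2017)."""
    encoding = []
    for i in range(dimensions):
        # Each pair of dimensions (2k, 2k + 1) shares one wavelength.
        angle = position / (10000 ** ((2 * (i // 2)) / dimensions))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_positions(word_vectors):
    """Add the encoding of each position to the word vector at that position."""
    dimensions = len(word_vectors[0])
    result = []
    for position, vector in enumerate(word_vectors):
        pos = positional_encoding(position, dimensions)
        result.append([value + p for value, p in zip(vector, pos)])
    return result

# Made-up word vectors; the first and third word are the same word.
word_vectors = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.5, 0.5, 0.5],
    [0.1, 0.2, 0.3, 0.4],
]
with_positions = add_positions(word_vectors)
```

Although the first and third word start out with identical vectors, their rows in `with_positions` differ, so the model can tell that the same word occurs at two different places in the sentence.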

Finally, the importance and use of attention. Attention is a mechanism that looks at the input sequence in the encoder and determines which parts of the sequence are most important. In other words, attention, much like a person reading, keeps track of the keywords in a sentence in addition to focusing on the current word. People do this to understand the context of a sentence or story. Attention can thus ignore the noise in a sentence and focus on the relevant information. It can also remember relevant words that occur at the very beginning of a sentence or document.

The image below gives a clearer picture of attention. The red words are the words currently being read or processed; the blue words are the memories. The darker blue a memory, the stronger it is at that moment. At first, attention was used as an addition to recurrent networks, but it has since been found that attention can achieve state-of-the-art results on its own. If you want to learn more about attention, I recommend reading this blog (link).

Figure by Vaswani et al., (2017)
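For readers who like to see the mechanism in code, here is a minimal sketch of scaled dot-product attention for a single query, the formulation used in “Attention Is All You Need”. The two-dimensional vectors are made up for illustration, and a real transformer additionally uses learned projections and multiple attention heads, which are left out here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dimensions = len(values[0])
    # The output is the weighted average of the value vectors.
    output = [
        sum(w * value[d] for w, value in zip(weights, values))
        for d in range(dimensions)
    ]
    return weights, output

# Made-up vectors for three words in a sentence.
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = keys
query = [1.0, 0.0]  # the word currently being processed

weights, output = attention(query, keys, values)
```

The `weights` play the role of the shades of blue in the figure: the first word, whose key is most similar to the query, receives the largest weight, and the second word the smallest.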

By now you may wonder how this transformer architecture is used to create sentence encodings. This goes as follows: a large neural network is trained on various language-related tasks (multi-task network). One common encoder is used for this, see figure below. The common encoder (see red in the figure) is the USE encoder with the associated transformer architecture. This multi-task neural network is trained on massive amounts of data, “learning” sentence encodings that are generic to all of these tasks. When training is finished, the common encoder is disconnected. This encoder can then be used for other specific tasks, such as a chatbot!

What makes the USE a multilingual sentence encoder is that a translation task has been added to the multi-task neural network. The left three tasks (Conversational Response, Quick-Thoughts and NLI) are all trained in the same language. A translation task, on the other hand, is trained on two languages. This pushes the encoder to create sentence encodings that are interpretable across the languages included in the translation task. The USE is trained on a total of 16 languages.

In short, we have now learned what sentence encodings are, how they are learned, what neural networks are and how the USE is structured. So, it is about time to demonstrate how you can apply the USE to your own dataset in Python.

Implementation USE and chatbot on data COVID-19

Now it’s time to demonstrate how the USE can be used to build a chatbot ourselves. The encodings of the USE are pre-trained and the model is available in Google’s TensorFlow Hub (link).

For starters, a short description of the dataset. Given these unfortunate corona times, we have chosen a relevant dataset: one from the World Health Organization (WHO). As I mentioned earlier, there is a page on the WHO website which provides answers to frequently asked questions about the coronavirus. This is exactly the type of data that is suitable for a chatbot. The dataset we collected contains 86 possible answers on various topics regarding the coronavirus (including “What is a coronavirus” and “What are the symptoms of COVID-19”). The dataset and code are available on the corresponding GitHub page (link).

Now it is time to apply the USE to the created dataset. To start, we load the module containing the USE model and we load the necessary packages, see code below. You will also see the top rows of the data frame, which contains all the answer options for the chatbot and the context of the answer options.

We want to use this module that we have just loaded to create “response encodings”. All possible answers to topics related to the coronavirus are available in the dataset above. These are used to create the response encodings. The USE model also offers the opportunity to provide context around the possible answers. The context can also be found in the above dataset. In the code below you can see how the response encodings are created based on the answers and context from the dataset.

Now that we have the response encodings, we can ask a question about the coronavirus and get an immediate answer! For this question, we again use the module (as a reminder, this contains the USE). A question encoding is made for every question you ask. Based on this question encoding, the algorithm can search for the best-matching response encoding. This is the essence of what this chatbot does. The semantically most similar response is found using the inner product, as explained previously under “Natural Language Processing”. The code below shows a number of test questions; for each one, the question encoding is created and the inner product is taken to find the answer that is semantically most similar to the question.

This chatbot gives a correct answer to all the above test questions, which is very good news! Nevertheless, we ran into some shortcomings, which I will discuss now.
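As a minimal sketch of this lookup step (not the code from the GitHub repository), the example below replaces the USE with a toy stand-in encoder so that it runs without TensorFlow. The function `toy_encode` is a simple unit-length bag-of-words vector and does not capture meaning the way the USE does. The two answers and their contexts are made-up rows, not taken from the WHO dataset. Only the structure is the point: encode the responses once, encode each question, and pick the response with the highest inner product.

```python
import math
import re
from collections import Counter

def toy_encode(text):
    """Stand-in for the USE: a unit-length bag-of-words vector (word -> weight)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    length = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / length for word, c in counts.items()}

def inner_product(a, b):
    """Inner product of two sparse vectors stored as dictionaries."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())

# Hypothetical rows: each possible answer comes with some context.
responses = [
    {"answer": "Common symptoms are fever, tiredness and a dry cough.",
     "context": "What are the symptoms of the coronavirus?"},
    {"answer": "Wash your hands regularly and keep distance from others.",
     "context": "How can I protect myself from the coronavirus?"},
]

# Response encodings are created once, from the answer plus its context.
response_encodings = [
    toy_encode(row["answer"] + " " + row["context"]) for row in responses
]

def answer(question):
    """Return the answer whose encoding has the highest inner product with the question."""
    question_encoding = toy_encode(question)
    scores = [inner_product(question_encoding, enc) for enc in response_encodings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return responses[best]["answer"]

reply = answer("Which symptoms does the coronavirus cause?")
```

With the real model, `toy_encode` would be replaced by the question and response encoders of the USE module, while the inner-product search stays the same.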

Shortcomings

First, the shortcomings of the model. While testing the chatbot, we noticed that every question containing the word “COVID-19” was given a strange answer. This can be explained by the fact that all sentence encodings are pre-trained. Since COVID-19 is a new disease (and therefore a new word), the USE never encountered it during training. This indicates a shortcoming of the model: it is not able to handle “unknown” words very well. We worked around this problem by replacing the word “COVID-19” with “coronavirus”. The algorithm does this automatically for every question, so you do not need to take it into account when asking a question.

Also, the chatbot gave strange answers when the dataset contained answer options longer than three sentences. The USE performs well mainly on single sentences or short texts (at most three sentences). That the USE performs better on a few sentences is also stated in the user guide on TensorFlow. However, to give completely informative answers to questions concerning the coronavirus, long answers are sometimes necessary. A solution to this problem would therefore be very valuable.

Finally, the shortcomings of the chatbot. The current chatbot created for questions regarding the coronavirus is based on a fairly small dataset. This means that this chatbot is not able to answer every question. To make this chatbot even better, more data can be collected from the WHO, and the dataset should be kept up to date, since there are frequent updates regarding the coronavirus. As I mentioned previously, the chatbot would also be better if it were able to support longer answers. Unfortunately, this is not possible with the USE model.
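The replacement step can be sketched in a few lines of Python. The function name `normalize_question` is hypothetical, and also matching spelling variants such as “covid 19” or “Covid19” is my own assumption, not necessarily what the original code does.

```python
import re

# Matches "COVID-19", "covid 19", "Covid19", and similar variants.
COVID_PATTERN = re.compile(r"covid[\s-]?19", flags=re.IGNORECASE)

def normalize_question(question):
    """Replace the term the model does not know with one it has seen in training."""
    return COVID_PATTERN.sub("coronavirus", question)

normalized = normalize_question("What are the symptoms of COVID-19?")
```

Here `normalized` becomes “What are the symptoms of coronavirus?”, which is then passed on to the question encoder instead of the original text.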

Final word

Who knows, after reading this blog you may have become enthusiastic and want to get started with the dataset and code available on the corresponding GitHub page. The USE algorithm used in this blog post is one of many building blocks that can be used to build a chatbot. So, there are also other options for building a chatbot. If this blog is too technical for you, Triple also offers an alternative, the Wozzbot. This is an application that offers a non-technical way to build one or more chatbots. For information, visit www.wozzbot.com and contact the team at info@wozzbot.com.

Regarding the coronavirus, it is very important that people inform themselves as well as possible in order to limit further spread. A chatbot could make a large contribution to that. For example, people can type in an urgent question via an app to which they can get an immediate answer. In short, we can do more with this!!

Data Science intern at Triple and graduate student Behavioural Data Science at the University of Amsterdam, with a background in Psychology