Creating a Smart Chat Bot that Talks Like You

An efficient approach to creating a chat bot with an RNN encoder-decoder and pre-trained word embeddings

Florian Glufke
Towards Data Science



Introduction

This article describes an approach to creating a chat bot with pre-trained word embeddings and a recurrent neural network (RNN) with an encoder-decoder architecture. The word embeddings are pre-trained, which means they do not need to be learned but can simply be loaded from a file into the network. To learn the chat bot's response to a given request, an encoder-decoder network is implemented. The chat bot is built with TensorFlow and the Keras API. In this article, you can learn about these concepts and how to use them to create a chat bot that talks like you.

Introduction to the Applied AI Technologies

As word embeddings are used in this approach, here is a short introduction to this AI technology: A word embedding is a vector of real numbers that represents a certain word. Usually, the vectors have between 50 and 300 elements; in the case shown in this article, each word is represented by a 300-dimensional vector. This vector can be used as input for a neural network, which is why a word is not fed into the neural network as letters but as its word embedding. Word embeddings are learned by processing a huge amount of data (e.g. all Wikipedia articles of a certain language can be used for this purpose). The word embedding of a word is learned based on its context, i.e. the words that surround it. Thus, words with a similar context have similar word embeddings. As training such word embeddings requires great effort and huge amounts of data, pre-trained word embeddings are used. Since word embeddings for a particular language are needed, you should first check whether pre-trained embeddings are available for the desired language. For the examples described in this article, training data in German and German word embeddings are used. These can be found at deepset (https://deepset.ai/german-word-embeddings, thanks to deepset for creating the word embeddings and letting me use them) and were generated with the GloVe algorithm from the German Wikipedia. A detailed description of the GloVe algorithm and further information about it can be found in [1]. You can learn how to use such pre-trained word embeddings in this article. The process is called transfer learning because we use already learned vectors and do not change them during training. Further information about word embeddings can be found in [2].
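
To get a feeling for what "similar word embeddings" means, the following small sketch loads part of the embeddings file and compares a few vectors using the cosine similarity. It is only an illustration and makes some assumptions: the file name matches the one used later in this article, each line contains a word followed by its 300 numbers, and the (lowercased) example words actually occur in the file.

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction, values near 0 mean they are unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Read only the most frequent words to keep the illustration fast
embeddings = {}
with open("drive/MyDrive/vectors.txt") as vector_file:
    for line in vector_file:
        word, vector = line.split(maxsplit=1)
        embeddings[word] = np.fromstring(vector, sep=" ")
        if len(embeddings) == 100000:
            break

# Words with a similar context should score higher than unrelated words
print(cosine_similarity(embeddings["hund"], embeddings["katze"]))  # dog vs. cat
print(cosine_similarity(embeddings["hund"], embeddings["auto"]))   # dog vs. car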

Another important AI technology in this article is the recurrent neural network. A recurrent neural network is typically used to process sequences of data, e.g. a sequence of weather data (the temperature for each day over a period of 30 days) or stock data (the daily price of a share over a period of one year). The recurrent neural network can output a single value or another sequence. In the case of the weather data, this could be the forecasted temperature for the next 30 days; for the stock data, the anticipated price of the share on the next day. There are mainly two different types of units used to build a recurrent neural network: LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Unit). They both have in common that they compute an internal state which is passed from one LSTM or GRU unit to the next. In the examples shown in this article, LSTMs will be used. However, the described approach should also work fine with GRUs. LSTMs were first introduced by Sepp Hochreiter and Jürgen Schmidhuber. You can find further information about them in [3]. GRUs can be seen as simpler versions of LSTMs. You can get further information about them in [4].
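
To make the idea of a recurrent network processing a sequence more concrete, here is a minimal sketch of an LSTM model in Keras that takes 30 daily temperatures and predicts a single value, the temperature of the next day. It is not part of the chat bot, and the data here is random and only stands in for real weather measurements.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Toy data: 100 sequences of 30 daily temperatures (one feature per time step),
# each labelled with the temperature of day 31
x = np.random.uniform(-10, 35, size=(100, 30, 1))
y = np.random.uniform(-10, 35, size=(100, 1))

model = Sequential([
    LSTM(32, input_shape=(30, 1)),  # processes the sequence of 30 time steps
    Dense(1)                        # outputs a single value: the forecasted temperature
])
model.compile(loss="mse", optimizer="adam")
model.fit(x, y, epochs=2, verbose=0)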

In the case shown below, the input sequence is a request to the chat bot. This request is split into single words, so the sequence consists of several single words. The output sequence of the recurrent neural network is the response of the chat bot. As the request and the response do not have the same number of words, a special recurrent network architecture is used. This architecture is called an encoder-decoder network. Although this approach was originally designed for language translation, it works in the case described in this article as well.

The encoder-decoder network was first introduced in [5] to translate English sentences into French. A similar approach can be found in [4]. Basically, there are two recurrent networks in this architecture. The first recurrent network is called the encoder. It encodes the input sequence into an internal state, representing it as a vector with a fixed length. This state is used as the initial state of the second recurrent network, the decoder. The output of the decoder is fed into a simple neural network with a softmax activation function. The output of this neural network is a probability distribution, created by the softmax activation function, over the whole vocabulary for each word in the output sequence. This is explained in more detail later in this article.

Getting Training Data

Since we want the chat bot to talk like you, some training data is needed that contains conversations with you. Chat protocols from messenger apps are a good source for this purpose. Usually, you can export chat protocols as CSV files. These CSV files need to be processed so that requests to you are paired with the corresponding responses from you. The requests are the input for the encoder-decoder network and the responses are the expected outputs. Thus, two arrays are needed: one with requests (x_train_raw) and one with the corresponding responses (y_train_raw). These arrays need to be pre-processed so that punctuation, upper case letters and special characters are deleted or replaced.

In the following code snippet you can see how a CSV file can be pre-processed to get training data:

import csv
import re

x_train_raw = []
y_train_raw = []

# CSV_IS_OUTGOING_ROW_INDEX and the message column (here index 10) depend on
# the export format of your messenger app
with open("drive/MyDrive/messages.csv") as chat_file:
    first_line = True
    is_request = True
    last_request = ""
    csv_chat_file = csv.reader(chat_file)
    for row in csv_chat_file:
        if first_line:
            # Skip the header line of the CSV file
            first_line = False
            continue
        else:
            if row[CSV_IS_OUTGOING_ROW_INDEX] == "0":
                last_request = row[10]
                is_request = True
            if row[CSV_IS_OUTGOING_ROW_INDEX] == "1" and is_request:
                # Keep only lower case letters, German umlauts and spaces
                x_train_raw.append(re.sub(r"[^a-zäöüß ]+", "", last_request.lower()))
                y_train_raw.append(re.sub(r"[^a-zäöüß ]+", "", row[10].lower()))
                is_request = False

The Embedding Layer and Vocabulary

For the purpose of implementing word embeddings, the Keras API provides an embedding layer. The pre-trained word embeddings are loaded into this embedding layer. This approach is called transfer learning, as we use already learned word embeddings. The word embeddings are saved in a text file. Thus, the word embeddings need to be loaded from that file, processed so that they fit to the expected data structure of Keras and loaded into the embedding layer. Along with the loading of the word embeddings, the vocabulary needs to be defined. To define the vocabulary, a Python dictionary is created which contains an index for each word as the value and the word itself as the key. This dictionary will be used later to convert our training data into arrays which contain the index of each word instead of the written words. This index is used by the embedding layer to look up the corresponding word embedding. The following special words need to be put into the vocabulary as well:

  • <PAD>
  • <START>
  • <UNKNOWN>

The purpose of these is explained further on in this article.

The following code snippet shows how the word embeddings are loaded from a file and how the dictionary for the vocabulary is created:

import numpy as np

word_embeddings = {}
word_index = {}

#Add special words to vocabulary
word_index[PAD_TOKEN] = PAD_INDEX
word_index[START_TOKEN] = START_INDEX
word_index[UNKNOWN_TOKEN] = UNKNOWN_INDEX

word_embeddings[PAD_TOKEN] = [0.0] * 300
word_embeddings[START_TOKEN] = [-1.0] * 300
word_embeddings[UNKNOWN_TOKEN] = [1.0] * 300

index = VOCAB_START_INDEX

with open("drive/MyDrive/vectors.txt") as vector_file:
    for line in vector_file:
        # Each line contains a word followed by the 300 numbers of its embedding
        word, embedding = line.split(maxsplit=1)
        embedding = np.fromstring(embedding, sep=" ")
        word_embeddings[word] = embedding
        word_index[word] = index
        index += 1
        if index == VOCAB_SIZE:
            break

As it is not desired to load the whole file, loading stops when a defined number of words in the vocabulary is reached. Because the words in the file are ordered by how frequently they occur, it is enough to load only a certain number of them, for example the first 20,000 words. Thus, for the case described in this article, the 20,000 most frequent words in the German Wikipedia are defined as our vocabulary.

After the word embeddings are loaded from the file, they need to be loaded into the embedding layer of Keras. You can see this part in the following code snippet:

from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

embedding_matrix = np.zeros((VOCAB_SIZE, embedding_dim))

for word, index in word_index.items():
    word_embedding = word_embeddings[word]
    embedding_matrix[index] = word_embedding

embedding_layer = Embedding(VOCAB_SIZE,
                            embedding_dim,
                            embeddings_initializer=Constant(embedding_matrix),
                            trainable=False,
                            mask_zero=True,
                            name="embedding")

It is important that trainable is set to False. Otherwise, the word embeddings would be changed during training, which is not desired since they have already been trained. Another important parameter is mask_zero=True. This parameter masks the word with the index zero, so that it is not used for training. The word with index zero is the special word “<PAD>”, which is used for padding. How this is done is explained in the next section.
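
The effect of mask_zero=True can be illustrated with a small sketch: the embedding layer computes a mask that tells downstream layers (such as the LSTMs) which positions contain real words and which only contain the padding index zero. The word indexes 12 and 7 below are arbitrary example values.

import tensorflow as tf

# A toy batch with one sentence: two real words followed by two "<PAD>" positions
padded_batch = tf.constant([[12, 7, 0, 0]])

# Because mask_zero=True, the embedding layer reports which positions should be ignored
print(embedding_layer.compute_mask(padded_batch))
# -> tf.Tensor([[ True  True False False]], ...)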

Preparing the Training Data

Once the vocabulary is defined, the training data can finally be processed so that it can be used for training. For this purpose, every word in the two arrays (x_train_raw and y_train_raw) is replaced by its corresponding index in the vocabulary. As a fixed length is necessary for the encoder as well as the decoder input, sentences with a word count higher than the size of the input are truncated and sentences with a lower word count are padded. For this purpose, the special word “<PAD>”, with its index zero, is used. During training, the expected output must be fed into the decoder as well, but in a modified form. To do this, the array y_train is taken, every sentence in it is shifted by one position, and the index of “<START>” is inserted as the first element of each sentence. If a word is not found in the vocabulary, the index of “<UNKNOWN>” is used. As the response of the decoder network is not an array of indexes, but a vector containing the probability of the next word for every word in the vocabulary, the training data needs to be converted further. For this purpose, the Keras API function to_categorical() can be used. This function generates a one-hot encoded vector out of the array of indexes.

from tensorflow.keras.utils import to_categorical

def sentences_to_index(sentences, sentenc_length):
    sentences_as_index = np.zeros((len(sentences), sentenc_length), dtype=int)
    tmp_sentences_index = 0
    tmp_word_index = 0
    unknown_count = 0
    for sentence in sentences:
        words = sentence.split(" ")
        for word in words:
            current_word_index = word_index.get(word)
            if tmp_word_index == sentenc_length - 1:
                break
            if current_word_index is not None:
                sentences_as_index[tmp_sentences_index, tmp_word_index] = current_word_index
            else:
                # Word is not in the vocabulary, use the index of the unknown token
                sentences_as_index[tmp_sentences_index, tmp_word_index] = UNKNOWN_INDEX
                unknown_count += 1
            tmp_word_index += 1
        tmp_sentences_index += 1
        tmp_word_index = 0
    print("Unknown count: " + str(unknown_count))
    return sentences_as_index

x_train_encoder = sentences_to_index(x_train_raw, MAX_INPUT_SENTENC_LENGTH)
y_train = sentences_to_index(y_train_raw, MAX_OUTPUT_SENTENC_LENGTH)
# Shift the expected outputs by one position and put the index of "<START>" first
x_train_decoder = np.roll(y_train, 1)
x_train_decoder[:, 0] = START_INDEX
y_train = to_categorical(y_train, num_classes=VOCAB_SIZE)

Defining the Models

After the embedding layer is created and the word embeddings are loaded into it, the models can be defined. In this approach, three models are needed. One model is the training model which will be used to train the chat bot. The other two models are used to get the response of the chat bot after training is completed. They are called the inference models. All the models share the same layers. Thus, the weights of the layers, which are learned during training, are used by the inference models. First, it is helpful to look at the creation of the layers needed:

from tensorflow.keras.layers import Input, LSTM, Dense

#Define the layers of the encoder
encoder_input = Input(shape=(MAX_INPUT_SENTENC_LENGTH,), name="encoder_input")
encoder_lstm = LSTM(LSTM_UNITS_NUMBER, return_state=True, name="encoder_lstm")

#Connect the layers of the encoder
encoder_input_embedded = embedding_layer(encoder_input)
_, state_h, state_c = encoder_lstm(encoder_input_embedded)
encoder_state = [state_h, state_c]

#Define the layers of the decoder
decoder_input = Input(shape=(MAX_OUTPUT_SENTENC_LENGTH,), name="decoder_input")
decoder_state_input_h = Input(shape=(LSTM_UNITS_NUMBER,),
                              name="decoder_state_h_input")
decoder_state_input_c = Input(shape=(LSTM_UNITS_NUMBER,),
                              name="decoder_state_c_input")
decoder_state_input = [decoder_state_input_h, decoder_state_input_c]
decoder_lstm = LSTM(LSTM_UNITS_NUMBER,
                    return_sequences=True,
                    return_state=True,
                    name="decoder_lstm")
decoder_dense = Dense(VOCAB_SIZE, activation='softmax', name="decoder_dense")

#Connect the layers of the decoder
decoder_input_embedded = embedding_layer(decoder_input)
decoder_output, _, _ = decoder_lstm(decoder_input_embedded,
                                    initial_state=encoder_state)
decoder_output = decoder_dense(decoder_output)

To get a better understanding of the different layers, the architecture is shown in the following diagram:

Figure 1: Architecture of the RNN encoder-decoder network. Image created by the author.

As can be seen in Figure 1, the embedding layer is the first layer of both the encoder network and the decoder network. The same embedding layer can be used for both because they share the same vocabulary. The output of the embedding layer is fed into the encoder network, which consists of an LSTM layer with 1024 units. It is important that return_state is set to True because this state is necessary as input for the decoder network. Thus, the state of the encoder network is passed as the initial state into the decoder network. Furthermore, the decoder network receives the expected output sequence as input. The last layer of the decoder network is a dense layer with a softmax activation function. For every position in the output sentence, this dense layer provides a probability for each word in the vocabulary. Thus, the output of the dense layer has the shape (maximum output sentence length, vocabulary size).

After all the layers are created and connected, we can define our training model, as shown in the following code snippet:

from tensorflow.keras.models import Model

#Define the training model
training_model = Model([encoder_input, decoder_input], decoder_output)

Figure 2: Architecture of the training model. Image created by the author.

In Figure 2, a diagram of the structure of the training model is shown. The diagram was created with the help of the Keras function plot_model.
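
If you want to create such a diagram yourself, plot_model can be called as sketched below. Note that it additionally requires the pydot and Graphviz packages to be installed; the file name is just an example.

from tensorflow.keras.utils import plot_model

# Render the model graph as a PNG file, including the layer output shapes
plot_model(training_model, to_file="training_model.png", show_shapes=True)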

As it is necessary to provide the target sentences as input to the decoder during training, the variable decoder_input is part of the input for the training model.

Finally, two inference models are created — the encoder model and the decoder model — as shown in the following code snippet:

#Define the encoder model
encoder_model = Model(encoder_input, encoder_state)

#Define the decoder model
decoder_output, state_h, state_c = decoder_lstm(decoder_input_embedded,
                                                initial_state=decoder_state_input)
decoder_state = [state_h, state_c]
decoder_output = decoder_dense(decoder_output)
decoder_model = Model([decoder_input] + decoder_state_input,
                      [decoder_output] + decoder_state)

The resulting structure of the encoder model and decoder model are shown in Figures 3 and 4, respectively.

Figure 3: Architecture of the encoder model. Image created by the author.
Figure 4: Architecture of the decoder model. Image created by the author.

Training the Model

Once the training model has been defined and our training data is ready, the training can start. The following code snippet shows the training:

training_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
training_model.fit(x=[x_train_encoder, x_train_decoder], y=y_train, epochs=40)

After 40 epochs of training, the model's output matches the training data with an accuracy of about 95%, which is already quite good.

Using the Inference Models

Now that the training is completed, the inference models can finally be used to talk to the chat bot. To do this, the request is first passed into the encoder model to compute the internal state. Since the expected output is not available, as it was during training, the target sequence which is used as input for the decoder initially consists only of the special word “<START>”. This target sequence, together with the computed encoder state, is passed into the decoder model to compute the first word of the response. As the dense layer computes a probability for each word in the vocabulary, the word with the highest probability is selected. After that, the state is updated with the state output by the decoder and the target sequence is extended by the word that was just computed. This is repeated until the maximum number of words per sentence is reached or the computed word is “<PAD>”. The result is a list with the computed index of each word in the response. Finally, these indexes have to be translated back into words to get the response of the chat bot.

The following code snippet shows how the inference models are used:

def talk_with_chat_bot(request):
    x_test_raw = []
    x_test_raw.append(request)
    x_test = sentences_to_index(x_test_raw, MAX_INPUT_SENTENC_LENGTH)
    state = encoder_model.predict(x_test)

    target_seq = np.zeros((1, MAX_OUTPUT_SENTENC_LENGTH))
    target_seq[0, 0] = START_INDEX

    output_index = []
    chat_bot_response = ""

    for i in range(MAX_OUTPUT_SENTENC_LENGTH - 1):
        output_tokens, state_h, state_c = decoder_model.predict([target_seq] + state)
        predicted_word_index = np.argmax(output_tokens[0, i, :])
        output_index.append(predicted_word_index)
        if predicted_word_index == PAD_INDEX:
            break
        target_seq[0, i + 1] = predicted_word_index
        state = [state_h, state_c]

    for output_token in output_index:
        for key, value in word_index.items():
            if value == output_token \
                    and value != PAD_INDEX \
                    and value != UNKNOWN_INDEX:
                chat_bot_response += " " + key

    print("Request: " + request)
    print("Response:" + chat_bot_response)

talk_with_chat_bot("wo sollen wir uns treffen")
talk_with_chat_bot("guten tag")
talk_with_chat_bot("wie viel uhr treffen wir uns")

Results

The chat bot answers with reasonable sentences and good grammar, as shown in the following output of the chat bot. For those who are not familiar with the German language, a translation is provided in parentheses:

Request: wo sollen wir uns treffen (where should we meet)
Response: am haupteingang (at the main entrance)
Request: guten tag (good day)
Response: hey du wie läuft es bei dir (hey you how is it going)
Request: wie viel uhr treffen wir uns (what time do we meet)
Response: oh (oh)

The approach described above delivers good results with a reasonable amount of effort. Thanks to the already learned word embeddings, training does not take long and not much training data is needed.

This is a very basic approach, and even better results can be achieved with several possible improvements. For example, the recurrent networks can have not just one layer but two or four. They can also be implemented as bidirectional recurrent networks. In that case, the recurrent neural network does not only look at the words before the current one, but also at the words after it. To get varied responses to the same request, the algorithm for selecting the next word could be improved (in the case described above, the word with the highest probability was always picked, which is called greedy sampling). Some randomness could be applied to the selection of the next word, e.g. by selecting a word randomly from among the words with the highest probabilities (this is called stochastic sampling).
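
As a sketch of how stochastic sampling could replace the greedy np.argmax in the inference loop, the following function draws the next word index at random, weighted by the predicted probabilities. The temperature parameter is a common addition (not discussed above) that controls how much randomness is applied; this is only an illustration, not the method used for the results shown earlier.

import numpy as np

def sample_word_index(probabilities, temperature=1.0):
    # Greedy sampling would simply be np.argmax(probabilities).
    # Here the next word is drawn at random, weighted by the re-scaled
    # probabilities, so the same request can yield different responses.
    probabilities = np.asarray(probabilities, dtype="float64")
    probabilities = np.log(probabilities + 1e-10) / temperature
    probabilities = np.exp(probabilities) / np.sum(np.exp(probabilities))
    return np.random.choice(len(probabilities), p=probabilities)

# Possible usage inside the inference loop instead of np.argmax:
# predicted_word_index = sample_word_index(output_tokens[0, i, :], temperature=0.8)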

Summary

What you learn in this article:

  • The basics of an encoder-decoder network and word embeddings.
  • The concept of transfer learning. Concretely, how pre-trained word embeddings can be loaded into an embedding layer of Keras.
  • How training data can be obtained and prepared to make a chat bot that talks like you.
  • How models can be set up and trained for an encoder-decoder network.
  • How the inference models can be used to generate responses from your chat bot.

References

The following is a list of sources that offer more information on the topics in this article:

[1] Jeffrey Pennington and Richard Socher and Christopher D. Manning, GloVe: Global Vectors for Word Representation (2014), Empirical Methods in Natural Language Processing (EMNLP)

[2] Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean, Efficient Estimation of Word Representations in Vector Space (2013), arXiv

[3] Sepp Hochreiter and Jürgen Schmidhuber, Long Short-Term Memory (1997), Neural Computation

[4] Kyunghyun Cho and Bart van Merrienboer and Caglar Gulcehre and Dzmitry Bahdanau and Fethi Bougares and Holger Schwenk and Yoshua Bengio, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (2014), EMNLP 2014

[5] Ilya Sutskever and Oriol Vinyals and Quoc V. Le, Sequence to Sequence Learning with Neural Networks (2014), arXiv


Passionate software developer with more than 12 years of professional experience. Excited about artificial intelligence, software engineering, and data science.