Solving an NLP task using a Sequence-to-Sequence model: from Zero to Hero

Dima Shulga
Towards Data Science
13 min read · Nov 30, 2018


Today I want to solve a very popular NLP task called Named Entity Recognition (NER). In short, NER is the task of extracting Named Entities from a sequence of words (a sentence). For example, given the sentence:

“Jim bought 300 shares of Acme Corp. in 2006.”

We want to say that “Jim” is a person, “Acme” is an organization and “2006” is time.

To do that, I’ll use this publicly available Kaggle dataset. I’ll skip all the data-processing code here and focus on the actual problem and solution. You can see the full code in this notebook. In this dataset there are many entity types, like Person (PER), Organisation (ORG) and others, and for each entity type there are two kinds of tags: “B-SOMETAG” and “I-SOMETAG”. “B-” marks the beginning of an entity name and “I-” marks its continuation. So for an entity like “World Health Organisation”, the corresponding tags would be [B-ORG, I-ORG, I-ORG].
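
To make the tagging scheme concrete, here is a small helper (my own illustration, not part of the dataset code) that groups a BIO-tagged sequence back into entities:

def tags_to_entities(words, tags):
    # Group BIO tags back into (entity_text, entity_type) pairs
    entities, current, current_type = [], [], None
    for word, tag in zip(words, tags):
        if tag.startswith('B-'):
            if current:
                entities.append((' '.join(current), current_type))
            current, current_type = [word], tag[2:]
        elif tag.startswith('I-') and current:
            current.append(word)
        else:  # 'O' - outside any entity
            if current:
                entities.append((' '.join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((' '.join(current), current_type))
    return entities

print(tags_to_entities(['World', 'Health', 'Organisation', 'says'],
                       ['B-ORG', 'I-ORG', 'I-ORG', 'O']))
# ============== Output ==============================
# [('World Health Organisation', 'ORG')]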

Here's an example from the dataset:

import pandas as pd

# Load the Kaggle NER dataset and look at the first rows
ner_df = pd.read_csv('ner_dataset.csv')
ner_df.head(30)

So we get some sequence (a sentence), and we want to predict the “class” of each word. This is not a standard Machine Learning task like classification or regression: we get a sequence, and our output should be a sequence of the same size.

There are many ways to solve this. Here I’m going to do the following:

  1. Build a very simple model that treats this task as a classification of each word in every sentence and use it as a benchmark.
  2. Build a Sequence to Sequence model using Keras.
  3. Talk about the right way to measure and compare our results.
  4. Use pre-trained GloVe embeddings in the Seq2Seq model.

Feel free to jump to any section.

Bag of Words and Multi-class Classification

As I mentioned before, our output should be a sequence of classes, but first, I want to explore a somewhat naive approach: a simple multi-class classification model. I want to treat each word in every sentence as a separate instance, and for each instance (word) I want to be able to predict its class, i.e., one of O, B-ORG, I-ORG, B-PER and so on. This is of course not the best way to model this problem, but I want to do it for two reasons. I want to create a benchmark while keeping things as simple as possible, and I want to show that a sequence to sequence model works much better when we are working with, well, sequences. Many times when we try to model a real-life problem, it’s not clear what type of problem we’re dealing with. Sometimes we model these problems as simple classification tasks while in reality, a sequence model could be much better.

As I said, I treat this approach as a benchmark and keep things as simple as possible, so for each word (instance), my features will simply be the word’s own count vector together with a Bag of Words vector of all the other words in the same sentence. My target variable will be one of 17 classes.

import scipy.sparse

def sentence_to_instances(words, tags, bow, count_vectorizer):
    # Build one instance per word: its own count vector stacked with
    # the Bag of Words vector of the whole sentence
    X = []
    y = []
    for w, t in zip(words, tags):
        v = count_vectorizer.transform([w])[0]
        v = scipy.sparse.hstack([v, bow])
        X.append(v)
        y.append(t)

    return scipy.sparse.vstack(X), y

So given a sentence like:

“The World Health Organization says 227 people have died from bird flu”

We’ll get 12 instances, one for each word:

the          O
world        B-org
health       I-org
organization I-org
says         O
227          O
people       O
have         O
died         O
from         O
bird         O
flu          O

Now our task is, given a single word in a sentence, predict its class.

We have 47,958 sentences in our dataset; we break them into “train” and “test” sets:

train_size = int(len(sentences_words) * 0.8)
train_sentences_words = sentences_words[:train_size]
train_sentences_tags = sentences_tags[:train_size]
test_sentences_words = sentences_words[train_size:]
test_sentences_tags = sentences_tags[train_size:]
print 'Train:', len(train_sentences_words)
print 'Test:', len(test_sentences_words)
# ============== Output ==============================
Train: 38366
Test: 9592

We’ll use the method above to transform all the sentences into many instances of words. In the train dataset, we have 839,214 word instances.
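
The count_vectorizer and the sentences_to_instances wrapper are defined in the notebook; a rough sketch of what they might look like (the notebook remains the reference, details may differ) is:

import numpy as np
import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on the training sentences only
count_vectorizer = CountVectorizer()
count_vectorizer.fit([' '.join(s) for s in train_sentences_words])

def sentences_to_instances(sentences_words, sentences_tags, count_vectorizer):
    X, y = [], []
    for words, tags in zip(sentences_words, sentences_tags):
        # Bag of Words vector of the whole sentence, shared by all its words
        bow = count_vectorizer.transform([' '.join(words)])[0]
        sent_X, sent_y = sentence_to_instances(words, tags, bow, count_vectorizer)
        X.append(sent_X)
        y.extend(sent_y)
    return scipy.sparse.vstack(X), np.array(y)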

train_X, train_y = sentences_to_instances(train_sentences_words,
                                          train_sentences_tags,
                                          count_vectorizer)
print 'Train X shape:', train_X.shape
print 'Train Y shape:', train_y.shape
# ============== Output ==============================
Train X shape: (839214, 50892)
Train Y shape: (839214,)

In our X we have 50,892 dimensions: a one-hot vector for the current word, concatenated with a Bag of Words vector for all the other words in the same sentence.

We’ll use Gradient Boosting Classifier as our predictor:

clf = GradientBoostingClassifier().fit(train_X, train_y)
predicted = clf.predict(test_X)
print classification_report(test_y, predicted)

We get:

             precision    recall  f1-score   support

      B-art       0.57      0.05      0.09        82
      B-eve       0.68      0.28      0.40        46
      B-geo       0.91      0.40      0.56      7553
      B-gpe       0.96      0.84      0.90      3242
      B-nat       0.52      0.27      0.36        48
      B-org       0.93      0.31      0.46      4082
      B-per       0.80      0.52      0.63      3321
      B-tim       0.91      0.66      0.76      4107
      I-art       0.09      0.02      0.04        43
      I-eve       0.33      0.02      0.04        44
      I-geo       0.82      0.55      0.66      1408
      I-gpe       0.86      0.62      0.72        40
      I-nat       0.20      0.08      0.12        12
      I-org       0.88      0.24      0.38      3470
      I-per       0.93      0.25      0.40      3332
      I-tim       0.67      0.15      0.25      1308
          O       0.91      1.00      0.95    177215

avg / total       0.91      0.91      0.89    209353

Is it good? It’s hard to say, but it doesn’t look too bad. We could probably think of several ways to improve this model, but that’s not the goal of this post, and as I said, I want to keep it a very simple benchmark.

We have a problem though. This is not the right way to measure our model. We get precision/recall for each word, but it doesn’t tell us anything about the real entities. Here’s a simple example, given the same sentence:

“The World Health Organization says 227 people have died from bird flu”

We have 3 words tagged with the ORG class. If we correctly predict only two of them, we’ll get 66% accuracy at the word level, but we haven’t extracted the “World Health Organization” entity correctly, so our accuracy on the entities would be 0!

I’ll talk about a better way to measure our Named Entity Recognition model later in this post, but first, let’s build our “Sequence to Sequence” model.

Sequence to Sequence Model

One major drawback of the previous approach is that we lose the dependency information. Given a word in a sentence, it could be beneficial to know whether the word on the left (or on the right) is an entity. Not only is this hard to do when we build a separate instance for each word, we also don’t have this information at prediction time. This is one reason to use the whole sequence as an instance.

There are many different models that we can use to do that. Algorithms like Hidden Markov Models (HMM) or Conditional Random Fields (CRF) probably work well, but here, I want to implement a Recurrent Neural Network using Keras.

To use Keras, we need to turn our sentences into sequences of numbers, where each number represents a word, and we need to make all our sequences the same length. We can do this using Keras utility methods.

First, we fit a Tokenizer that will help us turn our words into numbers. It is very important to fit it only on the train set.

from keras.preprocessing.text import Tokenizer

words_tokenizer = Tokenizer(num_words=VOCAB_SIZE,
                            filters=[],
                            oov_token='__UNKNOWN__')
words_tokenizer.fit_on_texts(map(lambda s: ' '.join(s),
                                 train_sentences_words))
word_index = words_tokenizer.word_index
word_index['__PADDING__'] = 0
index_word = {i: w for w, i in word_index.iteritems()}
print 'Unique tokens:', len(word_index)

Next, we’ll create the sequences using the Tokenizer and pad them so that all sequences have the same length:

from keras.preprocessing.sequence import pad_sequences

train_sequences = words_tokenizer.texts_to_sequences(map(lambda s: ' '.join(s), train_sentences_words))
test_sequences = words_tokenizer.texts_to_sequences(map(lambda s: ' '.join(s), test_sentences_words))

train_sequences_padded = pad_sequences(train_sequences, maxlen=MAX_LEN)
test_sequences_padded = pad_sequences(test_sequences, maxlen=MAX_LEN)

print train_sequences_padded.shape, test_sequences_padded.shape
# ============== Output ==============================
(38366, 75) (9592, 75)

We can see that we have 38,366 sequences in the train set and 9,592 in the test set, with 75 tokens in each sequence.

We want to do something similar for our tags as well. I’ll skip the code here; as before, you can find it in the notebook.
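
Roughly, it might look like this (an illustrative sketch; names like tags_tokenizer are mine and the notebook remains the reference):

tags_tokenizer = Tokenizer(filters='', lower=False)
tags_tokenizer.fit_on_texts(map(lambda s: ' '.join(s), train_sentences_tags))
tag_index = tags_tokenizer.word_index
tag_index['__PADDING__'] = 0

train_tags_seq = tags_tokenizer.texts_to_sequences(map(lambda s: ' '.join(s), train_sentences_tags))
test_tags_seq = tags_tokenizer.texts_to_sequences(map(lambda s: ' '.join(s), test_sentences_tags))

# Pad to the same length as the word sequences and add a trailing axis,
# as expected by sparse_categorical_crossentropy
train_tags_padded = np.expand_dims(pad_sequences(train_tags_seq, maxlen=MAX_LEN), -1)
test_tags_padded = np.expand_dims(pad_sequences(test_tags_seq, maxlen=MAX_LEN), -1)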

print train_tags_padded.shape, test_tags_padded.shape
# ============== Output ==============================
(38366, 75, 1) (9592, 75, 1)

We have 38,366 sequences in the train set and 9,592 in the test set, each with 75 tags, one for every token (each tag is one of the 17 classes, plus padding).

Now we are ready to build our model. We’ll use a Bidirectional Long Short-Term Memory (LSTM) layer, as it has proven very effective for such tasks:

from keras.layers import Input, Embedding, LSTM, Bidirectional, Dense
from keras.models import Model

input = Input(shape=(75,), dtype='int32')
emb = Embedding(VOCAB_SIZE, 300, input_length=75)(input)
x = Bidirectional(LSTM(64, return_sequences=True))(emb)
preds = Dense(len(tag_index), activation='softmax')(x)

model = Model(input, preds)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['sparse_categorical_accuracy'])

Let’s see what we have here:

Our first layer is Input. It accepts vectors of shape (75,), matching our X variable (we have 75 tokens in each of our sequences in both train and test).

Next, we have the Embedding layer. This layer takes each of our tokens/words and turns it into a dense vector of size 300. Think of it as a giant lookup table (or dictionary) with tokens (word ids) as keys and the actual vectors as values. This lookup table is trainable, i.e., during training we update those vectors at each epoch to better fit the input.

After the Embedding layer, our input turns from a vector of length 75 to a matrix of size (75, 300). Each of the 75 tokens now has a vector of size 300.

Once we have this, we can use the Bidirectional LSTM layer, which, for each token, looks both ways in the sentence and returns a state that will help us classify the word later on. By default, the LSTM layer returns a single vector (the last one), but in our case we want a vector for each token, so we use return_sequences=True.

The output of this layer is a matrix of size (75, 128): 75 tokens, with 64 numbers for one direction and 64 for the other.
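
A quick way to convince yourself of these shapes is to build the layers in isolation and inspect them (a standalone sanity check, not part of the model code):

from keras import backend as K
from keras.layers import Input, Embedding, LSTM, Bidirectional

check_input = Input(shape=(75,), dtype='int32')
check_emb = Embedding(1000, 300)(check_input)

with_sequences = Bidirectional(LSTM(64, return_sequences=True))(check_emb)
without_sequences = Bidirectional(LSTM(64))(check_emb)

print(K.int_shape(check_emb))          # (None, 75, 300) - a 300-dim vector per token
print(K.int_shape(with_sequences))     # (None, 75, 128) - 64 numbers per direction, per token
print(K.int_shape(without_sequences))  # (None, 128)     - only the last state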

Finally, we have a Dense layer that is applied to every time step (because its input is a sequence, thanks to return_sequences=True, it effectively acts as a Time Distributed Dense layer).

It takes the (75, 128) output of the LSTM layer and returns the desired (75, 18) matrix: for each of the 75 tokens, 18 probabilities, one for each of the 17 tags plus one for __PADDING__.

It’s very easy to see what’s going on using the model.summary() method:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 75)                0
_________________________________________________________________
embedding_1 (Embedding)      (None, 75, 300)           8646600
_________________________________________________________________
bidirectional_1 (Bidirection (None, 75, 128)           186880
_________________________________________________________________
dense_2 (Dense)              (None, 75, 18)            627
=================================================================
Total params: 8,838,235
Trainable params: 8,838,235
Non-trainable params: 0
_________________________________________________________________

You can see all our layers with their input and output shapes. We can also see the number of parameters in our model. You probably noticed that the embedding layer has the most parameters. The reason is that we have many words and we need to learn 300 numbers for each word. Later in this post, we’ll use pre-trained embeddings to improve our model.

Let’s train our model:

model.fit(train_sequences_padded, train_tags_padded,
          batch_size=32,
          epochs=10,
          validation_data=(test_sequences_padded, test_tags_padded))
# ============== Output ==============================
Train on 38366 samples, validate on 9592 samples
Epoch 1/10
38366/38366 [==============================] - 274s 7ms/step - loss: 0.1307 - sparse_categorical_accuracy: 0.9701 - val_loss: 0.0465 - val_sparse_categorical_accuracy: 0.9869
Epoch 2/10
38366/38366 [==============================] - 276s 7ms/step - loss: 0.0365 - sparse_categorical_accuracy: 0.9892 - val_loss: 0.0438 - val_sparse_categorical_accuracy: 0.9879
Epoch 3/10
38366/38366 [==============================] - 264s 7ms/step - loss: 0.0280 - sparse_categorical_accuracy: 0.9914 - val_loss: 0.0470 - val_sparse_categorical_accuracy: 0.9880
Epoch 4/10
38366/38366 [==============================] - 261s 7ms/step - loss: 0.0229 - sparse_categorical_accuracy: 0.9928 - val_loss: 0.0480 - val_sparse_categorical_accuracy: 0.9878
Epoch 5/10
38366/38366 [==============================] - 263s 7ms/step - loss: 0.0189 - sparse_categorical_accuracy: 0.9939 - val_loss: 0.0531 - val_sparse_categorical_accuracy: 0.9878
Epoch 6/10
38366/38366 [==============================] - 294s 8ms/step - loss: 0.0156 - sparse_categorical_accuracy: 0.9949 - val_loss: 0.0625 - val_sparse_categorical_accuracy: 0.9874
Epoch 7/10
38366/38366 [==============================] - 318s 8ms/step - loss: 0.0129 - sparse_categorical_accuracy: 0.9958 - val_loss: 0.0668 - val_sparse_categorical_accuracy: 0.9872
Epoch 8/10
38366/38366 [==============================] - 275s 7ms/step - loss: 0.0107 - sparse_categorical_accuracy: 0.9965 - val_loss: 0.0685 - val_sparse_categorical_accuracy: 0.9869
Epoch 9/10
38366/38366 [==============================] - 270s 7ms/step - loss: 0.0089 - sparse_categorical_accuracy: 0.9971 - val_loss: 0.0757 - val_sparse_categorical_accuracy: 0.9870
Epoch 10/10
38366/38366 [==============================] - 266s 7ms/step - loss: 0.0076 - sparse_categorical_accuracy: 0.9975 - val_loss: 0.0801 - val_sparse_categorical_accuracy: 0.9867

We get 98.6% accuracy on our test set. This accuracy doesn’t tell us much, as most of our tags are “O” (Other). We want to see precision/recall for each class as before, but as I mentioned in the previous section, even that is not the best way to evaluate this model. What we want is a way to see how many entities of each type we were able to predict correctly.
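
To see just how skewed the label distribution is, we can count the share of “O” tags in the test set (from the word-level report above, it is roughly 85%: 177,215 of 209,353 word instances):

from collections import Counter

tag_counts = Counter(t for tags in test_sentences_tags for t in tags)
total = sum(tag_counts.values())
print('Fraction of "O" tags: %.3f' % (float(tag_counts['O']) / total))
# A model that predicts "O" everywhere already looks very "accurate"
# at the token level, which is why we need an entity-level metric.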

Evaluation of Sequence to Sequence model

When we work with sequences, our tags/entities will probably be sequences as well. As I showed before, if the true entity is “World Health Organisation”, predicting “World Organisation” or “World Health” may give us 66% accuracy at the word level, but both are wrong predictions. We want to extract all the entities in each sentence and compare them to the predicted ones.

We can use the excellent seqeval library for this. For each sentence, it looks at the tags and constructs the entities. By doing this for both the true tags and the predicted tags, we can compare actual entity values and not just words. At this level there are no “B-” or “I-” tags; we compare the actual entity types rather than per-word classes.
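
A toy example makes the difference clear: the prediction below gets 3 out of 4 word-level tags right, but seqeval counts the entity as a miss because the predicted span does not match the true one exactly:

from seqeval.metrics import f1_score

y_true = [['B-org', 'I-org', 'I-org', 'O']]  # "World Health Organisation ..."
y_pred = [['B-org', 'I-org', 'O', 'O']]      # predicted only "World Health"

print(f1_score(y_true, y_pred))
# ============== Output ==============================
# 0.0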

Using our predicted values, which are matrices of probabilities, we want to construct a sequence of tags for each sentence with its original length (not the padded length of 75) so we can compare it to the true values. We’ll do this for both our LSTM model and our Bag of Words model:

lstm_predicted = model.predict(test_sequences_padded)

lstm_predicted_tags = []
bow_predicted_tags = []
for s, s_pred in zip(test_sentences_words, lstm_predicted):
    # LSTM: take the most likely tag per token and drop the padding
    tags = np.argmax(s_pred, axis=1)
    tags = map(index_tag_wo_padding.get, tags)[-len(s):]
    lstm_predicted_tags.append(tags)

    # BOW: predict each word of the sentence separately
    bow_vector, _ = sentences_to_instances([s],
                                           [['x'] * len(s)],
                                           count_vectorizer)
    bow_predicted = list(clf.predict(bow_vector))
    bow_predicted_tags.append(bow_predicted)

Now we are ready to evaluate both our models using the seqeval library:

from seqeval.metrics import classification_report, f1_score

print 'LSTM'
print '=' * 15
print classification_report(test_sentences_tags,
                            lstm_predicted_tags)
print
print 'BOW'
print '=' * 15
print classification_report(test_sentences_tags, bow_predicted_tags)

We get:

LSTM
===============
             precision    recall  f1-score   support

        art       0.11      0.10      0.10        82
        gpe       0.94      0.96      0.95      3242
        eve       0.21      0.33      0.26        46
        per       0.66      0.58      0.62      3321
        tim       0.84      0.83      0.84      4107
        nat       0.00      0.00      0.00        48
        org       0.58      0.55      0.57      4082
        geo       0.83      0.83      0.83      7553

avg / total       0.77      0.75      0.76     22481


BOW
===============
             precision    recall  f1-score   support

        art       0.00      0.00      0.00        82
        gpe       0.01      0.00      0.00      3242
        eve       0.00      0.00      0.00        46
        per       0.00      0.00      0.00      3321
        tim       0.00      0.00      0.00      4107
        nat       0.00      0.00      0.00        48
        org       0.01      0.00      0.00      4082
        geo       0.03      0.00      0.00      7553

avg / total       0.01      0.00      0.00     22481

There’s a big difference. You can see that the BOW model was barely able to predict anything correctly, while the LSTM model did a much better job.

Of course, we could work more on the BOW model and achieve much better results, but the big picture is clear: the Sequence to Sequence model is much more appropriate in this case.

Pre-trained Word Embeddings

As we saw before, most of our model parameters belong to the Embedding layer. Training this layer is very hard, as there are many words and only limited training data. It’s very common to use pre-trained embedding layers. Most current embedding models rely on what is called the “distributional hypothesis”, which states that words that appear in similar contexts have similar meanings. By building a model that predicts a word given its context (or the other way around), they can produce word vectors that represent word meanings well. While this is not directly related to our task, using these embeddings may help our model represent words better for its goal.

There are other ways to build word embeddings, from a simple co-occurrence matrix to much more complex language models. In another post, I tried building word embeddings using images.

Here we’ll use the popular GloVe embeddings; Word2Vec or any other implementation would probably work as well.

We need to download them, load the word vectors and create the embedding matrix. We’ll use this matrix as non-trainable weights for our Embedding layer:

import os
import numpy as np

embeddings = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.300d.txt')) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings[word] = coefs

num_words = min(VOCAB_SIZE, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, 300))
for word, i in word_index.items():
    if i >= VOCAB_SIZE:
        continue
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
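
As a quick sanity check on the loaded vectors (purely illustrative and not required for the model), semantically related words should have a noticeably higher cosine similarity than unrelated ones:

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings['king'], embeddings['queen']))   # relatively high
print(cosine_similarity(embeddings['king'], embeddings['carpet']))  # much lower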

Now to our model:

from keras.initializers import Constant

input = Input(shape=(75,), dtype='int32')
emb = Embedding(VOCAB_SIZE, 300,
                embeddings_initializer=Constant(embedding_matrix),
                input_length=MAX_LEN,
                trainable=False)(input)
x = Bidirectional(LSTM(64, return_sequences=True))(emb)
preds = Dense(len(tag_index), activation='softmax')(x)

model = Model(input, preds)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['sparse_categorical_accuracy'])
model.summary()
# ============== Output ==============================
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         (None, 75)                0
_________________________________________________________________
embedding_2 (Embedding)      (None, 75, 300)           8646600
_________________________________________________________________
bidirectional_2 (Bidirection (None, 75, 128)           186880
_________________________________________________________________
dropout_2 (Dropout)          (None, 75, 128)           0
_________________________________________________________________
dense_4 (Dense)              (None, 75, 18)            627
=================================================================
Total params: 8,838,235
Trainable params: 191,635
Non-trainable params: 8,646,600
_________________________________________________________________

Everything is the same as before. The only difference is that now we have constant, non-trainable weights for our embedding layer. You can see that the total number of parameters has not changed, while the number of trainable parameters is much lower.

Let’s fit the model:

Train on 38366 samples, validate on 9592 samples
Epoch 1/10
38366/38366 [==============================] - 143s 4ms/step - loss: 0.1401 - sparse_categorical_accuracy: 0.9676 - val_loss: 0.0514 - val_sparse_categorical_accuracy: 0.9853
Epoch 2/10
38366/38366 [==============================] - 143s 4ms/step - loss: 0.0488 - sparse_categorical_accuracy: 0.9859 - val_loss: 0.0429 - val_sparse_categorical_accuracy: 0.9875
Epoch 3/10
38366/38366 [==============================] - 138s 4ms/step - loss: 0.0417 - sparse_categorical_accuracy: 0.9876 - val_loss: 0.0401 - val_sparse_categorical_accuracy: 0.9881
Epoch 4/10
38366/38366 [==============================] - 132s 3ms/step - loss: 0.0381 - sparse_categorical_accuracy: 0.9885 - val_loss: 0.0391 - val_sparse_categorical_accuracy: 0.9887
Epoch 5/10
38366/38366 [==============================] - 146s 4ms/step - loss: 0.0355 - sparse_categorical_accuracy: 0.9891 - val_loss: 0.0367 - val_sparse_categorical_accuracy: 0.9891
Epoch 6/10
38366/38366 [==============================] - 143s 4ms/step - loss: 0.0333 - sparse_categorical_accuracy: 0.9896 - val_loss: 0.0373 - val_sparse_categorical_accuracy: 0.9891
Epoch 7/10
38366/38366 [==============================] - 145s 4ms/step - loss: 0.0318 - sparse_categorical_accuracy: 0.9900 - val_loss: 0.0355 - val_sparse_categorical_accuracy: 0.9894
Epoch 8/10
38366/38366 [==============================] - 142s 4ms/step - loss: 0.0303 - sparse_categorical_accuracy: 0.9904 - val_loss: 0.0352 - val_sparse_categorical_accuracy: 0.9895
Epoch 9/10
38366/38366 [==============================] - 138s 4ms/step - loss: 0.0289 - sparse_categorical_accuracy: 0.9907 - val_loss: 0.0362 - val_sparse_categorical_accuracy: 0.9894
Epoch 10/10
38366/38366 [==============================] - 137s 4ms/step - loss: 0.0278 - sparse_categorical_accuracy: 0.9910 - val_loss: 0.0358 - val_sparse_categorical_accuracy: 0.9895

The accuracy hasn’t changed much, but as we saw before, accuracy is not the right metric here. Let’s evaluate it the right way and compare it to our previous models:

# Predict again with the new (pre-trained embeddings) model
lstm_predicted = model.predict(test_sequences_padded)

lstm_predicted_tags = []
for s, s_pred in zip(test_sentences_words, lstm_predicted):
    tags = np.argmax(s_pred, axis=1)
    tags = map(index_tag_wo_padding.get, tags)[-len(s):]
    lstm_predicted_tags.append(tags)

print 'LSTM + Pretrained Embeddings'
print '=' * 15
print classification_report(test_sentences_tags, lstm_predicted_tags)
# ============== Output ==============================
LSTM + Pretrained Embeddings
===============
             precision    recall  f1-score   support

        art       0.45      0.06      0.11        82
        gpe       0.97      0.95      0.96      3242
        eve       0.56      0.33      0.41        46
        per       0.72      0.71      0.72      3321
        tim       0.87      0.84      0.85      4107
        nat       0.00      0.00      0.00        48
        org       0.62      0.56      0.59      4082
        geo       0.83      0.88      0.86      7553

avg / total       0.80      0.80      0.80     22481

Much better: our F1 score increased from 0.76 to 0.80!

Conclusion

Sequence to Sequence models are very powerful models for many tasks like Named Entity Recognition (NER), Part of Speech (POS) tagging, parsing and more. There are many techniques and options for training them, but the most important thing is to know when to use them and how to model our problem correctly.
