
Upgrade Your Beginner NLP Project with BERT

Deep learning doesn't have to be complex.

Photo by Brett Jordan on Unsplash

Introduction

When I first started learning Data Science and looking at projects, I thought you had to choose between a Deep Learning project and a regular one. This is not the case.

With powerful models becoming more and more accessible, we can easily leverage some of the power of deep learning without having to optimize a neural network or use a GPU.

In this post, we are going to look at embeddings. This is the way Deep Learning models represent words as vectors. We can use part of the model to generate embeddings and fit a regular (scikit-learn) model on top to get some really incredible results!

I’m going to explain each method individually, using graphs to represent why it works and show how to implement these techniques in Python.

Prerequisites

  • You should understand the basics of Machine Learning.
  • To get the most out of this, you should know how to fit models in scikit-learn and already have a dataset suitable for NLP.
  • This tutorial is ideal for someone who already has an NLP project and is looking to upgrade it and get a taste for Deep Learning.
  • Each model in this article increases in complexity. This article will explain the fundamentals and how to use the technology but you might want to visit some of the links provided to fully understand the concepts.

Dataset

To illustrate each model, we’re going to use the Kaggle NLP with Disaster Tweets dataset. This is around 10,000 Tweets that were selected based on keywords (e.g. ablaze) and then tagged according to whether or not they are about a real disaster.

You can read about the competition and see results here:

Natural Language Processing with Disaster Tweets

You can view or clone all the code here:

AdamShafi92/Exploring-Embeddings

Visualisations

We’re going to explore each model using 2 visualisations. I’ve included examples below.

To visualise words…

A UMAP representation of all of the sentences. UMAP is a method of dimensionality reduction that allows us to view our high dimensional word representations in just 2 dimensions.

Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.

It is great for visualising clusters of topics, but if you haven’t encountered dimensionality reduction before, this might be confusing. We’re essentially just looking for our words to be separated into clusters where Tweets with similar topics are spatially near to each other. A clear separation between the blue (non-disaster) and orange (disaster) texts would also be good, as this would suggest our model would be able to classify this data well.
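
For readers who want to reproduce this kind of plot, here is a minimal sketch using the umap-learn and matplotlib packages (assumed to be installed; X_train_vec and y_train refer to the vectorised data and labels used in the snippets later in this post):

import umap
import matplotlib.pyplot as plt

# project the high-dimensional sentence vectors down to 2 dimensions
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(X_train_vec)

# colour each point by its class label (0 = non-disaster, 1 = disaster)
plt.scatter(coords[:, 0], coords[:, 1], c=y_train, cmap='coolwarm', s=5)
plt.title('UMAP projection of the sentence vectors')
plt.show()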

To assess model performance…

A group of 5 charts. From left to right (a short sketch of how to compute these metrics with scikit-learn follows the list):

  1. ROC AUC. This is a typical scoring system that allows us to compare models. It takes into account the predicted probabilities, not just the predicted class.
  2. Precision/Recall. Another typical metric; we’re looking for a large, smooth area under the curve.
  3. Feature importances. This is so we can compare what we get out of each method. It won’t show much for BERT but helps to illustrate explainability.
  4. Predicted probabilities. This allows us to visualise how well the model is differentiating between the two classes. Ideally, we want to see clusters at 0 and 1 with very little around 50%.
  5. Confusion Matrix. We can visualise false positives vs false negatives.
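
As promised, a minimal sketch of how these numbers can be computed, assuming a fitted classifier called model and the vectorised test set (X_test_vec, y_test) from the snippets later in the post:

from sklearn.metrics import (roc_auc_score, precision_recall_curve, auc,
                             confusion_matrix)

# predicted probabilities for the positive (disaster) class, and hard predictions
probs = model.predict_proba(X_test_vec)[:, 1]
preds = model.predict(X_test_vec)

print('ROC AUC:', roc_auc_score(y_test, probs))            # 1. ROC AUC

precision, recall, _ = precision_recall_curve(y_test, probs)
print('PR AUC:', auc(recall, precision))                   # 2. Precision/Recall

importances = model.feature_importances_                   # 3. Feature importances

# 4. the distribution of `probs` can be plotted as a histogram per class

print(confusion_matrix(y_test, preds))                     # 5. Confusion Matrix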

Definitions

  • Vector: The classic description of a vector is a quantity with both a magnitude and a direction (e.g. 5 miles west). In Machine Learning, we often work with high-dimensional vectors.
  • Embedding: A way of representing a word (or sentence) as a vector.
  • Document: An individual text.
  • Corpus: A group of texts.

Representing Words as Vectors

In order to create a model based on words, we have to transform those words into numbers. The simplest way to do this would be to one-hot encode every word and tell our model:

  • Sentence #1 has word #1, word #12 and word #13.
  • Sentence #2 has word #6, word #24 and word #35.

Bag of Words and TF-IDF represent words this way, building on this by including some measure of how frequently the words appear.

Bag of Words methods represent words as vectors by simply creating a column for each word and indicating with a number where the word is present. The vector will be the same size as the number of unique words in the corpus.

This is fine for some approaches but we lose information about words that have different meanings in the same sentence or how context can change the meaning of a word.

Turning words into numbers, or vectors, is known as embedding the words. We can describe a set of words turned into vectors as embeddings.

Our aim when vectorising words is to represent the words in a way that captures the most information possible…

How can we tell a model that a word is similar to another? How does it know that completely different words mean the same thing? Or what about how one word changes the meaning of the word that follows it? Or even when a word has multiple meanings in the same sentence? (Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo — I’m looking at you)

Deep learning has allowed a variety of techniques to be developed that go a long way in answering most of these questions.

Bag of Words Methods

This is the simplest way of representing words. We represent each sentence as a vector by taking all the words in the corpus and giving each a 1 or 0 depending on whether it is present in the sentence.

You can see how this could get very large as the number of words increases. An issue with this is that our vectors start to get sparse. If we had a lot of short sentences with a wide range of words, there would be a lot of 0’s in our dataset. Sparsity can dramatically increase our computation time.

We can ‘upgrade’ a bag of words representation by taking counts of each word, instead of just 1 or 0. When we take counts, we can also remove words that don’t appear much in the corpus; for example, we could remove every word that appears fewer than 5 times.

Another way to improve bag of words is to use n-grams. This means taking sequences of n consecutive words instead of single words, which helps to capture more of the context in the sentence.
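
As a quick illustration (a toy example, not part of the original project), here is what counts and bigrams look like for two made-up sentences:

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ['the forest is ablaze', 'the forest is on the hill']
cv = CountVectorizer(ngram_range=(1, 2))  # single words and 2-word pairs
counts = cv.fit_transform(toy_corpus)

print(cv.get_feature_names_out())  # get_feature_names() on older scikit-learn
print(counts.toarray())            # one row per sentence, one column per n-gram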

Count Vectoriser

Intuition

This is the simplest way to vectorise language. We simply count each word in the sentence. In most cases, it is advised to remove very common words and very rare words.

Implementation

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# min_df/max_df drop very rare and very common words; ngram_range=(1, 2) includes single words and 2-word pairs
bow = CountVectorizer(min_df=5, max_df=.99, ngram_range=(1, 2))
X_train_vec = bow.fit_transform(X_train['text'])
X_test_vec = bow.transform(X_test['text'])
cols = bow.get_feature_names()  # feature names (use get_feature_names_out() on newer scikit-learn)

model = RandomForestClassifier(n_estimators=500, n_jobs=8)
model.fit(X_train_vec, y_train)
model.score(X_test_vec, y_test)

Visualisation

This 2D representation of our 4,000-dimension vectors isn’t great. We have a blob in the middle and many disparate points. There isn’t a clear way for our model to cluster or separate the data.

You can open the plot to mouse over and view what each point is.

Our model has performed pretty well regardless; it is able to correctly differentiate a good number of Tweets. However, from the feature importances we can see that it is mainly doing this using URLs. Is this a valid way to spot disaster Tweets?

TF-IDF

Intuition

An issue with using Bag of Words and counts is that frequent words, such as the, start to dominate the feature space without providing any additional information. There may be domain specific words that are much more important, but get lost or ignored by a model as they aren’t as frequent.

TF-IDF stands for Term Frequency – Inverse Document Frequency.

  • Term Frequency: frequency score of the word in the current document.
  • Inverse Document Frequency: scoring of how rare the word is across the corpus.

In TF-IDF, we score the words as we would in Bag of Words, using their frequency. We then penalise any words which are frequent across all documents (such as the, and, or).
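
As a rough sketch of the weighting (scikit-learn’s TfidfVectorizer uses a smoothed, normalised variant of this formula, so treat this as an illustration only):

import numpy as np

def tfidf_weight(count_in_doc, doc_length, n_docs, n_docs_with_term):
    tf = count_in_doc / doc_length           # Term Frequency within this document
    idf = np.log(n_docs / n_docs_with_term)  # Inverse Document Frequency across the corpus
    return tf * idf

print(tfidf_weight(3, 20, 10000, 10000))  # 'the' is in every document -> weight 0
print(tfidf_weight(3, 20, 10000, 50))     # a rare, domain-specific word -> ~0.79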

We can also use n-grams with TF-IDF.

Implementation

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# min_df/max_df drop very rare and very common words; ngram_range=(1, 2) includes single words and 2-word pairs
tfidf = TfidfVectorizer(min_df=5, max_df=.99, ngram_range=(1, 2))
X_train_vec = tfidf.fit_transform(X_train['text'])
X_test_vec = tfidf.transform(X_test['text'])
cols = tfidf.get_feature_names()  # feature names (use get_feature_names_out() on newer scikit-learn)

model = RandomForestClassifier(n_estimators=500, n_jobs=8)
model.fit(X_train_vec, y_train)
model.score(X_test_vec, y_test)

Visualisation

TF-IDF isn’t much different from the Count Vectoriser on this dataset. There is still a lot of overlap between the disaster and non-disaster tweets.

We see a small uplift in model performance by using TF-IDF. Generally, this does perform better as we downweight common words which typically don’t add anything to the model.

Embedding Words as Vectors

Bag of Words models have 3 key issues:

  1. Similar words are not related to each other. The model does not know that the words bad and terrible are similar, only (at best) that both correlate with negative sentiment.
  2. Words aren’t taken in context. Sarcasm or even not bad might not be captured well. Words with dual meanings aren’t captured.
  3. Using large corpora results in very large, sparse vectors. This makes computation at scale difficult.

With Deep Learning, we move from simple representations to embeddings. Unlike the previous methods, Deep Learning models typically output a fixed length vector that doesn’t have to be the same length as the number of words in the corpus. We are now creating a unique vector representation for each word or sentence in our dataset.

Word2Vec

Word2Vec is a deep learning method to generate embeddings, published in 2013. It can be trained on your own corpus relatively easily, but the purpose of this tutorial is to use pretrained models. I will briefly explain how the model is trained.

There are 2 ways this model is trained.

  • Skip-gram: The model loops over each word in the sentence and tries to predict the neighboring words.
  • Continuous Bag of Words: The model loops over each word and uses the surrounding n words to predict it.

For a deep dive into the model, look no further than this fantastic post by Jay Alammar.
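
This post sticks to pretrained vectors, but for reference, training your own model with Gensim is only a few lines (a hedged sketch using the X_train DataFrame from the snippets below; the sg flag switches between skip-gram and CBOW, and vector_size is called size in Gensim 3.x):

from gensim.models import Word2Vec

# naive whitespace tokenisation, just for illustration
sentences = [text.lower().split() for text in X_train['text']]

# sg=1 -> skip-gram, sg=0 -> continuous bag of words
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=5, sg=1)

print(w2v.wv.most_similar('fire'))  # nearest neighbours, if 'fire' made it into the vocabulary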

Implementation

To implement Word2Vec, we’re going to use a version trained on the Google News dataset from Gensim. This model outputs a vector of size 300 for each word. In theory, similar words should have a similar vector representation.
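
You can sanity-check that claim once the model is downloaded; Gensim’s KeyedVectors expose similarity queries directly (a small aside, not part of the original pipeline):

import gensim.downloader

# ~1.66 GB download, cached locally by Gensim after the first call
word2vec = gensim.downloader.load('word2vec-google-news-300')

print(word2vec.most_similar('earthquake', topn=5))  # semantically close words
print(word2vec.similarity('fire', 'blaze'))         # cosine similarity between two words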

An issue with Word2Vec and GloVe is that we can’t easily generate a sentence embedding.

To generate a sentence embedding with Word2Vec or GloVe, we have to generate a 300-dimensional vector for each word and then average them. The problem with this is that, although similar sentences should have similar sentence vectors, we lose any information about the order of the words.

import numpy as np
import pandas as pd
import gensim
import gensim.downloader
from sklearn.ensemble import RandomForestClassifier
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

def vectorize_sentence(sentence, model):
    nlp = English()
    tokenizer = Tokenizer(nlp.vocab)
    a = []
    for i in tokenizer(sentence):
        try:
            a.append(model.get_vector(str(i)))
        except KeyError:  # skip words the model has no vector for
            pass

    # average the word vectors to get a single sentence vector
    a = np.array(a).mean(axis=0)
    # if no words were recognised, fall back to a zero vector
    a = np.zeros(300) if np.all(a != a) else a
    return a

word2vec = gensim.downloader.load('word2vec-google-news-300')  # ~1.66 GB
# vectorize the data
X_train_vec = pd.DataFrame(np.vstack(X_train['text'].apply(vectorize_sentence, model=word2vec)))
X_test_vec = pd.DataFrame(np.vstack(X_test['text'].apply(vectorize_sentence, model=word2vec)))
# Word2Vec doesn't have feature names
model = RandomForestClassifier(n_estimators=500, n_jobs=8)
model.fit(X_train_vec, y_train)
model.score(X_test_vec, y_test)

Visualisation

At first glance, Word2Vec appears to have represented our data much better than the previous methods. There are clear regions of blue and separate regions of orange. The cluster on the top left seems to be predominantly words with capital letters, in other regions there are tweets about weather.

Unfortunately, at first glance, this doesn’t translate into better model performance. The accuracy score is slightly worse than TF-IDF. However, if we look at the confusion matrix, we can see that this model is doing a better job of recognising disaster Tweets.

A big issue here is that we now do not know what is driving these better predictions. There is a feature that is clearly used by the model more than the others, but we can’t find out what this represents without additional work.

GloVe

Intuition

GloVe stands for Global Vectors.

GloVe is similar to Word2Vec in that it is an early approach to embeddings, having been released in 2014. The key difference is that GloVe does not rely only on nearby words; it incorporates global statistics, that is, word co-occurrence across the corpus, to obtain word vectors.

The way GloVe is trained is by calculating a co-occurrence matrix for every word in the corpus. Some form of dimensionality reduction is then carried out on this matrix to reduce it to a fixed size, leaving us with a vector for every word. We can access pretrained versions of this model very easily. If you’d like to know more about how it works, take a look here.
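
The co-occurrence idea can be illustrated with a toy sketch (illustrative only: GloVe’s real training objective is a weighted least-squares factorisation of this matrix, not a plain SVD):

import numpy as np
from sklearn.decomposition import TruncatedSVD

corpus = [['the', 'forest', 'is', 'ablaze'], ['the', 'forest', 'fire', 'spreads']]
vocab = sorted({w for doc in corpus for w in doc})
idx = {w: i for i, w in enumerate(vocab)}

# count how often each pair of words appears within a 2-word window
cooc = np.zeros((len(vocab), len(vocab)))
for doc in corpus:
    for i, w in enumerate(doc):
        for j in range(max(0, i - 2), min(len(doc), i + 3)):
            if j != i:
                cooc[idx[w], idx[doc[j]]] += 1

# reduce the matrix to a small, dense vector per word
word_vectors = TruncatedSVD(n_components=3).fit_transform(cooc)
print(dict(zip(vocab, word_vectors.round(2))))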

Implementation

We’re using the GloVe ‘Gigaword‘ model, which is trained on Wikipedia and the Gigaword news corpus. You’ll notice its size is much smaller than the Word2Vec model, which suggests it covers a smaller vocabulary. This is a problem because if GloVe doesn’t recognise a word in our dataset, it raises an error (which we catch and effectively replace with zeros…).

import numpy as np
import pandas as pd
import gensim
import gensim.downloader
from sklearn.ensemble import RandomForestClassifier
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

def vectorize_sentence(sentence, model):
    nlp = English()
    tokenizer = Tokenizer(nlp.vocab)
    a = []
    for i in tokenizer(sentence):
        try:
            a.append(model.get_vector(str(i)))
        except KeyError:  # skip words the model has no vector for
            pass

    # average the word vectors to get a single sentence vector
    a = np.array(a).mean(axis=0)
    # if no words were recognised, fall back to a zero vector
    a = np.zeros(300) if np.all(a != a) else a
    return a

gv = gensim.downloader.load('glove-wiki-gigaword-300')  # ~376 MB
# vectorize the data
X_train_vec = pd.DataFrame(np.vstack(X_train['text'].apply(vectorize_sentence, model=gv)))
X_test_vec = pd.DataFrame(np.vstack(X_test['text'].apply(vectorize_sentence, model=gv)))
# GloVe doesn't have feature names
model = RandomForestClassifier(n_estimators=500, n_jobs=8)
model.fit(X_train_vec, y_train)
model.score(X_test_vec, y_test)

Visualisation

The GloVe vectors are interesting: the area in the top right is Tweets where the first letter of each word is capitalised. This isn’t something we’re interested in differentiating. Otherwise, there is a lot of overlap between blue and orange.

Our GloVe model performs significantly worse than the others so far. The most likely reason for this is that this model doesn’t understand many of the words in our corpus. In order to resolve this you’d have to train this model yourself on the corpus (or some Twitter data).
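
One way to check that hypothesis (a small addition of mine, assuming the gv model from the snippet above) is to measure how many of our tokens the model actually has a vector for:

from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

tokenizer = Tokenizer(English().vocab)
tokens = [str(tok) for text in X_train['text'] for tok in tokenizer(text)]

# KeyedVectors support `in`, so we can count out-of-vocabulary tokens directly
covered = sum(tok in gv for tok in tokens)
print(f'{covered / len(tokens):.1%} of tokens have a GloVe vector')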

Doc2Vec

Intuition

The key problem with GloVe and Word2Vec is that we are just averaging word vectors across the sentence. Doc2Vec is trained on whole documents (here, sentences) and should therefore create a better representation of each sentence.

Implementation

Gensim doesn’t ship a pretrained Doc2Vec model (the implementation is there, but you would normally train it yourself), so I found a pretrained version online, though I’m not sure what it was trained on.

# Model downloaded from https://ai.intelligentonlinetools.com/ml/text-clustering-doc2vec-word-embedding-machine-learning/
# https://ibm.ent.box.com/s/3f160t4xpuya9an935k84ig465gvymm2
import numpy as np
import pandas as pd
import gensim.models as g
from tqdm import tqdm
from sklearn.ensemble import RandomForestClassifier
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

# Load the unzipped model, saved locally
model_path = "../doc2vec/doc2vec.bin"
m = g.Doc2Vec.load(model_path)
# Instantiate the spaCy tokenizer
nlp = English()
tokenizer = Tokenizer(nlp.vocab)

# Loop through the texts and infer a vector for each one
a = []
for text in tqdm(X_train['text']):
    a.append(m.infer_vector([str(word) for word in tokenizer(text)]))

X_train_vec = pd.DataFrame(np.array(a))

a = []
for text in tqdm(X_test['text']):
    a.append(m.infer_vector([str(word) for word in tokenizer(text)]))

X_test_vec = pd.DataFrame(np.array(a))
# Doc2Vec doesn't have feature names
model = RandomForestClassifier(n_estimators=500, n_jobs=8)
model.fit(X_train_vec, y_train)
model.score(X_test_vec, y_test)

Visualisation

I was expecting big things from this model but it didn’t deliver. The far left region is tweets with @’s while the far right is mainly URLs. It’s great the model picks up on these (despite encoding full sentences) but we are looking for more nuance than this.

My above comments are reflected in the model’s performance, with this performing about as badly as GloVe.

Transformer-based Models

I’m not going to go into too much detail here, but it is worth understanding transformer-based models: since the release of Google’s ‘Attention Is All You Need’ paper in 2017, this architecture has driven the explosion in state-of-the-art NLP models we’ve seen over the last few years.

Even though these models were released only recently and were trained on enormous datasets, we can still access them using high-level Python libraries. Yes, we can leverage state-of-the-art deep learning models using just a few lines of code.

Universal Sentence Encoder

Universal Sentence Encoder Visually Explained

Google’s Universal Sentence Encoder includes a transformer architecture and a Deep Averaging Network. When released, it achieved state-of-the-art results: unlike the approaches above, which average pretrained word vectors after the fact, the Universal Sentence Encoder is trained to produce embeddings for whole sentences directly, so each word’s contribution is learned in context.

The main benefits of using this over Word2Vec are:

  1. It is incredibly easy to use via TensorFlow Hub. The model automatically generates an embedding for the whole sentence.
  2. The model captures word order and context much better than Word2Vec.

For more information, see this explanation and Google’s download page.

Implementation

This is one of the easiest models to implement.

import numpy as np
import pandas as pd
import tensorflow_hub as hub
from sklearn.ensemble import RandomForestClassifier

def embed_document(data):
    # load the Universal Sentence Encoder saved locally (downloaded from TF Hub)
    use = hub.load("../USE/")
    # embed each text and stack the results into a DataFrame (one 512-dim row per text)
    embeddings = np.array([np.array(use([i])) for i in data])
    return pd.DataFrame(np.vstack(embeddings))

# vectorize the data
X_train_vec = embed_document(X_train['text'])
X_test_vec = embed_document(X_test['text'])
# USE doesn't have feature names
model = RandomForestClassifier(n_estimators=500, n_jobs=8)
model.fit(X_train_vec, y_train)
model.score(X_test_vec, y_test)
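
The snippet above loads the model from a local folder; as an alternative (not what the original code does), it can be pulled straight from TensorFlow Hub and applied to a whole list of texts in one call, which is usually faster than embedding one sentence at a time:

import numpy as np
import pandas as pd
import tensorflow_hub as hub

# download directly from TF Hub (cached locally after the first call)
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# embed all texts in a single batched call
X_train_vec = pd.DataFrame(np.array(use(X_train['text'].tolist())))
X_test_vec = pd.DataFrame(np.array(use(X_test['text'].tolist())))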

Visualisation

Now this is interesting. There is a good separation between orange and blue. Hovering over the Tweets, it’s clear that semantically similar Tweets are close to each other.

If you ran the code, you’ll also notice that this model embeds sentences extremely quickly, which is a big benefit as NLP work can be slow due to large data sizes.

As expected, this model performs very well. Although accuracy is only slightly better than TF-IDF, there is uplift on every metric I looked at.

Interpretability is still an issue. One feature seems more important than the rest but what does it correspond to?

BERT

UKPLab/sentence-transformers

BERT stands for Bidirectional Encoder Representations from Transformers. It is a deep learning model with a transformer architecture. The model was trained in a similar way to Word2Vec, by masking out words in a sentence and getting the model to fill in the blanks. It was also trained to predict whether one sentence follows another, given a pair of input sentences.

BERT was trained on over 3 billion words from English Wikipedia and the BookCorpus dataset.

Under the hood, there are two key concepts:

Embeddings: A vector representation of words where similar words are ‘near’ to each other. BERT uses ‘WordPiece’ embeddings (a vocabulary of around 30k tokens), plus Sentence Embeddings to show which sentence each token is in, and Positional Embeddings which represent the position of each token in the sentence. The text can then be fed into BERT.
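
To see WordPiece in action, here is a small aside using the Hugging Face transformers library (not part of this post’s pipeline): common words map to a single token, while rarer words are split into pieces from the ~30k-token vocabulary.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# rare or unusual words are broken into sub-word pieces prefixed with '##'
print(tokenizer.tokenize('A derailment caused unforeseeable devastation'))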

Attention: The core idea is each time the model predicts an output word, it only uses parts of the input where the most relevant information is concentrated instead of the entire sequence. In simpler words, it only pays attention to some input words.

However, we don’t really need to worry about this as there are ways for us to generate embeddings with a few lines of code.

Implementation

BERT’s representations of words are very powerful. When fine-tuned, the model is able to capture semantic differences and word order very well.

The sentence-transformers package allows us to leverage pretrained BERT models that have been trained on specific tasks such as Semantic Similarity or Question Answering. This means our embeddings are specialised to a specific task. This also makes generating an embedding for a full sentence very easy.

In this example I use RoBERTa, which is an optimized version of BERT developed by Facebook.

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier

bert = SentenceTransformer('stsb-roberta-large')  # ~1.3 GB download
# vectorize the data: encode each text into a 1,024-dimensional vector
X_train_vec = pd.DataFrame(np.vstack(X_train['text'].apply(bert.encode)))
X_test_vec = pd.DataFrame(np.vstack(X_test['text'].apply(bert.encode)))
# BERT doesn't have feature names
model = RandomForestClassifier(n_estimators=500, n_jobs=8)
model.fit(X_train_vec, y_train)
model.score(X_test_vec, y_test)
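
Applying bert.encode row by row works, but encode also accepts a whole list of sentences and batches them internally, which is noticeably faster (a small optimisation on top of the snippet above):

# encode all texts in one call; batch_size and the progress bar are optional
X_train_vec = pd.DataFrame(bert.encode(X_train['text'].tolist(),
                                       batch_size=32, show_progress_bar=True))
X_test_vec = pd.DataFrame(bert.encode(X_test['text'].tolist(),
                                      batch_size=32, show_progress_bar=True))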

Visualisation

It’s hard to say whether this is better than the Universal Sentence Encoder version. My intuition is that this model has done a worse job of splitting the disaster and non-disaster tweets, but may have done a better job of clustering similar topics. Unfortunately, I can only measure the former at the moment!

This model is objectively worse than the Universal Sentence Encoder. One feature is much more important than the others; I expect this corresponds to URLs, and perhaps the model is weighting it too strongly but can’t pick up any detail from the other 1,023 dimensions.

Conclusions

We explored multiple methods of turning words into numbers. On this dataset, Google’s Universal Sentence Encoder performs best. It’s worth trying this and BERT for most applications as they perform very well. I think Word2Vec is a little outdated nowadays, with methods such as USE being so quick and powerful.

The way many of us learn NLP for the first time is by doing a sentiment analysis project with a Bag of Words representation of text. This is a great way to learn, but I feel it takes away a lot of the excitement of NLP. There isn’t much difference between a Bag of Words and one-hot encoded data. The models produced aren’t particularly effective and will rarely capture any nuance in the text. We can employ BERT embeddings just as easily, and this usually leads to huge performance increases.

As a final point, it is always worth considering model interpretability and explainability. With Bag of Words methods, we can clearly say which words influence the model. With the BERT models, we can easily say which position in the vector influences the model, but it takes considerable effort (and may be almost impossible) to say exactly what each dimension means. An outstanding question, which I may tackle at a later date, is: did BERT use URLs to predict disaster? Or did it understand the language better?
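
One cheap way to start probing that question (my suggestion, not an experiment from this post) is to strip URLs before embedding and compare the score against the run that kept them:

import re

def strip_urls(text):
    # remove anything that looks like a URL so the model can only use the language itself
    return re.sub(r'https?://\S+', '', text)

X_train_vec = pd.DataFrame(np.vstack(X_train['text'].apply(strip_urls).apply(bert.encode)))
X_test_vec = pd.DataFrame(np.vstack(X_test['text'].apply(strip_urls).apply(bert.encode)))

model = RandomForestClassifier(n_estimators=500, n_jobs=8)
model.fit(X_train_vec, y_train)
print(model.score(X_test_vec, y_test))  # compare with the score that kept URLs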

Learn More

Clearing up Logistic Regression

The Must Learn Technical Skills in 2021 for Data Scientists and Analysts

Contact Me

Adam Shafi – Data Scientist – Capgemini | LinkedIn

