
Word2Vec: Out of the Black Box

Best is to Worst as Catan is to Monopoly- but how does it know?


By: Jenna Jones and Nick Paine

Photo by Uriel SC on Unsplash

Introduction

How does a computer learn to interpret or even understand human language? This is what we set out to understand when we began researching Natural Language Processing (NLP). One simple but effective tool is Word2Vec, an algorithm created by Mikolov et al. in 2013 [1].

Word2Vec takes a large body of text, which can be in any language, and creates a vector representation of words and the context of those words in relation to each other. Once in vector form, the computer processes the word data numerically, allowing it to capture semantic and syntactic characteristics of words and word combinations.

Trained models can answer queries like:

‘Best’ is to ‘Worst’ as ‘Catan’ is to ? and the network answers ‘Monopoly’

and the ‘hello world’ of NLP:

‘Man’ is to ‘King’ as ‘Woman’ is to ? and the network answers ‘Queen’

We consider these examples to be an amazing feat of textual analysis, especially as the analysis is probabilistic.

Terminology

  • vocabulary: the set of unique words in the corpus (V represents the size of the vocabulary).
  • corpus: the full body of raw text used to train the network.
  • word embedding: the vector representation of a word, i.e., its word vector.
  • center word: the word analyzed- this is the word in the center of the context window.
  • context word: the words around each center word within the context window. Their relative position is not recorded or used in the analysis.
  • context window: the x words on either side of the center word, for each word in the corpus. The size of this window can vary.
  • epoch: one epoch is a forward pass and backward propagation through the network.
  • one hot vector: a vector of dimension V with all zeros except a single 1 in the position corresponding to a particular word.

Overview of the Algorithm

Word2Vec can use either of two algorithms: CBOW (continuous bag of words) or Skip-Gram. We will only use the Skip-Gram neural network algorithm in this article.

This algorithm uses a center word to predict the probability of each word in the vocabulary V being a context word within a chosen window size.

The center word is presented to the input layer and the output of the prediction layer is the probability of each word in the vocabulary being a context word. The remaining part of the algorithm is discussed in detail below.

To train the network, we apply each word from the vocabulary to the input layer (a series of one hot vectors) and adjust the weights in the two weight matrices to obtain an output prediction from the network as close as possible to the expected output.

To make sense of all of this, we will work through an example together but to summarize, the steps are as follows:

  • Put data through the network
  • Evaluate the error of the output (loss function) by comparing the actual data to the data predicted by the network
  • Update the weights in the network by using a method called gradient descent to reduce the error
  • Repeat until the error is minimized
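In practice, this whole pipeline is available off the shelf in libraries. As a minimal sketch (assuming gensim version 4 or later, where the dimensionality parameter is called vector_size), training a Skip-Gram model on the short corpus we will work through by hand below looks like this:

```python
from gensim.models import Word2Vec

# Our example corpus as a single tokenised sentence
sentences = [["every", "rose", "has", "its", "thorn",
              "just", "like", "every", "night", "has", "its", "dawn"]]

# sg=1 selects the Skip-Gram algorithm; vector_size=3 matches the
# 3 hidden-layer neurons used in the worked example below
model = Word2Vec(sentences, vector_size=3, window=2, sg=1,
                 min_count=1, epochs=50)

print(model.wv["thorn"])   # the learned 3-dimensional word vector for 'thorn'
```

In the rest of the article, however, we build each of these steps by hand to see what is going on inside.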

Example

We are going to show how Word2Vec works by processing a short example corpus:

"Every rose has its thorn just like every night has its dawn"

This is a corpus of 12 words and 9 unique words ("every", "has" and "its" appear more than once). Therefore, our vocabulary size, V = 9.
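As a small sketch to accompany the worked example (the variable names are our own and will be reused in the later snippets), the vocabulary can be built like this:

```python
corpus = "Every rose has its thorn just like every night has its dawn"

tokens = corpus.lower().split()                        # 12 words in total
vocab = sorted(set(tokens))                            # 9 unique words
word_to_index = {word: i for i, word in enumerate(vocab)}
V = len(vocab)                                         # V = 9
```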

Before we get started with our algorithm, we need to do some preprocessing.

Part A- Preprocessing

a. Create a co-occurrence matrix

We collect information about which words occur next to each other- this is called co-occurrence data and is held in a co-occurrence matrix. To obtain this information, a context window is used.

Image by Author, inspired by Stanford XCS224N lecture notes. This diagram shows center words in blue, context words in grey and a context window size of 2

In our example, we use a context window of 2; however, the choice of window size is arbitrary and is usually determined by trial and error. We will run over the entire corpus and use each word in the vocabulary as a center word in turn, recording every word that is within the context window (either in front of or behind the center word). This is done by adding a 1 to the dimension of the co-occurrence vector corresponding to each context word that appears.

As in the above diagram, suppose we want to record the words that occur close to ‘thorn’. We will record 4 words: ‘has’, ‘its’, ‘just’, ‘like’. This process is then repeated for the next word, ‘just’. The words at the beginning and end of the corpus will have only 2 context words.

After completing this process, each word is associated with a co-occurrence vector (combined below to form a 9×9 matrix), and this data is used to train the network.

9×9 co-occurrence matrix
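Continuing the sketch above (and reusing tokens, word_to_index and V), one way to build this co-occurrence matrix with a window of 2 is:

```python
import numpy as np

window = 2
co_occurrence = np.zeros((V, V))

for center_pos, center_word in enumerate(tokens):
    for offset in range(-window, window + 1):
        context_pos = center_pos + offset
        # skip the center word itself and positions outside the corpus
        if offset == 0 or context_pos < 0 or context_pos >= len(tokens):
            continue
        context_word = tokens[context_pos]
        co_occurrence[word_to_index[center_word], word_to_index[context_word]] += 1
```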

b. Initialize two randomly weighted matrices

Here, we initially set random values for each of the two weight matrices in the network to start it off. These weights will change over time and become meaningful as the network ‘learns’. The weights themselves do not mean anything in the human language- they don’t represent a characteristic of the word. They are the means by which the whole network passes information from layer to layer and will change as the network learns relationships between inputs and outputs.

The dimensions of the matrices are based on the number of unique words in our vocabulary and the number of hidden layer neurons. For our example, we have 9 unique words and have chosen 3 hidden layer neurons (this is a hyperparameter, meaning we choose what we think is an appropriate figure); therefore, our input and output weight matrices are of dimensions 9×3 and 3×9, respectively.
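In code, the two matrices only need to hold small random starting values; the range below is our own choice and is not significant:

```python
rng = np.random.default_rng(42)        # any seed; the starting values are arbitrary

N = 3                                  # number of hidden-layer neurons (a hyperparameter)
W_in = rng.uniform(-1, 1, (V, N))      # 9 x 3 input weight matrix
W_out = rng.uniform(-1, 1, (N, V))     # 3 x 9 output weight matrix
```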

Part B- Neural Network: Forward Pass

Step 1: Create a one-hot vector

A one-hot vector for V

Since the computer does not understand actual words, each word must be converted into a mathematical representation. This is referred to as one-hot encoding. We create a V-dimensional vector with one dimension for each unique word in our vocabulary. The order of the words is insignificant and does not represent any of their characteristics. All dimensions are set to 0 except the one corresponding to the center word being input into the network, in this case ‘thorn’, which is set to 1.

Mathematically,

the one-hot vector of a center word
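Continuing the sketch, the one-hot vector for the center word ‘thorn’ can be built like this:

```python
def one_hot(word):
    """Return the V-dimensional one-hot vector for a word in the vocabulary."""
    x = np.zeros(V)
    x[word_to_index[word]] = 1.0
    return x

x = one_hot("thorn")    # all zeros except a single 1 in the position for 'thorn'
```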

Step 2: Calculate the Output layer

Step 1
Step 2

To calculate the forward pass, we take the dot product of the one hot vector and input weight matrix (recall this weight matrix from the Preprocessing section and that we have chosen to use 3 neurons in the hidden layer). This dot product gives the hidden layer (called hidden because it is not part of what we see in the output). Since the input layer is all zeros except for the center word, the multiplication will yield only one row related to the center word. It is important to note that each row in the input weight matrix is actually the word vector used for each word! Therefore, if we were on our last epoch and the weight matrices were finalised, the word ‘thorn’ would be located in the 3-dimensional vector space at (-0.41, -0.63, 0.84).

We then take the dot product of the hidden layer and the output weight matrix (the other weight matrix from the Preprocessing section) to calculate the output layer with 9 neurons. The output of the network is a single vector containing one value for every word in the vocabulary.

In mathematical terms, the hidden layer is NH = x · WI and the output layer is O = NH · WO, where x is the one-hot vector of the center word, WI is the 9×3 input weight matrix and WO is the 3×9 output weight matrix.
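In code, again continuing the sketch above (where x, W_in and W_out were defined), the forward pass is just two dot products:

```python
hidden = x @ W_in        # (9,) @ (9, 3) -> (3,)  equals the row of W_in for 'thorn'
output = hidden @ W_out  # (3,) @ (3, 9) -> (9,)  one raw score per vocabulary word
```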

The end goal is to use the output layer to train the network (shown below in Part C) and use the input weight matrix to create word vectors.

Once the network has been trained, word vectors will place similar words close to each other in the vector space. ‘Similar’ here ranges from ‘dog’ and ‘dogs’ to ‘good’ and ‘great’ to ‘apple’ and ‘pear’.

If you are familiar with neural networks, you may be wondering why we haven’t applied an activation function like the sigmoid, tanh or ReLU functions to the hidden layer. For Word2Vec, this isn’t necessary: the point of the hidden layer (apart from the ability it gives a network to handle more complicated problems such as exclusive or, XOR) is to map the input word vectors into a lower dimensionality while maintaining separation between dissimilar words. Most activation functions "squash" some regions of the space and expand others, which reduces the amount of mapping space available to the words. On the other hand, the value of a hidden layer is usually compromised by a linear activation function (as used in Word2Vec) because the combination of two or more linear functions is itself linear. This is not a problem in Word2Vec because the input to the network is a one hot vector, so there is only ever one input to each neuron in the hidden layer.

Step 3: Use softmax to calculate the prediction

The output layer of the network consists of raw numbers, but we want to obtain a set of probabilities (i.e., the sum of the outputs needs to equal 1). In order to do this, the output layer must be passed through a final process. There are a number of methods that can be used, but one of the most popular is softmax. Each value in the output layer is processed and turned into a probability:

softmax function, where n is the number of dimensions in the output

So to calculate the probability of the word ‘just’ occurring within a window of two of ‘thorn’, we take the exponential of the output value for ‘just’ and divide it by the sum of the exponentials of all 9 output values.

The prediction vector shows, for each word in the vocabulary, the probability that it is within a window size of x of the center word. In our example, we hope that the prediction vector will show that the words ‘has’, ‘its’, ‘just’ and ‘like’ have a high probability of being within two words of ‘thorn’. Obviously, since the network is not yet trained, the predictions will be wrong at this stage but with training, we will see them converge to the correct figures.
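A minimal softmax implementation, continuing the sketch above, looks like this:

```python
def softmax(u):
    e = np.exp(u - np.max(u))    # subtracting the max improves numerical stability
    return e / e.sum()

prediction = softmax(output)                    # probabilities over all 9 vocabulary words
p_just = prediction[word_to_index["just"]]      # P('just' is a context word of 'thorn')
```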

Part C- Neural Network: Backpropagation

In this section, we outline the theory and math involved in backpropagation. If you would like to skip the theory and continue with the example, jump ahead to Step 4 below.

Backpropagation (or backprop in Machine Learning language) is an algorithm that calculates the error between the prediction and the actual figures known from the co-occurrence matrix for each input word, and then adjusts the weights in the two weight matrices to minimise this error. When the weights have been adjusted, the loss function is recalculated and the process is repeated until the loss function is smaller than a pre-determined value or a given number of iterations has been carried out.

To do this, we calculate a loss function, L, which measures the error between the predicted and actual figures for a known single fixed input/output pair of vectors. When completed, a new fixed input/output pair is then used for the calculation.

We then calculate the gradient of the loss function with respect to the elements in each of the weight matrices and use these gradients to follow the loss function down to a minimum as we change the weights.

The loss function we will use here is the squared error between the prediction and the expected output: L = ½ Σ (O'_i − Y_i)², summed over all words i in the vocabulary, where O' is the prediction vector and Y is the expected output vector.

There are a number of different ways to use the gradients, depending on the problem being solved and the computing overhead that can be tolerated: gradient descent, stochastic gradient descent, negative sampling, etc. We will use gradient descent in our example.

Backpropagation calculates the gradient of the loss function with respect to the weights efficiently, reusing intermediate results rather than computing the gradient for each individual weight directly, so the gradients do not have to be continuously recomputed. We show you how to do this in our example below.

This gives the change to the weights as Δw = −α ∂L/∂w, where α (alpha) is the learning rate.

A larger alpha value will make the network ‘learn’ more quickly but will also make it susceptible to overshooting the minimum. On the other hand, a lower alpha value will make the network ‘learn’ more slowly as the weights are not moved as much.

The weight is changed as: w_new = w_old + Δw = w_old − α ∂L/∂w.

Gradient Descent – Output Weight Matrix

To carry out the gradient descent (and indeed any of the other techniques to change the weights in the weight matrices) we need to compute the partial derivative of the loss function with respect to w as discussed above. Recall,

output of neuron i in the output layer
output from the network, the softmax function

A quick note on notation- i represents neuron i in the output layer, NH_j is the output of the j-th neuron in the hidden layer, and wO_ji is the weight in the output weight matrix between the j-th neuron in the hidden layer and the i-th neuron in the output layer.

We will use the chain rule of calculus to find the gradient of the loss function with respect to the output weight matrix: ∂L/∂wO_ji = (∂L/∂O'_i) × (∂O'_i/∂O_i) × (∂O_i/∂wO_ji).

We are going to break down each component in the equation:

the derivative with respect to the softmax output from the network is the difference between the ith softmax output vector and the expected output after softmax

For classification tasks, the cross entropy (log loss) function is generally chosen, while for regression tasks the squared error loss function is generally chosen. We have used the squared error loss function.

the derivative with respect to the output layer is equivalent to the softmax function
the derivative with respect to the output weight matrix is the hidden layer

Therefore, we can now update the output weight matrix:

Gradient Descent – Input Weight Matrix

Here, we just follow the same procedure as for the output weight matrix, applying the chain rule to the gradient of the loss function with respect to each weight wI_kj in the input weight matrix: ∂L/∂wI_kj = (∂L/∂NH_j) × (∂NH_j/∂wI_kj),

which can be expanded out to: ∂L/∂wI_kj = Σ_i (∂L/∂O'_i) × (∂O'_i/∂O_i) × (∂O_i/∂NH_j) × (∂NH_j/∂wI_kj).

We already have the first two derivatives from the output matrix gradient descent; therefore, we only need to calculate the remaining two.

Therefore, we can now update the input weight matrix:

Step 4: Train the network: Use Backpropagation

Now it is time to apply the gradient descent and use the formulae above to update the weight matrices- we start with the output weight matrix. We calculated the prediction O’ in Step 3, and we can obtain Y from the co-occurrence vector for ‘thorn’ by normalising it so that it sums to 1. In our example, 4 words are within a context window of 2 of ‘thorn’, so we attribute a probability of 0.25 to each of them. From this, we are able to calculate the gradient based on the formulae above.

We then use this gradient matrix to update the output matrix as:

assuming an alpha = 1, we just subtract the gradient of the output weight matrix from the original output weight matrix to achieve a new output weight matrix

The same process is replicated using the input weight matrix formula.
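To make the update concrete, here is a minimal sketch of a single weight update, continuing the variables from the forward-pass snippets above. For brevity it assumes the cross-entropy loss commonly paired with a softmax output (whose gradient with respect to the output scores is simply the prediction minus the target), rather than the squared-error loss used in the derivation above; the overall flow of the update is the same.

```python
# Target distribution for 'thorn': its co-occurrence row normalised to sum to 1,
# so each of the four context words gets probability 0.25 (as described above)
y = co_occurrence[word_to_index["thorn"]]
y = y / y.sum()

alpha = 1.0                                  # learning rate, as in the example above

d_output = prediction - y                    # gradient of the loss w.r.t. the output scores
grad_W_out = np.outer(hidden, d_output)      # (3, 9) gradient for the output weight matrix
grad_W_in = np.outer(x, W_out @ d_output)    # (9, 3); only the row for 'thorn' is non-zero

W_out -= alpha * grad_W_out                  # gradient-descent update of both matrices
W_in -= alpha * grad_W_in
```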

Step 5: Word Embedding Analysis

We have now shown one full epoch. After running through many epochs, the model will become trained. This is when we can begin to have some fun and do vector arithmetic with words. Some examples of analysis include computing the degree of similarity between words, completing analogies, and finding the word in a list that doesn’t belong. Of course, with our example network the corpus is too small to do any meaningful analysis, so we use Google’s pre-trained model, which has a vocabulary of 3 million words and a 300-dimensional hidden layer. We will end where we began and calculate: (Worst − Best) + Catan = ?

‘Best’ is to ‘Worst’ as ‘Catan’ is to ?

The network assigns ‘Monopoly’ as the answer with the highest similarity score of 0.77! We can also create a 2-dimensional Principal Component Analysis (PCA) projection to plot the word vectors in a scatter plot.

Additionally, we can look at similar words and see how they are grouped. We see that words with similar semantic meanings (such as the list of countries or the list of drinks) have successfully been grouped together.
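As an illustration of this kind of analysis, here is a sketch using gensim’s KeyedVectors with the pre-trained Google News model. It assumes the vector file has been downloaded locally (the file name below is the usual one, but treat it as an assumption) and that the queried words appear in the model’s vocabulary:

```python
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

# Load the pre-trained 300-dimensional Google News vectors
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# 'Best' is to 'Worst' as 'Catan' is to ?  ->  (Worst - Best) + Catan
print(kv.most_similar(positive=["Worst", "Catan"], negative=["Best"], topn=3))

# Words most similar to a given word
print(kv.most_similar("Monopoly", topn=5))

# Project a few word vectors down to 2 dimensions with PCA and plot them
words = ["France", "Germany", "Italy", "coffee", "tea", "juice"]
coords = PCA(n_components=2).fit_transform([kv[w] for w in words])
plt.scatter(coords[:, 0], coords[:, 1])
for word, (px, py) in zip(words, coords):
    plt.annotate(word, (px, py))
plt.show()
```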

Conclusion

We have taken the Word2Vec algorithm out of the box and shown how it works, giving you the mechanics to build your own analysis network.

This is an extremely powerful and surprisingly successful algorithm for analysing language and can also be a lot of fun with many applications.

A large corpus can be analyzed fairly quickly on a modern laptop with a programming environment like Python. We have only scratched the surface of Word2Vec, so with this introduction there are fascinating areas where you can continue your research, such as faster training techniques like negative sampling and stochastic gradient descent, and different types of networks like RNNs and LSTMs.

We have enjoyed exploring this fascinating science and hope that you have too.

All non-referenced images were created by the authors.

References

[1] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space (2013), arXiv:1301.3781

[2] C. McCormick, Word2Vec Tutorial – The Skip-Gram Model (2016, April 19), http://www.mccormickml.com

[3] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (2016), MIT Press

