Word2Vec Implementation

How to implement Word2Vec using NumPy and Python

Rahuljha
Towards Data Science


This article is about the implementation of a very popular word embedding technique known as Word2Vec. It was developed by Tomas Mikolov and his team at Google.

Contents:

A — Concept

  • Objective
  • Introduction
  • Core idea
  • Architecture

B — Implementation

  • Data preparation
  • Model training
  • Model inference and analysis

Objective

The objective of this article is to show the inner workings of Word2Vec in Python using NumPy. I will not be using any other libraries for that. This implementation is not an efficient one, as the purpose here is to understand the mechanism behind it. You can find the official paper here.

Introduction

Computers only understand the language of numbers. The way we encode our text data as numbers strongly influences the outcome. In general, there are three techniques used to perform this task.

Out of these, word2vec performs incredibly well on NLP tasks. The core idea behind it is very simple, yet it produces amazing results.

Core idea

“A man is known by the company he keeps”

Aesop

This is a very well-known saying, and word2vec works primarily on this idea: a word is known by the company it keeps. It sounds strange, even funny, but it gives amazing results. Let us try to understand it a bit more.

Some sample sentences:

  • The rewards of all your hard work in the garden are easy to see.
  • It is hard work to keep my room in proper order.
  • The hard work began to tell on him.

Here we can easily see that the words "hard" and "work" occur in close vicinity. This is easy for a human being to observe, but for computers it is a difficult task. So when we vectorize these words (turn them into numbers), it seems obvious that their representations as numbers should be similar or close. This is exactly what word2vec achieves, and with very good results. Enough said, now it is time to get our hands dirty!

Architecture

With the core idea understood, we can see that the algorithm is based on recognizing words that occur in each other's vicinity. In other words, if the computer learns that the words "hard" and "work" occur in close proximity to each other, it will learn their vectors accordingly.

If we say that our "target" word is "hard", for which we need to learn a good vector, we provide the computer with its nearby word, the "context" word, which in this case is "work" (among others such as "the", "began", "is").

There are two main architectures that try to learn the above: skip-gram and CBOW.

Skip-gram : We have already seen the concept of target and context words. This model tries to learn the context words for each target word.

Fig_1: Skip-gram architecture. Credits

Intuition:

Text : ['Best way to success is through hard work and persistence']

So for this model our input will be like :

  • Target word : Best. Context words : (way).
  • Target word : way. Here we have two words, one before and one after, so in this scenario our context words will be: (Best, to).
  • Target word : to. Here we also have one word before and one after, i.e. (way, success). But if we think again, "to" and "is" can be found near each other in other sentences, like "He is going to the market". So it is a good idea to include the word "is" in the list of context words. But then we could argue for "Best" or "through" as well. This is where the concept of "window size" comes in: the window size is the number of nearby words on each side of the target that we decide to consider. If the window size is 1, our list of context words is (way, success); if the window size is 2, it becomes (Best, way, success, is).
  • Similarly, we can formulate the context lists for the rest of the words (see the sketch below).
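To make the window logic concrete, here is a minimal sketch (illustrative code, not the notebook's) that prints the context words for every target word of the example sentence with a window size of 2:

words = "best way to success is through hard work and persistence".split()
window_size = 2

for i, target in enumerate(words):
    # Take up to window_size words on each side of the target word
    context = words[max(0, i - window_size):i] + words[i + 1:i + 1 + window_size]
    print(target, "->", context)

# best -> ['way', 'to']
# way  -> ['best', 'to', 'success']
# to   -> ['best', 'way', 'success', 'is']
# ...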

In the figure we see that there is an input layer, a hidden layer and an output layer. We can also see that there are two sets of weights (W, W').

CBOW: Continuous bag of words. Conceptually, this is just the opposite of skip-gram: here we try to predict the target word from a list of context words. For our example, the input will be (Best, way, success, is) and we will need to predict "to" from it.

Fig_2: CBOW model architecture

We can see here that it is just the opposite of the skip-gram model.

Implementation process

In this article, I will be implementing the skip-gram model.

Data preparation

To train a model to learn good vectors for words, we need a huge amount of data. But in this article I will showcase the workings on a very small data set, consisting of plots of various Jeffrey Archer stories taken from Wikipedia.

Step 1: Read data from a file

Explanation:

  • Line 2–4 : Read the contents of the text file into a list
  • Line 7–8: Keep only alphabetic characters and remove everything else from each line
  • Line 9–17: Iterate over each word in the sentence and remove stopwords if specified
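As a rough illustration, the preprocessing described above could be sketched like this (the function name read_and_clean and its structure are assumptions, not the notebook code):

import re

def read_and_clean(file_path, stop_words=None):
    # Read the contents of the text file into a list of lines
    with open(file_path, 'r') as f:
        lines = f.readlines()

    sentences = []
    for line in lines:
        # Keep only alphabetic characters; replace everything else with spaces
        line = re.sub('[^a-zA-Z]', ' ', line)
        # Iterate over each word and remove stopwords if specified
        words = [w for w in line.lower().split()
                 if not stop_words or w not in stop_words]
        if words:
            sentences.append(words)
    return sentences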

Step 2: Generate variables

We will need some variables that will come in handy in later sections.

word_to_index : A dictionary mapping each word to an integer value, e.g. {'modern': 0, 'humans': 1}
index_to_word : A dictionary mapping each integer value to a word, e.g. {0: 'modern', 1: 'humans'}
corpus : The entire data consisting of all the words
vocab_size : The number of unique words in the corpus

Explanation:

  • Line 10 : Convert each word to lowercase
  • Line 12–15: Update the dictionaries with the word and its index if the word is not already present in the dictionary

The output of this code will give:

text = ['Best way to success is through hardwork and persistence']
Number of unique words: 9
word_to_index :  {'best': 0, 'way': 1, 'to': 2, 'success': 3, 'is': 4, 'through': 5, 'hardwork': 6, 'and': 7, 'persistence': 8}
index_to_word :  {0: 'best', 1: 'way', 2: 'to', 3: 'success', 4: 'is', 5: 'through', 6: 'hardwork', 7: 'and', 8: 'persistence'}
corpus: ['best', 'way', 'to', 'success', 'is', 'through', 'hardwork', 'and', 'persistence']
Length of corpus : 9
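One possible way to build these variables from the cleaned sentences (a sketch under the assumption that each sentence is a list of lowercase words; the notebook's exact code may differ):

def generate_variables(sentences):
    word_to_index, index_to_word, corpus = {}, {}, []
    for sentence in sentences:
        for word in sentence:
            word = word.lower()
            corpus.append(word)
            # Add the word to both dictionaries only if it is not already present
            if word not in word_to_index:
                index = len(word_to_index)
                word_to_index[word] = index
                index_to_word[index] = word
    vocab_size = len(word_to_index)
    return word_to_index, index_to_word, corpus, vocab_size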

Step 3: Generate training data

Before seeing the code let us first understand some concepts.

  • One-hot-vector : Basically, this is a way to encode data with 0's and 1's. Say we have the text "hi john". Our one-hot-vectors will be of size 2, as we have two words, and we will have two separate vectors, one for "hi" and one for "john". Example: (word: hi, one-hot-vector: [1, 0]), (word: john, one-hot-vector: [0, 1])

The following code generates one-hot-vectors for our data:

Explanation:

text = ['Best way to success is through hardwork and persistence']
Window size = 2, Vocab size = 9

We set the indices to 1 according to the word_to_index dict, i.e. best: 0, so we set the 0th index to 1 to denote that word.

Target word = best
Context words = (way, to)
Target_word_one_hot_vector = [1, 0, 0, 0, 0, 0, 0, 0, 0]
Context_word_one_hot_vector = [0, 1, 1, 0, 0, 0, 0, 0, 0]

Target word = way
Context words = (best, to, success)
Target_word_one_hot_vector = [0, 1, 0, 0, 0, 0, 0, 0, 0]
Context_word_one_hot_vector = [1, 0, 1, 1, 0, 0, 0, 0, 0]
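A NumPy sketch of how these one-hot vectors could be built (the helper name one_hot_vectors is illustrative, not taken from the notebook):

import numpy as np

def one_hot_vectors(target_word, context_words, vocab_size, word_to_index):
    # Vector with a single 1 at the index of the target word
    target_vector = np.zeros(vocab_size)
    target_vector[word_to_index[target_word]] = 1

    # One combined vector with 1's at the indices of all context words
    context_vector = np.zeros(vocab_size)
    for word in context_words:
        context_vector[word_to_index[word]] = 1

    return target_vector, context_vector

# one_hot_vectors('best', ['way', 'to'], 9, word_to_index)
# -> [1. 0. 0. 0. 0. 0. 0. 0. 0.], [0. 1. 1. 0. 0. 0. 0. 0. 0.]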

There is an alternate way to generate the context_word_one_hot_vector.

Target word = best    
Context words = (way,to)
Target_word_one_hot_vector = [1, 0, 0, 0, 0, 0, 0, 0, 0]
Context_word_one_hot_vector = [0, 1, 0, 0, 0, 0, 0, 0, 0],[0, 0, 1, 0, 0, 0, 0, 0, 0]

Instead of having the indices present in a single list, we have two different lists. The problem with this approach is that it takes a lot of space as the size of our data increases. I have illustrated this in a python script; refer to the code to see the difference.

Now we have the vectors generated for the target word and context words. To train a model, we need the data in the form of (X, Y), i.e. (target_words, context_words).

This is achieved by the following code:

Explanation:

text = ['Best way to success is through hardwork and persistence']
  • Line 7: Iterate over the corpus
  • Line 9: Set the ith word as the target word
  • Line 14, 21, 27 : Conditions to check whether the ith word from line 9 is the first ("best"), a middle ("way") or the last ("persistence") word
  • Line 17 : If it is the first word, get the next 2 (window_size = 2) words and set them as context words
  • Line 21 : If it is the last word, get the previous 2 (window_size = 2) words and set them as context words
  • Line 30, 37 : If the ith word is a middle word, get the 2 (window_size = 2) words before it and the 2 (window_size = 2) words after it and set all 4 as the context words. If there is only 1 word before or after the ith word, we take only that 1 word.

Example:

**************************************************
Target word:best . Target vector: [1. 0. 0. 0. 0. 0. 0. 0. 0.]
Context word:['way', 'to'] .
Context vector: [0. 1. 1. 0. 0. 0. 0. 0. 0.]
**************************************************
Target word:way . Target vector: [0. 1. 0. 0. 0. 0. 0. 0. 0.]
Context word:['best', 'to', 'success'] .
Context vector: [1. 0. 1. 1. 0. 0. 0. 0. 0.]
**************************************************
Target word:hardwork . Target vector: [0. 0. 0. 0. 0. 0. 1. 0. 0.]
Context word:['through', 'is', 'and', 'persistence'] .
Context vector: [0. 0. 0. 0. 1. 1. 0. 1. 1.]
**************************************************
Target word:and . Target vector: [0. 0. 0. 0. 0. 0. 0. 1. 0.]
Context word:['hardwork', 'through', 'persistence'] .
Context vector: [0. 0. 0. 0. 0. 1. 1. 0. 1.]
**************************************************
Target word:persistence . Target vector: [0. 0. 0. 0. 0. 0. 0. 0. 1.]
Context word:['and', 'hardwork'] .
Context vector: [0. 0. 0. 0. 0. 0. 1. 1. 0.]
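Putting the window logic and the one-hot encoding together, the training-data generation described above can be sketched as follows (reusing the hypothetical one_hot_vectors helper from earlier; not the notebook's exact code):

def generate_training_data(corpus, window_size, vocab_size, word_to_index):
    training_data = []
    for i, target_word in enumerate(corpus):
        # Context words: up to window_size words on each side of the target
        context_words = corpus[max(0, i - window_size):i] + \
                        corpus[i + 1:i + 1 + window_size]
        target_vector, context_vector = one_hot_vectors(
            target_word, context_words, vocab_size, word_to_index)
        training_data.append((target_vector, context_vector))
    return training_data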

Model training

Fig_3: Training of a skip-gram model

The above image shows how the neural network, in this case a skip-gram model, is trained. Here we can see that there is only one hidden layer. In deep learning we usually have several hidden layers, and this is where skip-gram stands out: despite having just a single hidden layer, it is a state-of-the-art algorithm.

Forward propagation

Here we have the following parameters:

  • Input: The input we give to the model; in our scenario, the target words
  • W_1 or weight_inp_hidden : The first set of weights, which is multiplied with the input to give the hidden layer.
  • W_2 or weight_hidden_output : The second set of weights, which is multiplied with the hidden layer to give the output scores.
  • Softmax layer: The final layer, which squishes the output scores into probabilities between 0 and 1. A good explanation of the softmax function can be found here.
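A minimal NumPy sketch of this forward pass (shapes assume a vocabulary of size V and an embedding dimension N; the names mirror the parameters listed above, but the code itself is illustrative):

import numpy as np

def softmax(x):
    # Numerically stable softmax: squish the scores into probabilities
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def forward_prop(weight_inp_hidden, weight_hidden_output, target_word_vector):
    # weight_inp_hidden (W_1):    shape (V, N)
    # weight_hidden_output (W_2): shape (N, V)
    # target_word_vector:         one-hot vector of shape (V,)
    hidden_layer = np.dot(weight_inp_hidden.T, target_word_vector)  # shape (N,)
    u = np.dot(weight_hidden_output.T, hidden_layer)                # shape (V,)
    y_pred = softmax(u)                                             # shape (V,)
    return y_pred, hidden_layer, u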

Training error:

Now, once we have done one round of forward propagation, we get some output values. Obviously, there will be some error in our prediction as compared to the original values.

Explanation: This error is back-propagated to update the weights for the next iteration.

Example: suppose we have 2 context words. These are not actual values; they are just for showing how the error is calculated.

context_words = [1 0 0 1 0]
y_pred = [9 6 5 4 2]

If we break the context_word vector apart into [1 0 0 0 0] and [0 0 0 1 0] (1's at indices 0 and 3), the error should be calculated as:

diff_1 = y_pred - context_word_vector_1 =
[9 6 5 4 2] - [1 0 0 0 0] = [8 6 5 4 2]
diff_2 = y_pred - context_word_vector_2 =
[9 6 5 4 2] - [0 0 0 1 0] = [9 6 5 3 2]
Total_error = diff_1 + diff_2 (column-wise) = [17 12 10 7 4]

Since our context vector is only a single array, we implement the above as:

index_of_1_in_context_words -> Line (6,7)
A dictionary which has the index of 1's in the context_word_vector -> {0: 'yes', 3: 'yes'}
number_of_1_in_context_vector -> a count of the above -> 2

We loop over the y_pred array and do the calculations as:

for i, value in enumerate(y_pred):

Line (13,14) if the ith index of y_pred has a 1 in context_word_vector:
total_error[i] -> i:0 . y_pred[i]:9. -> (9-1) + (1*9)
-> error_calculated: 17
total_error[i] -> i:3 . y_pred[i]:4. -> (4-1) + (1*4)
-> error_calculated: 7

Line (15,16) else:
total_error[i] -> i:1 . y_pred[i]:6. -> 6*2
-> error_calculated: 12
total_error[i] -> i:2 . y_pred[i]:5. -> 5*2
-> error_calculated: 10
total_error[i] -> i:4 . y_pred[i]:2. -> 2*2
-> error_calculated: 4

total_error -> [17 12 10 7 4]
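Because the combined context vector has C ones, summing (y_pred - one_hot_c) over all C context words reduces to C * y_pred - context_vector. A short NumPy sketch (illustrative, not the notebook code) that reproduces the worked example:

import numpy as np

def calculate_error(y_pred, context_vector):
    # Number of context words = number of 1's in the combined context vector
    num_context_words = np.sum(context_vector)
    # Equivalent to summing (y_pred - one_hot_c) over every context word c
    return num_context_words * y_pred - context_vector

y_pred = np.array([9, 6, 5, 4, 2])           # illustrative values, not probabilities
context_vector = np.array([1, 0, 0, 1, 0])
print(calculate_error(y_pred, context_vector))   # [17 12 10  7  4]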

Back propagation

With the error calculated above we need to update the weights (W_1 and W_2) so that our network tries to rectify the error.

Here is a good link that explains the derivative equations for back propagation.
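For reference, the weight updates for one skip-gram step can be written in a few lines of NumPy (a sketch of the standard skip-gram gradients, assuming the shapes used in the forward-pass sketch above; it is not copied from the notebook):

import numpy as np

def backward_prop(weight_inp_hidden, weight_hidden_output,
                  total_error, hidden_layer, target_word_vector, learning_rate):
    # Gradient of W_2: outer product of the hidden layer and the total error
    dl_weight_hidden_output = np.outer(hidden_layer, total_error)               # (N, V)
    # Propagate the error back through W_2, then take the outer product with the input
    dl_weight_inp_hidden = np.outer(target_word_vector,
                                    np.dot(weight_hidden_output, total_error))  # (V, N)

    # Gradient descent update
    weight_inp_hidden -= learning_rate * dl_weight_inp_hidden
    weight_hidden_output -= learning_rate * dl_weight_hidden_output
    return weight_inp_hidden, weight_hidden_output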

Finally loss calculation

The loss is calculated as:

Fig_4: Loss function. Credits

If we look a bit deeper into the loss function E, we see that we are trying to optimize the probability of finding the correct context words, p(WO,1, WO,2, · · · , WO,C), given our input word WI. So the loss decreases as we come nearer to a distribution that finds the correct context words for each given target word.

Explanation:

The loss function comprises two parts.

  1. 1st part : We take the negative of the sum of the values of u at the positions where the context vector is 1 (i.e. at the context words)
  2. 2nd part : We take the log of the sum of exp(u), where u is the output we get after multiplying the second set of weights with the hidden layer, and multiply it by the number of context words
u : [ 0.3831286   0.89608496  2.69426738 -1.60230182  0.45482701  0.73644591  1.10365796  1.1675781  -0.78555069]
context: [0, 1, 1, 0, 0, 0, 0, 0, 0]

sum_1 = -(0.89608496 + 2.69426738)
sum_2 = number_of_context_words * np.log(np.sum(np.exp(u)))
total_loss = sum_1 + sum_2
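The same breakdown in code (a sketch; it assumes u and the combined context vector are NumPy arrays coming from the forward pass and data-preparation steps above):

import numpy as np

def calculate_loss(u, context_vector):
    # 1st part: negative sum of the scores u at the positions of the context words
    sum_1 = -np.sum(u[context_vector == 1])
    # 2nd part: number of context words times log of the sum of exp(u)
    num_context_words = np.sum(context_vector)
    sum_2 = num_context_words * np.log(np.sum(np.exp(u)))
    return sum_1 + sum_2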

Hyperparameters

By now we can see that a lot of variables are involved in the process. A very big part of training is finding the right set of variables that gives the best result. Let us go through these variables one by one.

  1. Window size : The number of nearby words on each side of the target word that we treat as context words.
  2. Learning rate : If we look at the backward-propagation code, we see that the gradients are multiplied by the learning rate. This is part of the optimization process called gradient descent. A good learning rate determines how quickly our model reaches its optimum.
  3. Epochs : Say we have 100 training examples. Running the entire process above once over all 100 records counts as 1 epoch.
  4. Dimension : When our final model is ready, we get a vector for each word. That vector can have various dimensions, typically ranging from 50 to 300.
  5. Stopwords : These are words like "a", "an" and "the". They carry little meaning by themselves, so we can check how our model works with and without them.

I have done some fun experiments with window size, epochs, dimensions and stopwords, while keeping the learning rate at 0.01.

Model inference and analysis

To get the word vector for each word after the final training, the first set of weights, i.e. weight_1, is used to retrieve the vector for each word.

To find similar sets of words, cosine similarity is used. It is a metric that measures how similar two vectors are in a multidimensional space; essentially, it measures the angle between two vectors to determine how similar they are. A good explanation of it can be found here.
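Cosine similarity itself takes only a couple of lines of NumPy (a sketch; looking up a word's vector as a row of the trained weight_1 matrix via word_to_index is an assumption based on the retrieval step described above):

import numpy as np

def cosine_similarity(vec_a, vec_b):
    # Cosine of the angle between the two vectors: 1 = same direction, 0 = orthogonal
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

# Example usage (weight_1 holds one row per word, indexed via word_to_index):
# sim = cosine_similarity(weight_1[word_to_index['hardwork']],
#                         weight_1[word_to_index['persistence']])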

Now I used two different sets of data: one a single line of text and the other a text corpus.

Inference with a single line of text as input

Text : [‘best way to success is through hardwork and persistence’]

The scatter plots shown below are in 2 dimensions. This is done through dimensionality reduction; I have used t-SNE for it. The figure below does not give an exact picture, as we are compressing several dimensions into 2.
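One way such a 2-D plot can be produced (a sketch using scikit-learn's TSNE and matplotlib, which are only used for visualization, not for the NumPy-only training code):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(weight_1, index_to_word):
    # Reduce each word vector (one row of the trained weight_1 matrix) to 2 dimensions
    points = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(weight_1)

    for i, (x, y) in enumerate(points):
        plt.scatter(x, y)
        plt.annotate(index_to_word[i], (x, y))
    plt.show()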

Scatter plot with varying dimension:

Fig_5: Varying dimension

We can see here how close "best" and "way" are. Even though a single line of text is very scarce data for a model to train on, it still learns good embeddings.

Scatter plot with varying window size:

Fig_6: Varying window size

Inference with a relatively larger corpus

To see how well the model works, I printed a similarity matrix for some words using a larger corpus.

Varying dimension :

Fig_7: Varying dimension similarity matrix

One thing to observe here is that for higher dimensions, the similarity numbers are lower. The reason is that the corpus is very small, with only around 700 unique words. To learn embeddings of a larger dimension, a huge corpus is needed; the word2vec algorithm is typically trained on corpora of millions of words.

Stopwords:

Fig_8: Stopwords effect

The model seemed to perform better without stopwords, as the loss curve was better for each epoch. To understand this better, we would need to try it on a bigger corpus.

The scatter plots for the above can be found at my github link here.

Further improvements:

Training word2vec is a very computationally expensive process. With millions of words, the training may take a lot of time. Some methods to counter this are negative sampling and hierarchical softmax. A good link to understand both can be found here.

The entire code can be found at my github repository :

https://github.com/rahul1728jha/Word2Vec_Implementation/blob/master/Word_2_Vec.ipynb

Please leave comments for any clarifications or questions.

Happy learning 😃
