Understanding the math and implementation of Word2Vec

Man is to king as woman is to what?
How can you train a Machine Learning algorithm to correctly predict the right answer, "queen"?
In this article, I’m going to talk about a natural language processing model, Word2Vec, that can be used to make such predictions.
In the Word2Vec model, every word of the vocabulary is represented by a vector. The length of this vector is a hyperparameter that we humans set. Let’s use 100 in this article.
Since every word is a vector of 100 numbers, you can think of every word as a point in a 100-dimensional space.
The end goal for Word2Vec is that similar words have similar vectors, and thus end up grouped closer together in this 100-dimensional space.
So, how do you train an algorithm so that similar words end up having similar vectors? And how does this help solve the analogy problem "man is to king as woman is to what?"
Part 1. Training Word2Vec with gradient descent
To train Word2Vec, you need a large corpus of text, for example the Brown Corpus. NLTK has compiled a list of corpora that you can access.
At its core, Word2Vec is all about predicting the probability of one word appearing near another word (in a sentence).

Take the sentence "…cherish the pale blue dot, the only home we’ve ever known." Here, "blue" is the center word, and "pale" is an outside or context word. We want the algorithm to predict the probability that "pale" is a context word of "blue." Note: in this article, I will be using "outside word" and "context word" interchangeably.
We calculate a probability for each context word that’s within a defined window size. If the window size is 1, we want to predict the probability of the word immediately preceding and following the center word. If the window size is 2, we want to predict the probability of the 2 words immediately preceding and following the center word.
Thus, the algorithm goes through every position in the corpus. At each position t, we calculate the probability of the context words within window w given the center word at position t.
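To make the sliding window concrete, here is a minimal sketch of how you could enumerate (center word, context word) pairs for a given window size. The helper below is my own illustration, not code from the gist mentioned later:

```python
def context_pairs(tokens, window=2):
    """Yield (center word, context word) pairs for every position in a tokenized corpus."""
    for t, center in enumerate(tokens):
        # Look at up to `window` words on each side of position t.
        for j in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if j != t:
                yield center, tokens[j]

sentence = "cherish the pale blue dot the only home we've ever known".split()
print(list(context_pairs(sentence, window=1)))
# With window=1, the center word "blue" contributes ("blue", "pale") and ("blue", "dot").
```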

A good model will give high probability for these context words, and a bad model low probability.
The difference between the actual and the predicted probabilities is factored into the loss, which is used in gradient descent to update the parameters (theta) of the model. And there you have it, the central theme of many supervised learning algorithms.

In Word2Vec, what exactly is theta? And what is the loss function?
Parameters and loss function of Word2Vec
Recall that every word is represented by a vector of 100 numbers. Also recall that every word can serve as a center word or a context word. Thus, every unique word in the corpus is represented by 2 vectors: one vector for it as the center word and another for it as a context word.

The parameters theta of Word2Vec are the center and outside vectors of all the unique words in the corpus. Theta is therefore a matrix with 2 × (vocabulary size) rows and 100 columns.
I use lowercase v to represent the center word vector of a word, and lowercase u to represent the context word vector of a word. Thus, v_zebra is the vector for zebra when zebra is the center word, and u_zebra is the vector for zebra when zebra is a context word of some other center word. Later on, I will use capital U to refer to the matrix containing all of the outside vectors u.
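As a rough sketch of what this parameterization could look like in NumPy (the variable names and the small random initialization are my own choices, not necessarily the gist’s):

```python
import numpy as np

vocab = ["aardvark", "blue", "pale", "zebra"]   # toy vocabulary
dim = 100                                       # length of each word vector

rng = np.random.default_rng(0)
# One center vector v and one outside (context) vector u per unique word,
# so theta holds 2 * len(vocab) rows of 100 numbers each.
center_vectors = rng.normal(scale=0.01, size=(len(vocab), dim))    # rows are the v vectors
outside_vectors = rng.normal(scale=0.01, size=(len(vocab), dim))   # rows are the u vectors

word2idx = {w: i for i, w in enumerate(vocab)}
v_zebra = center_vectors[word2idx["zebra"]]    # zebra as a center word
u_zebra = outside_vectors[word2idx["zebra"]]   # zebra as a context word
```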
The loss function we’ll use is the negative log likelihood (NLL) loss. NLL is used in many machine learning algorithms, like logistic regression.
The equation for NLL is:

$$J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-w \le j \le w \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)$$

We sum over the T positions in the corpus, and at each position t, we sum the negative log probability of each context word (within window w) given the center word at position t.
Thus, the loss of one context word o given a center word c is:

$$J(v_c, o, U) = -\log P(o \mid c)$$
How do you compute the probability? We use the Softmax equation. Below is how you calculate the probability of a context word (o) given a center word (c):

$$P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}$$

The numerator is the exponential of the dot product of the context vector of the context word o with the center vector of the center word c. The denominator is the sum, over every word in the vocabulary, of the exponential of the dot product of that word’s context vector with the same center vector. Wow, that was a mouthful!
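Here is a minimal NumPy sketch of this naive-softmax probability and loss for a single (center word, outside word) pair, assuming the outside vectors are stored as rows of a matrix as in the earlier snippet (the function name is my own, not the gist’s):

```python
import numpy as np

def naive_softmax_loss(center_vec, outside_idx, outside_vectors):
    """Negative log probability of one outside word given one center word."""
    scores = outside_vectors @ center_vec           # u_w . v_c for every word w in the vocabulary
    scores -= scores.max()                          # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # Softmax over the whole vocabulary
    return -np.log(probs[outside_idx])              # -log P(o | c)
```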
As you can probably already guess, this naive-softmax probability is expensive to compute, since we have to sum over the entire vocabulary. There is a more efficient way to compute the loss; I’ll describe it later.
In gradient descent, you take the derivative of the loss function with respect to the parameters to find how to update the parameters in each iteration.
For a given center word, we need to calculate the gradient for the center word vector as well as the gradients for all the context word vectors.

I won’t do the derivations here, but to compute the gradient for the center vector, you take the dot product of U and (y_hat minus y). U is, as I’ve mentioned, the matrix of all outside vectors of the entire vocabulary. To compute the gradients of the outside vectors, you take the dot product of the center vector and (y_hat minus y) transposed. y_hat in this case is the Softmax of the dot product of U and the center vector, and y is the one-hot vector for the true context word.
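In NumPy, a sketch of those gradients might look like this. I’m assuming U stores the outside vectors as rows, which determines where the transpose goes; treat it as an illustration rather than the gist’s exact code:

```python
import numpy as np

def softmax(x):
    """Numerically stable Softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def naive_softmax_gradients(center_vec, outside_idx, outside_vectors):
    """Gradients of -log P(o | c) with respect to the center vector and all outside vectors."""
    y_hat = softmax(outside_vectors @ center_vec)    # predicted distribution over the vocabulary
    y = np.zeros_like(y_hat)
    y[outside_idx] = 1.0                             # one-hot vector for the true outside word

    grad_center = outside_vectors.T @ (y_hat - y)    # shape (100,): gradient for v_c
    grad_outside = np.outer(y_hat - y, center_vec)   # shape (vocab_size, 100): gradient for U
    return grad_center, grad_outside
```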
Overall, the algorithm works like so:
For each iteration:
- Calculate the loss and gradients of a mini-batch (e.g. of size 50).
- Update model parameters like so:
theta = theta - learning_rate * gradients
In order to calculate the gradients of a mini-batch of size 50, you iterate 50 times, and during each iteration, you:
- Get a random center word and its context words.
- For each context word, calculate the gradients (by taking the derivative of the loss function).
- Aggregate all the gradients of the context words.
The aggregated gradients of a mini-batch are then used to update theta.
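Putting the pieces together, a stripped-down version of this training loop could look roughly like the following. Here, sample_window and pair_gradients are placeholders for parts covered by the gist (for example, the gradient function sketched above):

```python
import numpy as np

def train(center_vectors, outside_vectors, sample_window, pair_gradients,
          iterations=1000, batch_size=50, learning_rate=0.3):
    """Mini-batch SGD sketch. `sample_window` returns a random (center index, context indices)
    pair; `pair_gradients` returns the gradients for one (center, context) word pair."""
    for _ in range(iterations):
        grad_center = np.zeros_like(center_vectors)
        grad_outside = np.zeros_like(outside_vectors)

        for _ in range(batch_size):
            c, context = sample_window()              # a random center word and its context words
            for o in context:
                g_v, g_u = pair_gradients(center_vectors[c], o, outside_vectors)
                grad_center[c] += g_v                 # aggregate gradients across the batch
                grad_outside += g_u

        # theta = theta - learning_rate * gradients (averaged over the batch)
        center_vectors -= learning_rate * grad_center / batch_size
        outside_vectors -= learning_rate * grad_outside / batch_size
```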
Here’s a Github gist with some of the relevant code. Note: I’ve excluded some parts, like parsing the corpus and the code for the Softmax equation.
The algorithm I just described is called the Skip-Gram algorithm. In Skip-Gram, you predict the probability of a context word given a center word. There’s also another algorithm called CBOW (Continuous Bag of Words), where you instead predict the probability of a center word given its context words.
As I’ve mentioned, the loss function we used is expensive to compute. There’s another technique for training Word2Vec called "negative sampling." In negative sampling, instead of the cross-entropy loss (i.e. negative log likelihood of Softmax function), you train a binary logistic regression for a true pair vs. several noise pairs. The true pair is a center word and one of its context words. A noise pair is the same center word and a random word. In negative sampling, the loss function is much more efficient since you’re not summing over the whole vocabulary.
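As a sketch, here is how the negative-sampling loss for one true pair and k noise pairs could be computed. I sample noise words uniformly for simplicity (and without excluding the true word), whereas the original Word2Vec implementation samples from a smoothed unigram distribution:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vec, outside_idx, outside_vectors, k=10, rng=None):
    """Binary logistic loss: one true (center, context) pair vs. k random noise pairs."""
    if rng is None:
        rng = np.random.default_rng()
    noise_idx = rng.integers(len(outside_vectors), size=k)   # simplified: uniform noise words

    u_o = outside_vectors[outside_idx]
    loss = -np.log(sigmoid(u_o @ center_vec))                # pull the true pair together
    for u_k in outside_vectors[noise_idx]:
        loss -= np.log(sigmoid(-u_k @ center_vec))           # push the noise pairs apart
    return loss                                              # sums over k+1 words, not the vocabulary
```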

Part 2. How Word2Vec solves analogies
Man is to king as woman is to what?
When you ask a Word2Vec algorithm to solve this analogy, you’re essentially asking it to find a word that’s most SIMILAR to "king" and "woman", but most DISSIMILAR to "man." The similarity here refers to the cosine similarity of two vectors.

Thus, after training, the word vector for "queen" is most similar to the vectors for "king" and "woman", but dissimilar to "man."
Additionally, since similar words are clustered together in this high-dimensional space, you can think of the vector for "queen" as approximately the vector for "king" minus the vector for "man" plus the vector for "woman."
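Here is a sketch of that lookup, assuming you have a trained matrix word_vectors (one row per word) and word2idx/idx2word mappings; these names are my own:

```python
import numpy as np

def solve_analogy(a, b, c, word_vectors, word2idx, idx2word):
    """a is to b as c is to what? E.g. solve_analogy('man', 'king', 'woman', ...) should return 'queen'."""
    target = word_vectors[word2idx[b]] - word_vectors[word2idx[a]] + word_vectors[word2idx[c]]
    # Cosine similarity between the target vector and every word vector.
    sims = (word_vectors @ target) / (
        np.linalg.norm(word_vectors, axis=1) * np.linalg.norm(target) + 1e-8
    )
    for idx in np.argsort(-sims):              # most similar words first
        if idx2word[idx] not in (a, b, c):     # skip the three query words themselves
            return idx2word[idx]
```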


And that’s how we can train computers to solve analogies!
Finding similar words and solving analogies are just two basic applications of Word2Vec. Word2Vec embeddings can also be used for more complex tasks such as sentiment analysis. For sentiment analysis, the features would be the average over all the vectors of the words in a sentence. The label would, of course, be a sentiment (positive, negative, neutral). Then you could do multi-class classification with classifiers such as Random Forest, Softmax Regression, or even a neural network!
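For instance, a sentence-level feature vector could be built roughly like this, reusing the hypothetical word_vectors and word2idx from the previous sketch (unknown words are simply skipped):

```python
import numpy as np

def sentence_features(sentence, word_vectors, word2idx):
    """Average the Word2Vec vectors of the words in a sentence into one feature vector."""
    vecs = [word_vectors[word2idx[w]] for w in sentence.lower().split() if w in word2idx]
    if not vecs:
        return np.zeros(word_vectors.shape[1])   # no known words: fall back to a zero vector
    return np.mean(vecs, axis=0)                 # one 100-dimensional feature per sentence

# These features can then be fed to any multi-class classifier,
# e.g. scikit-learn's RandomForestClassifier or LogisticRegression.
```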
Stay tuned for more articles on NLP!