
NLP Illustrated, Part 3: Word2Vec

An exhaustive and illustrated guide to Word2Vec with code!

Welcome to Part 3 of our illustrated journey through the exciting world of Natural Language Processing! If you caught Part 2, you’ll remember that we chatted about word embeddings and why they’re so cool.

NLP Illustrated, Part 2: Word Embeddings

Word embeddings allow us to create maps of words that capture their nuances and intricate relationships.

This article will break down the math behind building word embeddings using a technique called Word2Vec – a Machine Learning model specifically designed to generate meaningful word embeddings.

Word2Vec offers two methods – Skip-gram and CBOW – but we’ll focus on how the Skip-gram method works, as it’s the most widely used.

These words and concepts might sound complex right now but don’t worry – at its core, it’s just some intuitive math (and a sprinkle of machine learning magic).

Real quick – before diving into this article, I strongly encourage you to read my series on the basics of machine learning. A couple of concepts (like gradient descent and loss functions) build on those fundamentals, and understanding them will make this article much easier to follow.

Machine Learning Starter Pack

That said, don’t worry if you’re unfamiliar with those concepts – this article will cover them at a high level to ensure you can still follow along!


Word2Vec is a machine learning model, and like any ML model, it needs two things:

  • Training data: text data to learn from
  • A problem statement: the question the model is trying to answer

Training data

We’re trying to create a map of words, so our training data is going to be text. Let’s start with this sentence:

This will be our toy training data. Of course, in the real world, Word2Vec is trained on massive corpora of text – think entire books, Wikipedia, or large collections of websites. For now though, we’re keeping it simple with just this one sentence, so the model will only learn embeddings for these 18 words.

A problem statement

For Word2Vec, the core problem is simple: given two words, determine whether they are neighbors.

To define "neighbors," we use something called a context window, which specifies how many neighboring words on either side to consider.

For instance, if we want to find the neighbors of the word "happiness"…

…and set the context window size to 2, the neighbors of "happiness" will be "can" and "be".

And here, if we input "happiness" and "can" into the model, ideally we want it to predict that they are neighbors.

Similarly, for the word "darkness," with a context window of 2, the neighbors would be "in" and "the" (before), and "of" and "times" (after).

If we set our context window to 3, the neighbors for "happiness" will be three words on either side.

Terminology segue: Here "happiness" is referred to as the target word, while the neighboring words are known as the context words.

By default, the context window size in Word2Vec is set to 5. However, for simplicity in our example, we’ll use a context window size of 2.
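To make the context window concrete, here's a minimal sketch that pulls out the (target, context) pairs from a tokenized sentence. The token list below is just a stand-in built from the words mentioned above, not the full training sentence:

# a minimal sketch: collect (target, context) pairs using a context window
tokens = ["happiness", "can", "be", "even", "in", "the", "darkness", "of", "times", "light"]
WINDOW = 2

positive_pairs = []
for i, target in enumerate(tokens):
    # the context words are the WINDOW tokens on either side of the target
    for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
        if j != i:
            positive_pairs.append((target, tokens[j]))

print(positive_pairs[:4])
# [('happiness', 'can'), ('happiness', 'be'), ('can', 'happiness'), ('can', 'be')]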

Now, we need to convert this sentence into a neat little table, just like we do for other machine learning problems, with clearly defined inputs and output values.

We can construct this dataset by pairing the target word with each of its context words as inputs…

…and the output will be a label indicating whether the target and context words are neighbors:

1 indicates that they are neighbors

But there’s a glaring issue with this. All our training pairs are positive examples (neighbors), which doesn’t teach the model what non-neighbors look like.

Enter Negative Sampling.

Negative Sampling introduces pairs of words that are not neighbors. So for instance, we know that "happiness" and "light" are not neighbors, so we add that data to our training data with the label 0 to indicate that they are not neighbors.

By adding negative samples, the final dataset contains a mix of positive and negative pairs so that the model can learn to predict whether a given pair is a true neighbor or not.

Typically, we use 2–5 negative samples per positive pair for large datasets and up to 10 for smaller ones.

We’ll use 2 negative pairs per positive pair. Our training dataset now looks like this:
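Here's a rough sketch of how those negative samples could be drawn. For simplicity it samples words outside the target's context window uniformly at random; real Word2Vec draws negatives from a smoothed word-frequency distribution:

import random

random.seed(0)
NEG_PER_POS = 2  # negative samples per positive pair

def build_dataset(tokens, positive_pairs, window=2):
    # label positive pairs 1, then add NEG_PER_POS random non-neighbor pairs labeled 0
    dataset = []
    for target, context in positive_pairs:
        dataset.append((target, context, 1))
        i = tokens.index(target)
        neighbors = set(tokens[max(0, i - window): i + window + 1])
        candidates = [w for w in tokens if w not in neighbors]
        for negative in random.sample(candidates, NEG_PER_POS):
            dataset.append((target, negative, 0))
    return dataset

Calling build_dataset(tokens, positive_pairs) with the pairs from the previous sketch produces rows like ("happiness", "can", 1), each followed by two rows labeled 0.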

Now comes the fun part – the machine learning magic. Here’s the problem we’re solving: Given a target word and a context word, predict the probability that they are neighbors.

Let’s break it down step by step.

Step 0: Decide embedding dimensions

The first thing we do is to decide the size of the word embeddings. As we’ve learned, larger embeddings capture more nuances and richer relationships but come at the cost of increased computational expense.

The default embedding size in Word2Vec is 100 dimensions, but to keep the explanation simple, let’s use just 2 dimensions.

This means each word will be represented as a point on a 2D graph like so:

Step 1: Initialize embedding matrices

Next, we initialize two distinct sets of embeddings – target embeddings and context embeddings.

And, at the start of training, these embeddings are randomly initialized with values:

The target embeddings and context embeddings are randomly initialized with different values because they serve distinct purposes.

  • Target Embeddings: Represent each word when it’s the target word in training
  • Context Embeddings: Represent each word when it’s a context (neighboring) word
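A minimal sketch of this initialization step, assuming 2-dimensional embeddings and the stand-in vocabulary from earlier (the exact starting numbers don't matter, since they're random):

import numpy as np

rng = np.random.default_rng(7)
EMBEDDING_DIM = 2
vocab = ["happiness", "can", "be", "even", "in", "the", "darkness", "of", "times", "light"]

# two separate embedding tables: one used when a word is the target,
# one used when it appears as a context word
target_embeddings = {word: rng.uniform(-1, 1, EMBEDDING_DIM) for word in vocab}
context_embeddings = {word: rng.uniform(-1, 1, EMBEDDING_DIM) for word in vocab}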

Step 2: Calculate the similarity of target word and context word

In the training process, we work with blocks consisting of one positive pair and its corresponding negative samples.

So in the first pass, we only focus on the first positive pair and its corresponding 2 negative samples.

Now we can determine how similar two words are by calculating the dot product of their embeddings: the target embedding (if it's a target word) and the context embedding (if it's a context word).

  • A larger dot product indicates the words are more "similar" (likely neighbors)
  • A smaller dot product suggests they are more dissimilar (less likely to be neighbors)

And remember, in the first pass, we only calculate the similarity of the 3 pairs in the first block.

Let’s start by taking the dot product of the target word embedding of "happiness" with the context word embedding of "can":

We get:

Now we need to find a way to convert these scores into probabilities, because we want to know how likely it is that these two words are neighbors. We can do that by passing the dot product through a sigmoid function.

As a quick refresher, the sigmoid function squishes any input value into a range between 0 and 1, making it perfect for interpreting probabilities. If the dot product is large (indicating high similarity), the sigmoid output will be close to 1, and if the dot product is small (indicating low similarity), the sigmoid output will be closer to 0.

So passing the dot product, -0.36, through the sigmoid function, we get:
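In code, this step is just a dot product followed by a sigmoid. The value -0.36 below is the dot product from the example above; the actual number depends on whatever the random initialization happened to be:

import numpy as np

def sigmoid(x):
    # squishes any value into the (0, 1) range so it can be read as a probability
    return 1 / (1 + np.exp(-x))

# dot product of the "happiness" target embedding and the "can" context embedding
dot = -0.36
print(round(sigmoid(dot), 2))  # ~0.41: the predicted probability that they are neighbors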

Similarly, we can calculate the dot product and corresponding probabilities for the other two pairs…

…to get the predicted probability that "happiness" and "light" are neighbors…

…and the predicted probability that "happiness" and "even" are neighbors:

This is how we calculate the model’s predicted probabilities of these 3 pairs being neighbors.

As we can see, the predicted values are pretty random and inaccurate, which makes sense because the embeddings were initialized with random values.

Next, we move on to the key step: updating these embeddings to improve the predictions.

Step 3: Calculate error

NOTE: If you haven’t read the article on Logistic Regression, it might be helpful to do so, as the process of calculating error there is very similar. But don’t worry, we’ll also go over the basics here.

Now that we have our predictions, we need to calculate the "error" value to measure how far off the model’s predictions are from the true labels. For this, we use the Log Loss function.

For every prediction, the error is calculated as:

And the overall Log Loss for all predictions in the block is the average of the individual prediction errors:

For our example, if we calculate the loss for the 3 pairs above, it will look like this:

Evaluating this…

…we get 0.3. Our goal is to reduce this loss to 0 or as close to 0 as possible. A loss of 0 means that the model’s predictions perfectly match the true labels.
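As a small sketch, the same calculation in code looks like this. The probabilities for the two negative pairs are hypothetical placeholders (the real values come from the figures above), but the formula is the standard log loss:

import numpy as np

def log_loss(labels, preds):
    # average of -[y*log(p) + (1 - y)*log(1 - p)] over all predictions in the block
    labels, preds = np.asarray(labels, float), np.asarray(preds, float)
    return -np.mean(labels * np.log(preds) + (1 - labels) * np.log(1 - preds))

labels = [1, 0, 0]          # one positive pair plus its two negative samples
preds = [0.41, 0.15, 0.20]  # hypothetical predicted probabilities, for illustration
print(round(log_loss(labels, preds), 2))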

Step 4: Update embeddings using gradient descent

Again, we won't dive into the details here since we covered them in our previous article on Logistic Regression. The key idea is that the best way to minimize the loss function is by using gradient descent.

To put it simply, Log Loss is a convex function…

…and gradient descent helps us find the lowest point on this curve – the point where the loss is minimized.

It does so by:

  • calculating the gradient (the slope) of the loss function with respect to the embeddings and
  • adjusting the embeddings slightly in the opposite direction of the gradient to reduce the loss
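Concretely, with a sigmoid output and log loss, the gradient with respect to each embedding works out to (prediction minus label) times the other word's embedding, so a single update step could look like this sketch:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def update_pair(target_vec, context_vec, label, lr=0.05):
    # one gradient-descent step for a single (target, context, label) pair
    pred = sigmoid(np.dot(target_vec, context_vec))
    error = pred - label                       # gradient "signal": how far off we are
    new_target = target_vec - lr * error * context_vec
    new_context = context_vec - lr * error * target_vec
    return new_target, new_context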

So once gradient descent works its magic, we get new embeddings like so:

Let’s visualize this change. We start with the target embedding ("happiness") and the context embeddings ("can", "light", and "even") in our block.

And after gradient descent, they shift slightly like so:

This is the REAL magic of this step. We see that automatically:

  • for the positive pair, the target embedding of "happiness" is nudged closer to the context embedding of "can," its neighbor
  • and for the negative pairs, the target embedding ("happiness") is adjusted to move further away from the non-neighboring context embeddings of "light" and "even"

Step 5: Repeat steps 2–4

Now all we have to do is rinse and repeat steps 2–4 using the next block of positive and negative pairs.

Let’s see what this looks like for the second block.

For these values, we determine the model’s predictions of whether the words are neighbors or not by:

(1) Taking dot products and passing them through the sigmoid function…

(2) And then, using the Log Loss and gradient descent, we update the target and context embedding values for the words in this block:

Again, doing so nudges the neighboring word embeddings closer together while pushing the dissimilar ones farther apart.

That’s pretty much it. We just repeat these steps with each block in our training data.

Sidenote: Going through all blocks in the training dataset once is called an epoch. We usually repeat this for 5–20 epochs for a super robust training process.
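Putting steps 2 through 4 together, a bare-bones training loop might look like the sketch below. The vocabulary and training rows are a tiny hypothetical stand-in, just to show the shape of the loop:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(42)
vocab = ["happiness", "can", "light", "even"]           # tiny stand-in vocabulary
target_emb = {w: rng.uniform(-1, 1, 2) for w in vocab}
context_emb = {w: rng.uniform(-1, 1, 2) for w in vocab}

# one block: a positive pair followed by its two negative samples
training_data = [("happiness", "can", 1), ("happiness", "light", 0), ("happiness", "even", 0)]

EPOCHS, LR = 20, 0.1
for epoch in range(EPOCHS):
    for target, context, label in training_data:
        t, c = target_emb[target], context_emb[context]
        pred = sigmoid(np.dot(t, c))                    # similarity -> probability
        error = pred - label                            # how far off the prediction is
        target_emb[target] = t - LR * error * c         # nudge both embeddings
        context_emb[context] = c - LR * error * t

print(target_emb["happiness"])  # the learned 2-D embedding for "happiness"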

By the end of the full training process, we’ll end up with final target and context embeddings that look something like this:

If we get rid of the context embeddings, we are left with just the final target embeddings.

And these final target embeddings are the word embeddings that we were after at the beginning!

SIDENOTE: If needed, the context embeddings could be averaged or combined with the target embeddings to create a hybrid representation. However, this is rare and not standard practice.

This happens because the training process refines the embeddings based on word relationships: similar words (neighbors) are pulled closer together, while dissimilar words (non-neighbors) are pushed apart. In doing so, it also ends up capturing deeper relationships between words, including synonyms, analogies, and subtle contextual similarities.

Here, our training data was just a single sentence with 18 words, so the embeddings may not seem all that meaningful. But imagine training on a massive corpus – an entire book, a collection of articles, or billions of sentences from the web – and the same process produces embeddings that capture genuinely rich relationships between words.

And that’s it! That’s how we create word embeddings using Word2Vec, specifically the skip-gram method.

Word2Vec IRL

Now that we’ve unpacked the mathematical magic behind Word2Vec, let’s bring it to life and create our own word embeddings.

Use pre-trained word embeddings

The easiest and most efficient way to get started is to use pre-trained word embeddings. These embeddings are already trained on massive datasets like Google News and Wikipedia, so they’re incredibly robust. This means we don’t have to start from scratch, saving both time and computational resources.

We can load pre-trained Word2Vec embeddings using Gensim, a popular Python library for NLP that’s optimized for handling large-scale text processing tasks.

# install gensim 
# !pip install --upgrade gensim

import gensim.downloader as api

Let’s look at all available pre-trained Word2Vec models in Gensim:

available_models = api.info()['models']

print("Available pre-trained Word2Vec models in Gensim:n")
for model_name, details in available_models.items():
    if 'word2vec' in model_name.lower():  # find models with 'word2vec' in their name
        print(f"Model: {model_name}")
        print(f"  - Description: {details.get('description')}")
Available pre-trained Word2Vec models in Gensim:

Model: word2vec-ruscorpora-300
  - Description: Word2vec Continuous Skipgram vectors trained on full Russian National Corpus (about 250M words). The model contains 185K words.
Model: word2vec-google-news-300
  - Description: Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality' (https://code.google.com/archive/p/word2vec/).
Model: __testing_word2vec-matrix-synopsis
  - Description: [THIS IS ONLY FOR TESTING] Word vecrors of the movie matrix.

We see that there are two usable pre-trained models (since one of them is labeled as test-only). Let’s put the word2vec-google-news-300 model to the test!
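First, we download and load the model with Gensim's downloader (heads up: the Google News vectors are a large download, well over a gigabyte):

# download and load the pre-trained Google News vectors (large download)
w2v_google_news = api.load("word2vec-google-news-300")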

Here’s how to find synonyms of the word "beautiful":

w2v_google_news.most_similar("king")
[('gorgeous', 0.8353005051612854),
 ('lovely', 0.8106936812400818),
 ('stunningly_beautiful', 0.7329413294792175),
 ('breathtakingly_beautiful', 0.7231340408325195),
 ('wonderful', 0.6854086518287659),
 ('fabulous', 0.6700063943862915),
 ('loveliest', 0.6612576246261597),
 ('prettiest', 0.6595001816749573),
 ('beatiful', 0.6593326330184937),
 ('magnificent', 0.6591402888298035)]

These all make sense.

If you recall from the previous article, we saw how we can perform mathematical operations on word embeddings to get intuitive results. One of the most popular examples of this is…

…which we can test like so:

# king + woman - man
w2v_google_news.most_similar_cosmul(positive=['king', 'woman'], negative=['man'])

The results are impressively accurate!

Let’s try another combination:

# better + bad - good
w2v_google_news.most_similar_cosmul(positive=['better', 'bad'], negative=['good'])
[('worse', 0.9141383767127991),
 ('uglier', 0.8268526792526245),
 ('sooner', 0.7980951070785522),
 ('dumber', 0.7923389077186584),
 ('harsher', 0.791556715965271),
 ('stupider', 0.7884790301322937),
 ('scarier', 0.7865160703659058),
 ('angrier', 0.7857241034507751),
 ('differently', 0.7801468372344971),
 ('sorrier', 0.7758733034133911)]

And "worse" is the top match! Very cool.

As we can see, these pre-trained models are incredibly robust and can be leveraged for most use cases. However, they’re not perfect for every situation. For instance, if we’re working with niche domains like legal or medical texts, general-purpose embeddings may fail to capture the specific meanings and nuances of the language.

Say we have this legal text:

"The appellant seeks declaratory relief under Rule 57, asserting that the respondent’s fiduciary duty was breached by non-disclosure of material facts in accordance with Section 10(b) of the Securities Exchange Act of 1934."

Legal documents are often written in a formal, highly structured style, with terms like "Rule 57" or "Section 10(b)" referencing specific laws and statutes. Words like "material facts" have a precise legal meaning – facts that can influence the outcome of a case – which is very different from how "material" is understood in everyday language.

Pre-trained embeddings trained on general corpora, such as Google News, won’t capture these nuanced, domain-specific meanings. Instead, for tasks like this, we need embeddings trained on domain-specific corpora, such as legal judgments, statutes, or contracts.

Code our own Word2Vec from scratch

This is where building our own Word2Vec model is helpful. By training on a legal corpus, we can create embeddings tailored to our use case, capturing the relationships and meanings specific to the legal domain.
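Here's a minimal sketch of what that could look like using Gensim's built-in Word2Vec implementation (rather than coding every training step by hand). The corpus below is just a couple of placeholder sentences standing in for a real collection of tokenized legal documents, and the hyperparameters mirror the ones we discussed (skip-gram, context window, negative sampling):

from gensim.models import Word2Vec

# stand-in corpus: in practice this would be thousands of tokenized legal documents
legal_corpus = [
    ["the", "appellant", "seeks", "declaratory", "relief", "under", "rule", "57"],
    ["the", "respondent", "breached", "its", "fiduciary", "duty", "by",
     "non-disclosure", "of", "material", "facts"],
]

model = Word2Vec(
    sentences=legal_corpus,
    vector_size=100,   # embedding dimensions (the Word2Vec default)
    window=5,          # context window size
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # negative samples per positive pair
    min_count=1,       # keep every word (our toy corpus is tiny)
    epochs=20,
)

# look up the learned embedding for a word
print(model.wv["fiduciary"][:5])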


And just like that we’re done! You now know everything you need to know about Word2Vec.

As always, feel free to connect with me on LinkedIn or email me at shreya.Statistics@gmail.com!

Unless specified, all images are by the author.

