
Word embeddings are one of the most fundamental concepts in Deep Natural Language Processing, and word2vec is one of the earliest algorithms used to train them.
In this post, I want to go deeper into the first paper on word2vec – Efficient Estimation of Word Representations in Vector Space (2013), which as of now has 24k citations, and this number is still growing.
Our plan is the following:
- Review model architectures described in the paper;
- Train word2vec model from scratch using PyTorch;
- And evaluate the word embeddings we get.
My GitHub project with the word2vec training code accompanies this post; we will go through it below.
Today we are reviewing only the first paper on word2vec. However, there are several later papers describing the evolution of word2vec:
- Distributed Representations of Words and Phrases and their Compositionality (2013) describes several extensions to the original word2vec to speed up training and improve embedding quality.
- Distributed Representations of Sentences and Documents (2014) shows how to use the idea behind word2vec to create sentence and document embeddings. This approach is known as doc2vec.
- Enriching Word Vectors with Subword Information (2017) introduces even more extensions to word2vec. This approach operates on character n-grams (subwords) rather than whole words and is known as fastText.
I believe that if you understand the first paper, you’ll easily catch the ideas described in the later ones. So let’s go!
Disclosure. Word2vec is already an old algorithm and there are more recent options (for instance, [BERT](https://en.wikipedia.org/wiki/BERT_(language_model))). This post is for those who have just started their journey into Deep NLP, or for those who are interested in reading and implementing papers.
Contents
- What is word2vec?
- Model Architecture
- Data
- Data Preparation
- Text Processing with PyTorch
- Training Details
- Retrieving Embeddings
  - Visualization with t-SNE
  - Similar Words
  - King – Man + Woman = Queen
- What’s Next?
What is word2vec?
Here is my 3-sentence explanation:
- Word2vec is an approach to create word embeddings.
- Word embedding is a representation of a word as a numeric vector.
- Besides word2vec, there are other methods to create word embeddings, such as fastText, GloVe, ELMo, BERT, GPT-2, etc.
If you are not familiar with the concept of word embeddings, below are links to several great resources. Read through them, skipping the details but grasping the intuition, and then come back to this post for the word2vec details and coding.
- Why do we use word embeddings in NLP by Natasha Latysheva
- The Illustrated Word2vec by Jay Alammar
- An introduction to word embeddings for text analysis by Shane Lynn
Better now?
Word embeddings are used literally in every NLP task – text classification, named-entity recognition, question answering, text summarization, etc. Everywhere. Models do not understand words and letters, they understand numbers. That’s where word embeddings come in handy.
Model Architecture
Word2vec is based on the idea that a word’s meaning is defined by its context. Context is represented as surrounding words.
Think about it. Assume you are learning a new language. You are reading a sentence and all the words there are familiar to you, except one. You’ve never seen this word before, but you can easily tell its part of speech, right? And sometimes you can even guess its meaning. That’s because the information from the surrounding words helps you.
For the word2vec model, context is represented as N words before and N words after the current word. N is a hyperparameter. With larger N we can create better embeddings, but at the same time, such a model requires more computational resources. In the original paper, N is 4–5, and in my visualizations below, N is 2.

There are two word2vec architectures proposed in the paper:
- CBOW (Continuous Bag-of-Words) – a model that predicts a current word based on its context words.
- Skip-Gram – a model that predicts context words based on the current word.
For instance, the CBOW model takes "machine", "learning", "a", "method" as inputs and returns "is" as an output. The Skip-Gram model does the opposite.
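To make the pairing concrete, here is a minimal sketch (not code from the repository) of how training pairs could be generated from this example sentence with N=2:

N = 2
tokens = "machine learning is a method of data analysis".split()

cbow_pairs = []       # (context words, middle word)
skipgram_pairs = []   # (input word, context word)

for i, middle in enumerate(tokens):
    context = tokens[max(0, i - N):i] + tokens[i + 1:i + N + 1]
    cbow_pairs.append((context, middle))
    skipgram_pairs.extend((middle, c) for c in context)

print(cbow_pairs[2])       # (['machine', 'learning', 'a', 'method'], 'is')
print(skipgram_pairs[:2])  # [('machine', 'learning'), ('machine', 'is')]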
Both CBOW and Skip-Gram models are multi-class classification models by definition. Detailed visualizations below should make it clear.


What is happening in the black box?
The initial step would be to encode all words with their IDs. ID is an integer (index) that identifies word position in the vocabulary. "Vocabulary" is a term to describe a set of unique words in the text. This set may be all words in the text or just the most frequent ones. More on that in Section "Data Preparation".
Word2vec model is very simple and has only two layers:
- Embedding layer, which takes a word ID and returns its 300-dimensional vector. Word2vec embeddings are 300-dimensional, as the authors found this size to be a good trade-off between embedding quality and computational cost. You may think of the Embedding layer as a simple lookup table with learnable weights, or as a linear layer without bias or activation.
- Then comes the Linear (Dense) layer with a Softmax activation. We create a model for a multi-class classification task, where the number of classes is equal to the number of words in the vocabulary.
The difference between the CBOW and Skip-Gram models is in the number of input words. The CBOW model takes several words; each goes through the same Embedding layer, and the word embedding vectors are averaged before going into the Linear layer. The Skip-Gram model takes a single word instead. Detailed architectures are in the images below.


Where are the word embeddings?
We train the models that are not going to be used directly. We don’t want to predict a word from its context or context from a word. Instead, we want to get word vectors. It turns out that these vectors are weights of the Embedding layer. More details on that are in Section "Retrieving Embeddings".
Data
Word2vec is an unsupervised algorithm, so we need only a large text corpus. Originally, word2vec was trained on Google News corpus, which contains 6B tokens.
I’ve experimented with smaller datasets available in PyTorch:
- WikiText-2: 36k text lines and 2M tokens in train part (tokens are words + punctuation)
- WikiText103: 1.8M lines and 100M tokens in train part
When training word embeddings for a commercial or research task, choose the dataset carefully. For instance, if you’d like to classify Machine Learning papers, train word2vec on scientific texts about Machine Learning. If you’d like to classify fashion articles, a dataset of fashion news would be a better fit. That’s because the word "model" means "approach" or "algorithm" in the Machine Learning domain, but "person" or "woman" in the fashion domain.
When reusing trained word embeddings, pay attention to the dataset they were trained on and whether this dataset is appropriate for your task.
Data Preparation
The main step in data preparation is to create a vocabulary. The vocabulary contains the words for which embeddings will be trained. Vocabulary may be the list of all the unique words within a text corpus, but usually, it is not.
It is better to create vocabulary:
- Either by filtering out rare words that occur fewer than N times in the corpus;
- Or by choosing the top N most frequent words.
Such filtering makes sense because a smaller vocabulary makes the model faster to train. Besides, you probably do not want to use embeddings for words that appeared only once in the corpus, as those embeddings may not be good enough. To create good word embeddings, the model should see a word several times and in different contexts.
Each word in the vocabulary has its own unique index. Words in the vocabulary may be sorted alphabetically or by frequency, or not sorted at all – it should not affect model training. Vocabulary is usually represented as a dictionary data structure:
vocab = {
    "a": 1,
    "analysis": 2,
    "analytical": 3,
    "automates": 4,
    "building": 5,
    "data": 6,
    ...
}
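As a minimal illustration of the frequency-filtering approach above (the actual vocabulary in this post is built with torchtext, shown later), such a dictionary could be constructed like this:

from collections import Counter

MIN_WORD_FREQUENCY = 50  # keep only words that occur at least 50 times

def build_vocab_dict(tokenized_texts):
    # Count token occurrences over the whole corpus and keep the frequent ones
    counter = Counter(token for text in tokenized_texts for token in text)
    frequent = sorted(w for w, c in counter.items() if c >= MIN_WORD_FREQUENCY)
    # Index 0 is reserved for out-of-vocabulary words
    return {word: idx for idx, word in enumerate(frequent, start=1)}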
Punctuation marks and other special symbols may also be added to the vocabulary, and we train embeddings for them as well. You may lowercase all the words, or train separate embeddings for the words "apple" and "Apple"; in some cases, it may be useful.
Depending on what you want your vocabulary (and word embeddings) to be like, preprocess the text corpus appropriately: lowercase it or not, remove punctuation or not, and tokenize it.

For my model:
- I created the vocabulary only from the words that appeared at least 50 times within the text.
- I used the basic_english tokenizer from PyTorch (torchtext), which lowercases text and splits it into tokens by whitespace, putting punctuation into separate tokens.
So, before words go into the model, they are encoded as IDs. The ID corresponds to the word index in the vocabulary. Words that are not in the vocabulary (out-of-vocabulary words) are encoded with some number, for instance, 0.
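A minimal sketch of this encoding step, using a toy vocabulary like the one above (the real pipeline relies on torchtext, described in the next section):

toy_vocab = {"a": 1, "analysis": 2, "analytical": 3, "automates": 4, "building": 5, "data": 6}
UNK_ID = 0  # ID used for out-of-vocabulary words

tokens = ["data", "analysis", "automates", "everything"]
token_ids = [toy_vocab.get(token, UNK_ID) for token in tokens]
print(token_ids)  # [6, 2, 4, 0] -- "everything" is out of vocabulary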

Text Processing with PyTorch
The full code for training word2vec is here. Let’s go through important steps.
Models are created in PyTorch by subclassing from nn.Module. As described previously, both CBOW and Skip-Gram models have 2 layers: Embedding and Linear.
Below is the model class for CBOW, and here is for Skip-Gram.
import torch.nn as nn

EMBED_DIMENSION = 300
EMBED_MAX_NORM = 1


class CBOW_Model(nn.Module):
    def __init__(self, vocab_size: int):
        super(CBOW_Model, self).__init__()
        self.embeddings = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=EMBED_DIMENSION,
            max_norm=EMBED_MAX_NORM,
        )
        self.linear = nn.Linear(
            in_features=EMBED_DIMENSION,
            out_features=vocab_size,
        )

    def forward(self, inputs_):
        x = self.embeddings(inputs_)
        x = x.mean(axis=1)
        x = self.linear(x)
        return x
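For comparison, a minimal Skip-Gram counterpart could look like the sketch below; it is consistent with the CBOW class above, though the exact class in the repository may differ in details.

class SkipGram_Model(nn.Module):
    # A sketch of the Skip-Gram architecture: a single input word ID goes
    # through the same two layers, without the averaging step.
    def __init__(self, vocab_size: int):
        super(SkipGram_Model, self).__init__()
        self.embeddings = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=EMBED_DIMENSION,
            max_norm=EMBED_MAX_NORM,
        )
        self.linear = nn.Linear(
            in_features=EMBED_DIMENSION,
            out_features=vocab_size,
        )

    def forward(self, inputs_):
        x = self.embeddings(inputs_)  # (batch_size, 300)
        x = self.linear(x)            # (batch_size, vocab_size)
        return x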
Pay attention: there is no Softmax activation after the Linear layer. That’s because PyTorch’s CrossEntropyLoss expects predictions to be raw, unnormalized scores (logits) and applies log-softmax internally. In Keras, by contrast, you can choose whether the loss takes raw values (logits) or probabilities.
Model input is word ID(s). Model output is an N-dimensional vector, where N is vocabulary size.
EMBED_MAX_NORM is the parameter that restricts word embedding norms (to be 1, in our case). It works as a regularization parameter and prevents the weights in the Embedding layer from growing uncontrollably. EMBED_MAX_NORM is worth experimenting with. What I’ve seen: when restricting the embedding vector norm, similar words like "mother" and "father" have higher cosine similarity, compared to when EMBED_MAX_NORM=None.
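Here is a quick sanity check of the input and output shapes, and of feeding the raw scores directly into CrossEntropyLoss; it is a toy example with a made-up vocabulary size, not code from the repository.

import torch

model = CBOW_Model(vocab_size=100)            # toy vocabulary size
context_ids = torch.randint(0, 100, (16, 8))  # batch of 16 samples, 8 context word IDs each
middle_ids = torch.randint(0, 100, (16,))     # target (middle) word IDs

logits = model(context_ids)                   # shape: (16, 100) -- raw scores, no softmax
loss = nn.CrossEntropyLoss()(logits, middle_ids)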
We create the vocabulary from the dataset iterator using the torchtext function build_vocab_from_iterator. The WikiText-2 and WikiText103 datasets have rare words replaced with the <unk> token; we add this token as a special symbol with ID=0 and also encode all out-of-vocabulary words with ID=0.
from torchtext.vocab import build_vocab_from_iterator

MIN_WORD_FREQUENCY = 50


def build_vocab(data_iter, tokenizer):
    vocab = build_vocab_from_iterator(
        map(tokenizer, data_iter),
        specials=["<unk>"],
        min_freq=MIN_WORD_FREQUENCY,
    )
    vocab.set_default_index(vocab["<unk>"])
    return vocab
We create the Dataloader with a collate_fn. This function implements the logic of how to batch individual samples. When looping through the PyTorch WikiText-2 and WikiText103 datasets, each retrieved sample is a text paragraph.
For instance, in collate_fn for CBOW we "say":
- Take each text paragraph.
- Lowercase it, tokenize it, and encode it with IDs (function text_pipeline).
- If the paragraph is too short – skip it. If too long – truncate it.
- With a moving window of size 9 (4 history words, the middle word, and 4 future words), loop through the paragraph.
- Merge all middle words into a list – they will be the Ys.
- Merge all contexts (history and future words) into a list of lists – they will be the Xs.
- Merge Xs from all paragraphs together – they will be the batch Xs.
- Merge Ys from all paragraphs together – they will be the batch Ys.
Pay attention: the number of samples in the final batch (Xs and Ys) returned by collate_fn will differ from the batch_size parameter specified in the Dataloader (which counts paragraphs), and it will vary depending on the paragraphs.
The code for collate_fn for CBOW is below; for Skip-Gram, it is here.
import torch

CBOW_N_WORDS = 4
MAX_SEQUENCE_LENGTH = 256


def collate_cbow(batch, text_pipeline):
    batch_input, batch_output = [], []
    for text in batch:
        text_tokens_ids = text_pipeline(text)
        if len(text_tokens_ids) < CBOW_N_WORDS * 2 + 1:
            continue
        if MAX_SEQUENCE_LENGTH:
            text_tokens_ids = text_tokens_ids[:MAX_SEQUENCE_LENGTH]
        for idx in range(len(text_tokens_ids) - CBOW_N_WORDS * 2):
            token_id_sequence = text_tokens_ids[idx : (idx + CBOW_N_WORDS * 2 + 1)]
            output = token_id_sequence.pop(CBOW_N_WORDS)
            input_ = token_id_sequence
            batch_input.append(input_)
            batch_output.append(output)
    batch_input = torch.tensor(batch_input, dtype=torch.long)
    batch_output = torch.tensor(batch_output, dtype=torch.long)
    return batch_input, batch_output
And here is how to use collate_fn with the PyTorch DataLoader:
from functools import partial

from torch.utils.data import DataLoader

dataloader = DataLoader(
    data_iter,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=partial(collate_cbow, text_pipeline=text_pipeline),
)
I’ve also created a Trainer class that is used for model training and validation. It contains a typical PyTorch training and validation flow, so for those who have experience with PyTorch it will look pretty straightforward.
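As a rough idea of what one training epoch inside such a Trainer boils down to, here is a stripped-down sketch (the real class also handles validation, checkpointing, and learning rate scheduling; variable names are illustrative):

import torch.optim as optim

model = CBOW_Model(vocab_size=len(vocab))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.025)

model.train()
for batch_input, batch_output in dataloader:
    optimizer.zero_grad()
    logits = model(batch_input)            # raw scores, shape (n_samples, vocab_size)
    loss = criterion(logits, batch_output)
    loss.backward()
    optimizer.step()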
If you want to understand the code better – I recommend you clone my repository and play with it.
Training Details
Word2vec is trained as a multi-class classification model using Cross-Entropy loss.
Choose a batch size that fits into memory. Just remember that the batch size here is the number of dataset paragraphs, which will be processed into input-output pairs, and that number will be much larger.
The optimizer used in the paper is AdaGrad, but I’ve used a more recent one – Adam.
I’ve skipped the Hierarchical Softmax part of the paper and used plain Softmax. No Huffman tree was used to build the vocabulary either. Hierarchical Softmax and the Huffman tree are tricks for speeding up training, but PyTorch has a lot of optimization under the hood, so training is already fast enough.
As recommended in the paper, I’ve started with a learning rate of 0.025 and decreased it linearly every epoch until it reaches 0 at the end of the last epoch. Here PyTorch LambdaLR scheduler helps a lot; and here is how I used it.
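Here is a sketch of such a linear decay with LambdaLR; the scheduler setup in my repository may differ slightly:

from torch.optim.lr_scheduler import LambdaLR

EPOCHS = 5

# Scale the initial lr by (1 - epoch / EPOCHS): factor 1.0 on the first epoch,
# decreasing linearly to 0 after the last one.
scheduler = LambdaLR(optimizer, lr_lambda=lambda epoch: 1 - epoch / EPOCHS)

# ... call scheduler.step() once at the end of every epoch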
The authors trained the model for only 3 epochs in most experiments (but on a very large dataset). I’ve experimented with smaller and larger numbers and decided to stick with 5 epochs.
For WikiText-2 dataset:
- My vocabulary size was about 4k words (these are words that occurred in the text at least 50 times).
- Training took less than 20 minutes on GPU for both CBOW and Skip-Gram models.
For WikiText103 dataset:
- My vocabulary size was about 50k words.
- The model trained overnight.
Retrieving Embeddings
The full procedure is described in this notebook.
Word embeddings are stored in the Embedding layer. The Embedding layer has size (vocab_size, 300), which means it contains an embedding for every word in the vocabulary.
When trained on the WikiText-2 dataset both CBOW and Skip-Gram models have weights in the Embedding layer of size (4099, 300), where each row is a word vector. Here is how to get Embedding layer weights:
embeddings = list(model.parameters())[0]
And here is how to get words in the same order as in the embedding matrix:
vocab.get_itos()
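As a small convenience, the two can be combined into a word-to-vector mapping; this is a sketch assuming the embeddings and vocab objects from above:

import numpy as np

embeddings_np = embeddings.detach().cpu().numpy()  # shape: (vocab_size, 300)
words = vocab.get_itos()                            # words, index-aligned with the rows

word_to_vec = {word: embeddings_np[idx] for idx, word in enumerate(words)}
print(word_to_vec["father"].shape)  # (300,)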
Before using the embeddings in your model, it’s worth checking whether they were trained properly. There are several options for that:
- Cluster word embeddings and check if related words form separate clusters.
- Visualize word embedding with t-SNE and check whether similar words lie close to each other.
- Find the most similar words for a random word.
Visualization with t-SNE
You may use sklearn t-SNE and plotly to create a 2-component visualization like the one below. Here numeric strings are in green and they form 2 separate clusters.
After zooming in, we may see that this division of numeric strings into 2 clusters makes much sense. The top left cluster is formed from years, while the lower right – from plain numbers.
You may explore this plotly visualization to find even more interesting relations. And the code for creating this visualization is here.
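For reference, here is a minimal sketch of such a visualization with scikit-learn and plotly (the plot in the linked notebook is more elaborate), reusing embeddings_np and words from the previous section:

import plotly.express as px
from sklearn.manifold import TSNE

# Project the 300-dimensional embeddings down to 2 components
embeddings_2d = TSNE(n_components=2, random_state=42).fit_transform(embeddings_np)

fig = px.scatter(
    x=embeddings_2d[:, 0],
    y=embeddings_2d[:, 1],
    hover_name=words,  # show the word on hover
)
fig.show()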
Similar Words
Word similarity is calculated as the cosine similarity between word vectors. The higher the cosine similarity, the more similar the words are assumed to be.
For instance, for the word "father" here are the most similar words in the vocabulary:
#CBOW model trained on WikiText-2
mother: 0.842
wife: 0.809
friend: 0.796
brother: 0.775
daughter: 0.773
#Skip-Gram model trained on WikiText-2
mother: 0.626
brother: 0.600
son: 0.579
wife: 0.563
daughter: 0.542
Code for finding similar words is here.
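For illustration, here is a sketch of what such a lookup could look like (not necessarily identical to the linked code), using the word_to_vec mapping from the previous section:

import numpy as np

def most_similar(word, word_to_vec, topn=5):
    # Rank all vocabulary words by cosine similarity to the query word
    query = word_to_vec[word]
    sims = {}
    for other, vec in word_to_vec.items():
        if other == word:
            continue
        sims[other] = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
    return sorted(sims.items(), key=lambda x: x[1], reverse=True)[:topn]

print(most_similar("father", word_to_vec))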
King – Man + Woman = Queen
According to the paper, a properly trained word2vec model can solve equations "king – man + woman = ?" (answer: "queen"), or "bigger – big + small = ?" (answer: "smaller"), or "Paris – France + Germany = ?" (answer: "Berlin").
Equations are solved by performing mathematical operations on the word vectors: vector("king") – vector("man") + vector("woman"). And the final vector should be the closest to the vector("queen").
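Such an analogy query can be computed roughly as in the sketch below, again assuming the word_to_vec mapping from earlier:

result = word_to_vec["king"] - word_to_vec["man"] + word_to_vec["woman"]

# Rank vocabulary words by cosine similarity to the resulting vector
sims = {
    word: np.dot(result, vec) / (np.linalg.norm(result) * np.linalg.norm(vec))
    for word, vec in word_to_vec.items()
}
print(sorted(sims.items(), key=lambda x: x[1], reverse=True)[:5])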
Unfortunately, I could not reproduce that part. My CBOW and Skip-Gram models trained on WikiText-2 and WikiText103 are not able to catch this kind of relation (code here).
The closest vectors to vector("king") – vector("man") + vector("woman") are:
#CBOW model trained on WikiText-2 dataset
king: 0.757
bishop: 0.536
lord: 0.529
reign: 0.519
pope: 0.501
#Skip-Gram model trained on WikiText-2 dataset
king: 0.690
reign: 0.469
son: 0.453
woman: 0.436
daughter: 0.435
#CBOW model trained on WikiText103 dataset
king: 0.652
woman: 0.494
queen: 0.354
daughter: 0.342
couple: 0.330
The closest vectors to vector("bigger") – vector("big") + vector("small") are:
#CBOW model trained on WikiText-2 dataset
small: 0.588
<unk>: 0.546
smaller: 0.396
architecture: 0.395
fields: 0.385
#Skip-Gram model trained on WikiText-2 dataset
small: 0.638
<unk>: 0.384
wood: 0.373
large: 0.342
chemical: 0.339
#CBOW model trained on WikiText103 dataset
bigger: 0.606
small: 0.526
smaller: 0.273
simple: 0.258
large: 0.258
Sometimes the correct words are within the top 5, but never the closest ones. I assume there are two possible reasons for that:
- An error somewhere in the code. To double-check, I’ve also trained word2vec with the Gensim library on the same datasets – WikiText-2 and WikiText103. The Gensim word embeddings are also unable to solve these equations.
- The more probable reason is that the datasets are too small. WikiText-2 contains 2M tokens and WikiText103 has 100M tokens, while the Google News corpus used in the paper contains 6B tokens, which is 60 (!!!) times larger.
Dataset size really matters when training word embeddings, and the authors also mention that in the paper.
What’s Next?
I hope this post helps you to build a foundation in Deep NLP, so you can move on to more advanced algorithms. For me, it was very interesting to dig deep into the original paper and train everything from scratch. Recommended.
Originally published at https://notrocketscience.blog on September 29, 2021. If you’d like to read more tutorials like this, subscribe to my blog "Not Rocket Science" – Telegram and Twitter.