A mathematical introduction to the word2vec model

Andrea C.
Towards Data Science
6 min read · Sep 10, 2019


Image by Gerhard Gellinger from Pixabay

As part of an NLP project I recently had to deal with the famous word2vec algorithm developed by Mikolov et al. in 2013.
On the web there are a lot of tutorials and guides on the subject, some more oriented to theory, others with examples of implementation.
However, I was interested not only in learning the theory behind this method, but also in understanding the implementation details, so that I could write and correctly reproduce the algorithm on my own (here). In this respect, I couldn't find anything that really helped me, so I have prepared a couple of articles organized as follows: in this one I deal with the theory of the model, while here I show how to concretely train the model with a neural network.

Note: I find it quite unsatisfactory and annoying to use LaTeX on Medium, so I have chosen to use hand-drawn formulas (and schemes). I apologize in advance for my handwriting.

Word2vec has become a very popular method for word embedding. Word embedding means that words are represented with real-valued vectors, so that they can be handled just like any other mathematical vector. It is a transformation from the string-based domain of text to a vector space equipped with a few canonical operations (mostly, sum and subtraction).

A change of domain

Why word embedding?

With word embedding we can mathematically process the words of a text. If two words have a similar meaning, their vectors are close: word embedding thus provides a measure of similarity, which makes it possible to relate not only words, but also sentences and whole texts. This is very useful in various NLP tasks and systems, such as information retrieval, sentiment analysis or recommender systems.

The papers

In 2013, Mikolov et al. published two seminal papers [1] [2] proposing two different models for word embedding: the continuous bag-of-words (CBOW) model and the skip-gram model. They are not so different in substance, and in the following discussion I will focus on the latter.

The assumption for both models is the Distributional Hypothesis: words that occur in the same contexts tend to have similar meanings (Harris, 1954). That is, a word is characterized by the company it keeps.

Suppose our text is composed of a sequence of words:

Text as a sequence of words

Then, for a word wⱼ, the context of wⱼ is given by its left and right neighborhood:

where M is the half-size of the context window.
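Written out in standard skip-gram notation (with T the total number of words in the corpus), the text and the context read roughly as:

```latex
\text{Text:}\quad w_1, w_2, \dots, w_T
\qquad
C(w_j) = \{\, w_{j-M}, \dots, w_{j-1},\; w_{j+1}, \dots, w_{j+M} \,\}
```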

Then, each word w is assigned a vector representation v, and the probability that wₒ is in the context of wᵢ is defined as the softmax of their vector product:

Output probability is given by softmax of vector product
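In symbols, with W the size of the vocabulary (for simplicity a single vector per word is used here, whereas the original paper distinguishes between input and output vectors):

```latex
P(w_O \mid w_I) = \frac{\exp\left(v_{w_O} \cdot v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left(v_{w} \cdot v_{w_I}\right)}
```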

The objective of the skip-gram model is to predict the context of the central word. Training the model therefore means finding the set of vectors v that maximizes the objective function:

Objective function
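Explicitly, with T the corpus length and M the half-size of the context window:

```latex
J(v) = \frac{1}{T} \sum_{t=1}^{T} \;\; \sum_{\substack{-M \le j \le M \\ j \neq 0}} \log P\left(w_{t+j} \mid w_t\right)
```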

Equivalently, this means minimizing the loss function, which turns out to be the cross-entropy averaged over the corpus (since the true distribution equals 1 only for the words in the context of the central word):
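That is, the loss is just the objective above with the sign flipped:

```latex
L(v) = -\frac{1}{T} \sum_{t=1}^{T} \;\; \sum_{\substack{-M \le j \le M \\ j \neq 0}} \log P\left(w_{t+j} \mid w_t\right)
```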

So far, this is all we need to know about vanilla word2vec.

However, such an approach, which requires evaluating the probabilities of the context of every word, is not computationally tractable for a decently large corpus, which usually contains billions of words. Therefore, Mikolov's group proposed some effective measures to increase the computational efficiency. Surprisingly enough, they also improve the quality of the embeddings. Let's see them.

Word sub-sampling

This one is quite simple. Instead of considering all the words in the text uniformly, we assign to each word a keeping probability which decreases with its frequency of occurrence. The keeping probability is given by:
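Denoting by z(wᵢ) the fraction of the corpus occupied by wᵢ and by s the sampling threshold (0.001 by default in the Google code), the keeping probability takes the form:

```latex
P_{\text{keep}}(w_i) = \left( \sqrt{\frac{z(w_i)}{s}} + 1 \right) \cdot \frac{s}{z(w_i)}
```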

During the generation of the training data, therefore, common words are more likely to be discarded, thus reducing the amount of data to process. Since these words often carry a limited amount of information (think of words like "the", "of", "to"), removing some of them from the corpus also improves the quality of the text and lets the model focus on more meaningful terms.

The formula shown here for the keeping probability is the one included in the Google code implementation, and it slightly differs from the one reported in the original article. Thanks to Chris McCormick for pointing this out³.
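As a purely illustrative sketch (not the Google implementation), word sub-sampling can be written in a few lines of Python:

```python
import random
from collections import Counter

def subsample(tokens, threshold=1e-3, seed=0):
    """Keep each occurrence of word w with probability
    (sqrt(z(w)/s) + 1) * s / z(w), where z(w) is the word's
    fraction of the corpus and s is the sampling threshold."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    keep = {}
    for w, c in counts.items():
        z = c / total
        keep[w] = min(1.0, ((z / threshold) ** 0.5 + 1) * threshold / z)
    return [w for w in tokens if rng.random() < keep[w]]

# Toy example: a larger threshold is used because the corpus is tiny;
# the default 1e-3 is meant for corpora of billions of words.
corpus = "the cat sat on the mat the dog sat on the rug".split()
print(subsample(corpus, threshold=0.05))
```

Frequent words like "the" are dropped with a sizable probability, while rare words are always kept.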

Negative sampling

This technique, conversely, seems a little more obscure, but it has proven to work very well.

Negative sampling considers only a few samples for the evaluation of the skip-gram probability. The samples are called negative because they are words which do not belong to the context of wᵢ (and to which, therefore, the model should ideally assign a probability of zero).

Negative sampling derives from the Noise Contrastive Estimation (NCE) technique, and the idea is basically to transform a multinomial classification problem into a binary classification problem:

  • Multinomial classification problem. The model estimates the probability distribution of the output word over the whole vocabulary, using the softmax function.
  • Binary classification problem. For each training sample, the model is fed with a positive pair (a center word and another word that appears in its context) and a small number of negative pairs (the center word and a randomly chosen word from the vocabulary). The model learns to distinguish the true pairs from the negative ones (see the sketch below).
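Here is a minimal, illustrative sketch of how such training pairs could be generated. For brevity the negatives are drawn uniformly from the vocabulary, while the original implementation draws them from the unigram distribution raised to the 3/4 power:

```python
import random

def make_training_pairs(tokens, vocab, window=2, k=5, seed=0):
    """For each (center, context) pair inside the window, emit one positive
    example (label 1) and k negative examples (label 0) whose second word
    is sampled at random from the vocabulary."""
    rng = random.Random(seed)
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            pairs.append((center, tokens[j], 1))      # positive pair
            for _ in range(k):                        # negative pairs
                pairs.append((center, rng.choice(vocab), 0))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))
print(make_training_pairs(tokens, vocab, window=1, k=2)[:6])
```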

With negative sampling the objective function becomes:

Objective function with negative sampling
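For a single pair of central word wᵢ and context word wₒ, with k negative words drawn from a noise distribution Pₙ(w), the term to maximize reads:

```latex
\log \sigma\left( v_{w_O} \cdot v_{w_I} \right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left( -\, v_{w_i} \cdot v_{w_I} \right) \right]
```

In the paper, Pₙ(w) is the unigram distribution raised to the power 3/4.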

where σ is the logistic (sigmoid) function:

Sigmoid function
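That is:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}
```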

Unlike the softmax, the sigmoid is a single-argument function and therefore does not require all the output values to be evaluated at the same time.
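To make the difference concrete, here is a toy numpy comparison (the vocabulary size, embedding dimension and word indices below are arbitrary): the softmax normalization needs a dot product with every word vector, while the negative-sampling score only needs k + 1 of them.

```python
import numpy as np

rng = np.random.default_rng(0)
W, d, k = 10_000, 100, 5               # vocabulary size, embedding dim, negatives
V = rng.normal(size=(W, d))            # toy word vectors, one per word

center = V[42]                          # vector of the input (center) word

# Softmax: the normalization term needs a dot product with *every* word vector.
scores = V @ center                     # shape (W,)
softmax = np.exp(scores - scores.max())
softmax /= softmax.sum()

# Negative sampling: only 1 positive + k negative dot products per training pair.
pos = 7                                 # index of the true context word
negs = rng.integers(0, W, size=k)       # indices of the sampled negative words
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
pos_score = sigmoid(V[pos] @ center)        # pushed towards 1 during training
neg_scores = sigmoid(-(V[negs] @ center))   # also pushed towards 1
```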

Why does negative sampling work so well?

The question is still under discussion. The distributional hypothesis appears to hold, and negative sampling really exploits it by increasing the similarity between central and context words. Goldberg and Levy⁴ argue that sub-sampling could also have a beneficial effect: by eliminating frequent words from the corpus, we are selectively extending the size of the context, so we could capture relationships with informative words that would otherwise remain outside the window.
