Representing discrete objects by continuous vectors, so-called embeddings, has been at the heart of many successful machine learning solutions. Their power comes from the fact that, unlike the original discrete objects, embedding vectors offer a compact representation that captures the similarity between the original objects.
In this article, we consider the famous word2vec algorithm. Word2vec is simple and intuitive. At a high level, it says that words that frequently appear close to each other should have similar vector representations. In particular, the example
embedding(man) – embedding(king) ~ embedding(woman) – embedding(queen)
has become the poster child for the ability of word embeddings to capture word semantics.
However, word2vec's optimization objective is not stated in terms of a well-defined quantity that the embeddings should preserve. For comparison, consider learning word embeddings using matrix factorization. Let D be a text corpus consisting of m documents and a vocabulary of n unique words. We compute the n × m word-document matrix M, where M[u,v] records how many times word u occurs in document v, see Figure 1.
The matrix factorization is defined as
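(in standard notation, where the rows of W are the word embeddings and the rows of V are the document embeddings, both of dimensionality d:)

\[
M \approx W V^{\top}, \qquad W \in \mathbb{R}^{n \times d}, \; V \in \mathbb{R}^{m \times d} .
\]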
In the following, we will slightly abuse notation and denote by u both a word u and its embedding vector.
In this case, we know that for a word embedding u and a document embedding v, the inner product between u and v approximately preserves the number of times word u occurs in document v. The larger the embedding dimensionality d, the better the approximation. Unfortunately, there is no such clear formulation of the optimization objective for the word2vec model. What exactly does the inner product of two word vectors in word2vec preserve? And do the embeddings necessarily become better by increasing the dimensionality d?
A research paper by Levy and Goldberg answers exactly these questions [1]. In this article I present the theoretical results from [1] and then show how they can be used to design a more general class of embeddings.
Training word embeddings: word2vec
Let us briefly consider how word2vec with negative sampling works. For a more comprehensive description, we refer to this article. Let D be the corpus, viewed as a collection of (word, context) pairs. In word2vec, the context of a word w is defined as the k words surrounding w, where k is usually a small constant between 5 and 15. We want to learn word embeddings such that if two words frequently co-occur in the corpus, the inner product of their embeddings is large.
Consider a word w and let c be a word in its context. For word pairs (w,c) occurring together in the corpus, we want the inner product of the embeddings to maximize the probability that (w,c) indeed appears in the corpus (denoted as D=1). The probability is modeled by a sigmoid function:
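(with w and c denoting, as above, both the words and their embedding vectors:)

\[
P(D = 1 \mid w, c) = \sigma(\langle w, c \rangle) = \frac{1}{1 + e^{-\langle w, c \rangle}} .
\]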
The above has a trivial solution: we can simply make all inner products arbitrarily large. Thus, we also introduce negative pairs, i.e., pairs that do not co-occur in the corpus, for which the objective is:
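(for a negative pair (w, c_N), we maximize the probability that it does not appear in the corpus:)

\[
P(D = 0 \mid w, c_N) = 1 - \sigma(\langle w, c_N \rangle) = \sigma(-\langle w, c_N \rangle) .
\]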
The algorithm can be summarized as follows:
Algorithm word2vec
1. Assign a random d-dimensional vector to each word that appears in the corpus.
2. Traverse the corpus and generate the (word, context) pairs that appear in it. These are the positive pairs.
3. For each positive pair (w, c), generate k negative pairs (w, c_N) by pairing w with randomly sampled words. These are the negative pairs.
4. Feed the inner products of the embedding vectors of positive and negative pairs into a binary classification model where the learnable parameters are the word embeddings.
We run the above algorithm for several epochs over the corpus to make sure that the learning process converges to a good solution.
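To make steps 2 and 3 concrete, here is a minimal sketch of how positive and negative (word, context, label) triples could be generated; the function and variable names are illustrative and not taken from any reference implementation:

import numpy as np

def generate_pairs(corpus, window=5, k=5, seed=0):
    """Yield (word_id, context_id, label) triples from a tokenized corpus.
    label is 1 for positive pairs and 0 for negative pairs."""
    rng = np.random.default_rng(seed)
    vocab, counts, docs = {}, [], []
    for doc in corpus:                        # build the vocabulary and unigram counts
        ids = []
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
                counts.append(0)
            counts[vocab[token]] += 1
            ids.append(vocab[token])
        docs.append(ids)
    unigram = np.array(counts, dtype=float)
    unigram /= unigram.sum()                  # negative words sampled proportionally to their frequency
    for ids in docs:
        for i, w in enumerate(ids):
            lo, hi = max(0, i - window), min(len(ids), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                yield w, ids[j], 1            # positive pair (step 2)
                for c_neg in rng.choice(len(unigram), size=k, p=unigram):
                    yield w, int(c_neg), 0    # k negative pairs per positive pair (step 3)

The resulting triples are exactly what the binary classifier in step 4 consumes.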
The theoretical analysis
Fix a word w and consider the objective for all pairs in which w appears. Let #(w,c) be the number of appearances of the pair (w,c) in the corpus. We can write the objective as
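(following the derivation in [1]:)

\[
\prod_{c} \sigma(\langle w, c \rangle)^{\#(w,c)} \;\cdot\; \prod_{(w, c_N)} \sigma(-\langle w, c_N \rangle) ,
\]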
where the second product is over the negative pairs we generate.
By taking the logarithm of the objective and observing that each negative word c_N is sampled with probability #(c_N)/|D|, we obtain:
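(with #(w) denoting the total number of pairs in which w appears:)

\[
\sum_{c} \#(w,c) \, \log \sigma(\langle w, c \rangle)
\;+\; k \, \#(w) \sum_{c_N} \frac{\#(c_N)}{|D|} \, \log \sigma(-\langle w, c_N \rangle) .
\]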
Let us explain the above. The word w is fixed; we consider all (word, context) pairs (w, c) that appear in the corpus, and for each of them we sample k negative pairs (w, c_N), where each negative word c_N is sampled with probability #(c_N)/|D|. For positive pairs, we want the inner product to be a large positive number; for negative pairs, we want it to be a negative number with large absolute value.
Observe that |D|, the number of pairs in the corpus, is constant. Thus, by dividing the above expression by |D| the objective becomes
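(the same expression, now in terms of relative frequencies:)

\[
\sum_{c} \frac{\#(w,c)}{|D|} \, \log \sigma(\langle w, c \rangle)
\;+\; k \, \frac{\#(w)}{|D|} \sum_{c_N} \frac{\#(c_N)}{|D|} \, \log \sigma(-\langle w, c_N \rangle) .
\]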
This already provides us with some intuition. The objective is to optimize the embeddings such that they reflect the probability for a positive pair to be sampled as opposed to a pair being sampled at random. Positive pairs are generated with probability
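\[
P(w, c) = \frac{\#(w, c)}{|D|} .
\]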
And for negative pairs, two words are sampled independently at random, each with probability
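\[
P(w) = \frac{\#(w)}{|D|}, \qquad P(c_N) = \frac{\#(c_N)}{|D|} ,
\]

so a negative pair (w, c_N) is generated with probability P(w) P(c_N).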
By setting the inner product as an unknown parameter and solving the corresponding optimization problem, we can find the optimal value for the inner product:
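(this is the main result of [1]:)

\[
\langle w, c \rangle = \log \frac{P(w, c)}{P(w) \, P(c)} - \log k = \mathrm{PMI}(w, c) - \log k .
\]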
In the above, P(w,c) is the probability of occurrence of the pair (w,c), and P(w) is the marginal probability of occurrence of the word w in the corpus. The optimal value turns out to be a widely used word association measure in natural language processing, the pointwise mutual information (PMI), shifted by the constant log k.
This is pretty amazing! It turns out that word2vec is essentially equivalent to factorizing a matrix whose entries are the (shifted) PMI scores between word pairs. And PMI as an association measure has been used for NLP-related tasks since the late 1980s [2], long before the emergence of word embeddings.
Embeddings based on arbitrary similarity functions
Now it is easy to see that we can simply replace the distributions from which positive and negative pairs are sampled. We only need to update the second and third steps of the word2vec algorithm presented above:
Algorithm sim2vec:
1. Assign a random d-dimensional vector to each word that appears in the corpus.
2. Generate pairs of words according to a word similarity measure. These are the positive pairs.
3. For each positive pair, generate k negative pairs by sampling words independently at random. These are the negative pairs.
4. Feed the inner products of the embedding vectors of positive and negative pairs into a binary classification model where the learnable parameters are the word embeddings.
Why is this helpful? This gives us more freedom to assign importance to pairs. We can become creative and consider different similarity measures. For example, the Jaccard similarity between words is defined as follows:
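(with D_w denoting the set of documents that contain the word w:)

\[
J(w, c) = \frac{|D_w \cap D_c|}{|D_w \cup D_c|} .
\]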
Thus, we can learn embeddings optimized for the objective that the words w and c are similar to each other if the presence of w in a document implies that c is also likely to appear in it, and vice versa. In this case, the pair ("keira", "knightley") will likely receive a higher score than ("data", "science"). The objective becomes:
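(presumably, by the same derivation as above, where Pr(w, c) is now proportional to the Jaccard similarity J(w, c) and Pr(w), Pr(c) are the probabilities used for negative sampling, the optimal inner product takes the form:)

\[
\langle w, c \rangle = \log \frac{\Pr(w, c)}{\Pr(w) \, \Pr(c)} - \log k ,
\qquad \text{with } \Pr(w, c) \propto J(w, c) .
\]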
We can also change how negative pairs are generated. For example, Pr(w) can be the uniform distribution, where all words have the same probability of being selected regardless of how often they appear.
Sampling from a distribution
If we could compute and store the similarity for all pairs (u, v), then sampling according to the similarity would be trivial: just store the pairs with their similarity scores as weights and sample using a function like numpy.random.choice. However, this might be computationally infeasible.
There are different ways to deal with a large number of pairs. In general, we want to use as positive pairs only those that have a high similarity score.
- If your similarity measure is based mainly on counts, then a subsample of the data will preserve the most frequent pairs but many infrequent pairs will be filtered out. For example, we can consider only a subset of the documents in a corpus. Frequent word pairs such as ("data", "science") will likely survive. But this might not be the case with ("keira", "knightley").
- For each object, consider only its t nearest neighbors. For example, we might use the publicly available implementation in scikit-learn, which relies on data structures like k-d trees to speed up similarity search. These work well for data that is not very high-dimensional. Otherwise, one can consider approaches such as locality-sensitive hashing for retrieving similar objects; this works especially well for measures like Jaccard similarity.
A practical implementation
For illustrative purposes, we implemented a simple solution for learning document embeddings from a text corpus. The problem is complementary to training word embeddings: we train vector representations for documents based on the words they contain.
We consider the IMDB sentiment analysis dataset. The dataset consists of movie reviews by users and each review is labeled with a positive or negative sentiment.
- After preprocessing the text, we transformed the documents into vectors using a tf-idf encoding, so that each document is represented by a sparse vector of tf-idf weights:
from sklearn.feature_extraction.text import TfidfVectorizer
tfv = TfidfVectorizer(min_df=0.001, ngram_range=(1, 1))
tf_reviews = tfv.fit_transform(reviews)
The parameter min_df says we consider only words that appear in at least 0.1% of the documents. Essentially, this prevents us from using very specific words that might appear just in a couple of documents.
- For each input vector, find its t nearest neighbors. This can be achieved using an off-the-shelf package such as scikit-learn's NearestNeighbors, which returns the nearest neighbors of a given query vector:
from sklearn.neighbors import NearestNeighbors

nn_model = NearestNeighbors(n_neighbors=t, algorithm='auto')
nn_model.fit(tf_reviews)
# distances and indices of the t nearest neighbors of the review with index idx
distances, indices = nn_model.kneighbors(tf_reviews[idx])
- Compute the similarities for the generated n*t positive pairs, store them in an array, and sample according to their weight using numpy.random.choice():
import numpy as np

eps = 1e-3
# turn distances into sampling weights: closer neighbors get larger weights
weights = [(1.0 / (d + eps)) ** 2 for d in distances[0][1:]]
total = sum(weights)
weights = [w / total for w in weights]
# positive pairs are sampled among the neighbors, negative pairs uniformly at random
pos = np.random.choice(indices[0][1:], nr_pos_samples, p=weights)
neg = np.random.choice(nr_reviews, nr_neg_samples)
- Use a Keras generator to generate positive and negative pairs:
class PairsGen(tf.keras.utils.Sequence):
    def __init__(self, data, nr_pos, nr_neg, ...):
        """Initialization
        :param data: the sparse matrix containing the tf-idf vectors
        :param nr_pos: the number of positive samples per word
        :param nr_neg: the number of negative samples per pos. pair
        ...
        """
        ...

    def __getitem__(self, idx):
        """
        Sample self.nr_pos positive pairs and self.nr_neg negative pairs per positive pair
        """
        ...
- Feed the generated pairs into a shallow neural network with two embedding layers, a Dot layer computing the inner product, and an output layer with a sigmoid activation function:
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_2 (InputLayer) [(None, 2)] 0
__________________________________________________________________________________________________
context_embedding (Embedding) (None, 50) 2500000 tf.__operators__.getitem_3[0][0]
__________________________________________________________________________________________________
word_embedding (Embedding) (None, 50) 2500000 tf.__operators__.getitem_2[0][0]
__________________________________________________________________________________________________
dot_1 (Dot) (None, 1) 0 context_embedding[0][0]
word_embedding[0][0]
__________________________________________________________________________________________________
flatten_1 (Flatten) (None, 1) 0 dot_1[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 1) 2 flatten_1[0][0]
==================================================================================================
Total params: 5,000,002
Trainable params: 5,000,002
Non-trainable params: 0
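A model with this structure could be assembled roughly as follows (a minimal sketch based on the summary above; the number of documents and the variable names are assumptions and not the exact code of the repository):

import tensorflow as tf

nr_documents = 50_000   # assumption, inferred from the parameter counts above (2,500,000 / 50)
emb_dim = 50

pairs = tf.keras.Input(shape=(2,), dtype='int32')            # each row holds (document index, neighbor index)
left, right = pairs[:, 0], pairs[:, 1]

word_emb = tf.keras.layers.Embedding(nr_documents, emb_dim, name='word_embedding')(left)
context_emb = tf.keras.layers.Embedding(nr_documents, emb_dim, name='context_embedding')(right)

dot = tf.keras.layers.Dot(axes=1)([word_emb, context_emb])   # inner product of the two embeddings
output = tf.keras.layers.Dense(1, activation='sigmoid')(tf.keras.layers.Flatten()(dot))

model = tf.keras.Model(pairs, output)

The dense layer on top of the dot product adds a learnable scale and bias before the sigmoid, which matches the two trainable parameters of dense_1 in the summary.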
Compiling and fitting the model trains the embeddings:
model.compile(optimizer='adam', loss= 'binary_crossentropy',
metrics=['accuracy', 'AUC'])
model.fit(generator, epochs=10)
Epoch 1/10
10000/10000 [==============================] - 153s 15ms/step - loss: 0.6354 - accuracy: 0.6667 - auc: 0.5359
Epoch 2/10
10000/10000 [==============================] - 152s 15ms/step - loss: 0.6129 - accuracy: 0.6868 - auc: 0.6151
Epoch 3/10
10000/10000 [==============================] - 151s 15ms/step - loss: 0.5680 - accuracy: 0.7292 - auc: 0.6963
...........
Epoch 10/10
10000/10000 [==============================] - 151s 15ms/step - loss: 0.3891 - accuracy: 0.8323 - auc: 0.8865
Then we can extract the learned embedding of each document and cluster the documents (similarly to what is shown in the animation in Figure 1). We observe that the sentiment distribution in the two clusters is very different.
Code
The Python implementation for the above is publicly available at: https://github.com/konstantinkutzkov/sim2vec
[1] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. Advances in Neural Information Processing Systems (NIPS) 2014, pp. 2177–2185.
[2] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.