Skip-Gram Neural Network for Graphs

Yves Boutellier
Towards Data Science
Jul 23, 2021


This article goes into more detail on node embeddings. If you lack intuition about node embeddings, check out the previous article, which covers the intuition behind them. But if you are ready, we can start.


In the level-1 explanation of node embeddings I motivated why we need embeddings: to obtain a vector representation of graph data. Moreover, I said …

Embeddings should capture the graph topology, relationships between nodes and further information. How are we going to grasp all of that so that we end up with a clear procedure?

In the last article I suggested …

A possible shortcut to this problem is to form embeddings such that the embeddings of two nodes are similar in some sense if the nodes themselves are similar in the network.

Okay, to build this up more formally, let’s recap what we have touched on. We want an embedding, but the only thing we have are the nodes of a network. Thus we need a transformation, or function. In the context of embeddings, this function is called an encoder.

Let f denote an encoder that maps a node v to its embedding z, i.e. f(v) = z, where z is a vector in the d-dimensional embedding space and v is a vertex in our graph G.

mapping from nodes to embedding space via an encoder, by author

From my experience, this function f can be hard to grasp at first, especially if the random-walk approach of DeepWalk or node2vec is the first embedding strategy you encounter. In both random-walk approaches, f is simply the look-up of a row (or column) of the embedding matrix, since each node v is represented by one row (or column) of that matrix.
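To make this concrete, here is a minimal sketch of such a look-up encoder in Python (NumPy only; all names and sizes are illustrative and not taken from the original implementations):

```python
import numpy as np

# The "encoder" in DeepWalk / node2vec is just a row look-up in an
# embedding matrix Z of shape (number of nodes, d).
num_nodes, d = 10, 4                 # toy sizes
Z = np.random.rand(num_nodes, d)     # randomly initialized embeddings

def encode(v: int) -> np.ndarray:
    """Map node id v to its d-dimensional embedding z_v."""
    return Z[v]

z_3 = encode(3)                      # the embedding of node 3 is simply row 3 of Z
```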

Next — The mystery about similarity

In the previous article and in the introduction to this article I mentioned the similarity that should be captured. Similarity helps us build embeddings that reflect information about the real-world network. Again, let’s wrap that up more formally.

The decoder is a function that maps the embeddings of two nodes, enc(u) and enc(v), to a similarity score, i.e. DEC(enc(u), enc(v)) should approximate the similarity of u and v in the network:

Decoder

A possible decoder is the dot product; for normalized vectors it equals the cosine similarity.

The cosine similarity is 1 if the two vectors point in the same direction, 0 if they are perpendicular to each other, and -1 if they point in opposite directions.

However, other distance metrics / decoders can be used to compute similarity values as well.
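As a small illustration (again just a sketch with made-up vectors), both decoders are one-liners in NumPy:

```python
import numpy as np

def dot_decoder(z_u: np.ndarray, z_v: np.ndarray) -> float:
    """Similarity score as a plain dot product."""
    return float(np.dot(z_u, z_v))

def cosine_decoder(z_u: np.ndarray, z_v: np.ndarray) -> float:
    """Dot product of the normalized vectors, i.e. cosine similarity in [-1, 1]."""
    return float(np.dot(z_u, z_v) / (np.linalg.norm(z_u) * np.linalg.norm(z_v)))

z_u, z_v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(cosine_decoder(z_u, z_u))    # 1.0  -> same direction
print(cosine_decoder(z_u, z_v))    # 0.0  -> perpendicular
print(cosine_decoder(z_u, -z_u))   # -1.0 -> opposite directions
```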

When I first saw Jure Leskovec’s lecture on node embeddings, I didn’t manage to see how this is a part of node2vec or DeepWalk at all. To my understanding, the decoder is very implicit in both algorithms, but more on that later. Still, the decoder allows us to assess the similarity of two node representations in the embedding space.

Similarity of nodes in a network

There are many possibilities to define similarity. A possible selection is given below.

Similarity can mean that two nodes …

  • are linked
  • share neighbors
  • have similar “structural roles”
  • co-occur in random walks

Let’s focus on random walks a bit more

Why random walks?


But why should random walks and co-occurrence help to describe similarity?

See it this way: if a random walker moves from node u to node v with a high probability, then u and v are similar in the sense that they belong to the same neighborhood or structure of the network.

And why is a random walk approach a good idea? It’s efficient. You can easily parallelize random walks with many walkers. Moreover, you don’t need to consider all possible pairs of nodes in a network that may contain millions of them, and yet random walks describe proximity very well.

But more importantly..

The natural language processing work by researchers at Google (Mikolov et al.) that introduced word2vec inspired a research group at the Department of Computer Science at Stony Brook University in New York to try something similar. Sentences could be mimicked by sequences of nodes produced by random walks, and the group came up with an algorithm called DeepWalk.

Sequences of nodes mimic the sentences needed for the skip-gram model

Maybe you are still a bit sceptical about why this should work. The authors of DeepWalk motivated it as follows:

If the degree distribution of a connected graph follows a power law (is scale-free), we observe that the frequency which vertices appear in the short random walks will also follow a power-law distribution. Word frequency in natural language follows a similar distribution, and techniques from language modeling account for this distributional behavior. To emphasize this similarity we show two different power-law distributions in Figure 2. — Perozzi et al 2014, Deepwalk

Figure 2 from DeepWalk, Perozzi et al. 2014

DeepWalk in turn inspired researchers at Stanford to improve this algorithm. Their result is called node2vec, published in 2016, two years after DeepWalk.

How did word2vec lead to deepwalk?

The group at Google managed to learn word embeddings with a model called skip-gram. They looked at a text and extracted, for each word, its context: the words that surround it. The skip-gram model should then predict, given a word, the context of that word. For example, which words are likely to occur next to “house”? Door, floor, kitchen, etc.

We build up this skip-gram model and subsequently see how it translates to DeepWalk.

First we need a sentence from a whole text.

Next, a window moves over the text to create training samples. In the image below I chose a window of size two. This means we take two words before and two words after our word of interest (highlighted in deep blue). The words in the neighborhood of the word of interest, i.e. the words inside the window, are colored light blue; they form its context. (The whole body of text the model is trained on is called the corpus.)

training samples for skip-gram model for NLP, by author
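Generating these (word, context word) pairs is straightforward. A minimal sketch (the sentence is made up for illustration):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center word, context word) training pairs with a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs += [(center, tokens[j]) for j in range(lo, hi) if j != i]
    return pairs

sentence = "this is a silly example sentence".split()
print(skipgram_pairs(sentence, window=2))
# ... ('example', 'a'), ('example', 'silly'), ('example', 'sentence'), ...
```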

Now that we have training samples we need a model. Let us assume we train on the pair (“example”, “silly”).

Our input is a one-hot encoded vector whose length equals the size of the vocabulary V. The entry with value one is at the position that encodes the word “example”. Then we multiply this vector with an input weight matrix. Multiplying a one-hot encoded vector with a matrix simply selects one row or column of that matrix. This row or column of W_{in} is the embedding vector we are looking for (indicated with a blue vertical line).

This vector, which we now call h, is subsequently multiplied with a second weight matrix, W_{out}. The product is passed into the softmax function, because we want to model a probability distribution over the output words. In our case the input word is “example”, and for each word in the vocabulary we compute the probability of finding it in the context of “example”. This process is supervised: we want the input “example” to yield a probability of 1 for “silly”, since that is what our data suggests; we found “silly” in the neighborhood of “example”.
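In code, the forward pass looks roughly like this (a toy sketch with a tiny made-up vocabulary; real implementations never materialize the one-hot vector):

```python
import numpy as np

vocab = ["this", "is", "a", "silly", "example", "sentence"]
V, d = len(vocab), 3                   # vocabulary size and embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, d))         # input weights: row i = embedding of word i
W_out = rng.normal(size=(d, V))        # output weights

def forward(word: str) -> np.ndarray:
    """One-hot input -> hidden vector h -> softmax over the whole vocabulary."""
    x = np.zeros(V)
    x[vocab.index(word)] = 1.0
    h = x @ W_in                       # equivalent to selecting the word's row of W_in
    scores = h @ W_out
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()             # probability of each word being in the context

probs = forward("example")
print(probs[vocab.index("silly")])     # predicted probability of "silly" near "example"
```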

Representation Learning in Skip-Gram

If the embedding vector h and the weight matrix W_{out} do not produce a softmax output of 1 for the observed context word, we need to change both according to their contribution to the error. This proportional change is achieved with stochastic gradient descent.

The difference between the predicted probability for the word “silly” and the target probability of 1 leads to a change in the embedding vector of the word “example”, such that “example” better predicts “silly”.
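Continuing the toy sketch from above, a single gradient step with the full-softmax cross-entropy loss could look like this (real word2vec uses tricks such as negative sampling or a hierarchical softmax, which are omitted here):

```python
import numpy as np

vocab = ["this", "is", "a", "silly", "example", "sentence"]
V, d = len(vocab), 3
rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(V, d)), rng.normal(size=(d, V))

def sgd_step(W_in, W_out, center, context, lr=0.1):
    """One stochastic gradient step on a single (center, context) pair."""
    i, j = vocab.index(center), vocab.index(context)
    h = W_in[i]                          # embedding (row) of the center word
    scores = h @ W_out
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                 # predicted context distribution
    err = probs.copy()
    err[j] -= 1.0                        # gradient of the loss wrt the scores
    grad_h = W_out @ err                 # gradient wrt the hidden vector h
    W_out -= lr * np.outer(h, err)       # adjust the output weights ...
    W_in[i] -= lr * grad_h               # ... and the embedding of the center word

before = W_in[vocab.index("example")].copy()
sgd_step(W_in, W_out, "example", "silly")
print(W_in[vocab.index("example")] - before)   # the embedding of "example" has moved
```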

Transferring this knowledge to graphs..

Going back to our main interest, graph embeddings, we can draw certain analogies. I already pointed out that the researchers from New York treated a random-walk sequence the way word2vec treats a sentence. The remaining analogies follow the same pattern: a word corresponds to a node, a sentence to a random-walk sequence, the corpus to the whole set of random walks, and a word embedding to a node embedding.

Random Walker in Action

This animation shows how a random walker explores the network. The result is a stored sequence of nodes. This sequence mimics a sentence, so that we can use the skip-gram model.

random walker
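Generating such a sequence takes only a few lines of code. A minimal sketch using networkx (the graph and the walk length are arbitrary choices for illustration):

```python
import random
import networkx as nx

def random_walk(G: nx.Graph, start, walk_length: int = 10) -> list:
    """Uniform random walk: repeatedly hop to a randomly chosen neighbor."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:                      # stop at isolated nodes
            break
        walk.append(random.choice(neighbors))
    return walk

G = nx.karate_club_graph()                     # small example graph shipped with networkx
print(random_walk(G, start=0, walk_length=10))
# a node sequence such as [0, 2, 8, 30, ...] plays the role of a sentence
```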

Approach in Deepwalk

In the DeepWalk publication the authors follow exactly this analogy and use the work of the Google researchers on word embeddings for graph embeddings. That’s nice! I love to see different groups inspire each other. So don’t be shy about looking at what researchers in other fields are trying to solve. Maybe you achieve a breakthrough with inspiration from them.

Let’s go over the algorithm step by step. After randomly initializing the embeddings and randomly choosing a node, we perform a random walk. Subsequently, we slide a window over the resulting sequence. In our example the window size was two. This generates pairs of nodes.

As a next step, the generated pairs of nodes are passed into the skip-gram model. The drawing below shows an example of training on the pair (v_5, v_8).

skip-gram for networks, by author

In our example the embedding h together with the weight matrix W_{out} did not correctly predict node v_8 in the neighborhood of node v_5. That’s why we adapt both of them with gradient descent to obtain more accurate values.

After many random walks and gradient descent iterations, the learning process is stopped. The matrix W_{in}, which contains a vector h_i for every node v_i, represents the embedding. This embedding can be used for downstream machine learning tasks.
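To tie everything together, here is a compact DeepWalk-style sketch: generate random walks and feed them to an off-the-shelf skip-gram implementation. This is not the authors’ original code; it assumes networkx and gensim 4.x are installed, and all hyperparameters are arbitrary.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()

# 1) Generate several short random walks starting from every node.
walks = []
for _ in range(10):                             # number of walks per node
    for node in G.nodes():
        walk, current = [node], node
        for _ in range(9):                      # walk length of 10 nodes
            current = random.choice(list(G.neighbors(current)))
            walk.append(current)
        walks.append([str(n) for n in walk])    # gensim expects string tokens

# 2) Treat the walks as sentences and train a skip-gram model (sg=1).
model = Word2Vec(walks, vector_size=64, window=2, min_count=0, sg=1, epochs=5)

# The learned node embeddings, one vector per node:
z_0 = model.wv["0"]
print(model.wv.most_similar("0"))               # nodes most similar to node 0
```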

Summary

After a short recap of the previous article, which explains why we rely on node similarity to form node embeddings, we saw how random walks offer a computationally efficient way to formulate node similarity. This was a result of applying knowledge from natural language processing to networks. We revised how word2vec works and how it relates to DeepWalk.

[1] Perozzi, B., Al-Rfou, R., Skiena, S. (2014). DeepWalk: Online Learning of Social Representations. KDD 2014.
