Node Embeddings for Beginners

Yves Boutellier
Towards Data Science
10 min read · Jul 23, 2021


The first time I heard about networks at university was in a course on ecological networks. That course made things crystal clear for me: I wanted to know more about networks, and I dreamed of a job in which I would work with them. But I knew nothing about machine learning on graphs, so I took it upon myself to search the internet for free online courses, and I found true treasures.

Photo by Jouwen Wang on Unsplash

I found a free course on YouTube from Stanford University, taught by a leading figure in graph machine learning, Jure Leskovec. The course is called Machine Learning with Graphs. This article and the others I am about to write about graph machine learning are my way of incorporating the new knowledge I am gaining from that series. If you haven’t touched graph machine learning yet, this article gives a gentle introduction and helps you get started with his free course.

I therefore chose to make this piece introductory and to follow it up with an article that adds more detail to the insights this one might bring you. So if this article leaves you with a growing desire for more knowledge, you should check that one out. I hope you enjoy it.

TL;DR

  • ML algorithms need some form of vector input to be executed
  • A network consists of nodes and edges which connect the nodes
  • We can ask different questions about network data, e.g. to which group does a node belong, or might a link exist between two nodes
  • Networks have no natural ordering and thus need a different approach for ML
  • Node embeddings are a way of representing nodes as vectors
  • Network or node embedding captures the topology of the network
  • The embeddings rely on a notion of similarity.
  • The embeddings can be used in machine learning prediction tasks.

The purpose of Machine Learning — What about Machine Learning on graphs?

As with other types of data, we want to make inferences about the underlying information captured by data obtained in the real world. This is where machine learning comes into play.

Machine learning algorithms provide us with models that help us understand data, in our case networks, that we have not seen before. But when I say understand networks, what do I mean?

What questions can we ask about networks?

Let’s build an example network and see what kinds of questions we can ask.

Social network, drawn by author

This is a social network. Each node refers to a person and each link or edge means that the two people connected by this edge know each other. So for example, Joanna knows Peter and Pierre, but she doesn’t know Mary.

When examining this network we might wonder why Peter is connected to Joanna [1, i.e. do they belong to the same group (label)?], whether Peter might also know Mary [2], or whether Vincent and Joanna have similar roles in their social networks [3].

Summary of possible questions (not exhaustive)

  1. Node feature (Label/Group) prediction based on the assumption of homophily
  2. Link Prediction
  3. Node feature (role) prediction

How are we going to answer those questions?

Let’s take a step back and recall how we answer questions with other types of data. I provide a short summary of possible problems and approaches, just to familiarize ourselves with the general pattern of problem solving.

Linear Regression

If we want to predict the price of a house, we gather its attributes or features and put them into a vector. Multiplying this vector by a weight (slope) vector and adding an intercept gives us the price prediction we need. The vector has the same features for every house, and an ordering of those features is decided at the start: the first entry always describes the zip code, for example, the second the age of the building, and so on. The idea that each feature always occupies the same position is so natural that most of us with data science/ML experience have never thought about it.

by author
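To make that fixed ordering concrete, here is a minimal sketch in Python; the features, weights, and intercept are made up purely for illustration:

```python
import numpy as np

# A hypothetical house as a feature vector with a fixed ordering:
# [zip code, age of the building, number of rooms]
house = np.array([8001.0, 25.0, 4.0])

# Made-up learned weights (slopes) and intercept
weights = np.array([10.0, -1500.0, 90000.0])
intercept = 50000.0

# Prediction: dot product of features and weights, plus the intercept
price = house @ weights + intercept
print(price)
```

Because the first entry is always the zip code and the second always the age, the same weight vector applies to every house.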

CNN

If we want to classify images, we might choose a Convolutional Neural Network (CNN). Images are basically arrays/matrices of pixel values, and a CNN classifies the picture based on those values. The whole array is transformed over multiple steps that preserve the important information (which pixel has which value and how that value relates to neighboring values) and is finally passed into a fully connected neural network that weights the relationships between the pixels. But the most important thing in the context of this article is, again, the natural ordering: each image has a top and a bottom, a left and a right. Thus, before the array (or more correctly, the stacked arrays, i.e. a tensor) is passed into the fully connected neural network, the pixel values are put into a vector. This unfolds the same way for every picture, thanks to the natural ordering of pictures.

by author
by author
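As a minimal sketch of that unfolding, assuming a hypothetical 28x28 RGB image:

```python
import numpy as np

# A hypothetical 28x28 RGB image: stacked arrays of pixel values, i.e. a tensor
image = np.random.rand(28, 28, 3)

# Because pixels have a fixed ordering (top to bottom, left to right),
# every image unfolds into a vector in exactly the same way
vector = image.flatten()
print(vector.shape)  # (2352,)
```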

RNN

In all flavours of recurrent neural networks (RNNs), we pass in vectors that reflect measurements from a time series or a sequence. Every input has the same form, and the network learns relationships between exactly those entries and predicts future values.

by author
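A minimal sketch of how every sequence input shares the same form, using a made-up time series:

```python
import numpy as np

# A made-up daily measurement series
series = np.array([3.1, 3.4, 3.3, 3.8, 4.0, 4.1, 4.3])

# Every training input has the same form: a window of 3 consecutive values,
# with the following value as the prediction target
window = 3
inputs = np.array([series[i:i + window] for i in range(len(series) - window)])
targets = series[window:]
print(inputs.shape, targets.shape)  # (4, 3) (4,)
```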

Coming back from these three examples to the overarching principle, we can conclude that it is necessary to have some form of vector that uniquely describes the data source, whether it is a house, an image, or a sequence. Additionally, the examples show that the data must have some underlying ordering.

The problem with graphs…

gif by author

Graphs do not have a natural order or reference point: there is no top and bottom, no left and right. This is different from the types of data we can handle with linear regression, CNNs, or RNNs. A picture consists of a regular lattice; RNNs learn sequences of well-ordered vectors; even an arbitrary entity like a house can be abstractly described by a vector, even if that vector does not fully capture reality.

And any attempt to number the nodes of the network and write it down as a matrix fails to generalize patterns in networks, since the adjacency matrix of a given graph is not unique: it depends on the arbitrary order in which we number the nodes.

graphs don’t have unique adjacency matrices, by author
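A minimal example of this non-uniqueness, using a tiny path graph A - B - C:

```python
import numpy as np

# The same path graph A - B - C written down under two node orderings
# ordering 1: [A, B, C]
adj1 = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])

# ordering 2: [B, A, C] -- same graph, nodes counted in a different order
adj2 = np.array([[0, 1, 1],
                 [1, 0, 0],
                 [1, 0, 0]])

# Two different matrices, one identical graph: a model that reads the
# matrix entry by entry would treat these as two different inputs
print(np.array_equal(adj1, adj2))  # False
```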

But we still want to make inferences about graphs

Still, we need a procedure that returns a vector, whether for a graph, a node, or an edge. Luckily, such procedures have already been invented. For the rest of the article I will focus on node embeddings, but similar approaches work for entire graphs or for edges.

Node representation learning

Finding a vector for a node is possible with node representation learning. The result of such representation learning is a node embedding (graph embeddings and edge embeddings exist as well), so I will use the terms representation and embedding interchangeably.

graph representation, by author

Node embeddings

But what rules should node embeddings obey in order to allow inference?

Embeddings should capture the graph topology, the relationships between nodes, and any further relevant information. How exactly they should capture this inherent information is not fixed; it depends on the questions we ask about the network. So how do we grasp this in a way that gives us a clear procedure?

A possible shortcut is to form embeddings such that the embeddings of two nodes are similar in some sense whenever the nodes are similar in the real network. Whatever this similarity might be is up to us to decide.

We can decide to form embeddings based on the principle of similarity: nodes that are similar in the network will have similar embeddings.
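In code, “similar embeddings” usually means something like a high dot product or cosine similarity. Here is a sketch with made-up 2D embeddings for three nodes of our social network:

```python
import numpy as np

def cosine_similarity(u, v):
    # similarity of two node embeddings in the embedding space
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up 2D embeddings, chosen only for illustration
joanna = np.array([0.9, 0.1])
peter = np.array([0.8, 0.2])
mary = np.array([-0.7, 0.6])

print(cosine_similarity(joanna, peter))  # close to 1: similar
print(cosine_similarity(joanna, mary))   # much lower: dissimilar
```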

If we go back to the introductory example, where we have a social network and we want to assign the different people to groups (that was one possible question), we can try to form embeddings that help us do so. One concept that has been found useful is called homophily. In many real-world networks, especially social networks, homophily has been found to be an organising principle. Put simply, we want to group people together who spend time together.

💡 So the vector representation of Joanna needs to be similar to the vector representation of Peter, since they are neighbors. — Homophily

Photo by Bùi Thanh Tâm on Unsplash

Again, note that this is only true if we learn a feature representation that follows the heuristic concept of homophily. Another concept that researchers have found to be an organising principle is the notion of structural equivalence: nodes are similar if they play the same structural role, e.g. a hub that connects bridges and/or leaves.

💡 So the vector representation of Joanna needs to be similar to the vector representation of Vincent, since they have the same role in their peer groups. — Structural Equivalence

To make things crystal clear, I drew the two scenarios. On the left you see an embedding that would result from purely homophilous organisation, and on the right one organised by structural equivalence alone. I schematically drew the 2D projection of the embedding space, since that is how embedding spaces are usually visually assessed.

Embedding plot after dimensionality reduction with e.g. t-SNE, by author

Note, for example, that the embedding values for Joanna and Pierre are very similar if similarity is based on homophily, but very different if it is based on structural equivalence. There it is Vincent who has an embedding similar to Joanna’s, since both connect the rectangle peer group with a triangle peer group.

However, most algorithms let this unfold in a more unsupervised manner, since networks can be structured by principles we do not know. Remember, we want to capture the network topology and the relationships within the network. Therefore the representation should usually not follow homophily or structural equivalence exclusively. Still, research has shown that those organising principles do exist in real networks, so it is reasonable to assume that the embeddings incorporate them.

In my other article, which goes into more depth about node embeddings, you can see that an algorithm called node2vec uses a generic notion of similarity to produce node embeddings that can follow the concept of homophily, the concept of structural equivalence, or a mix of both, while DeepWalk, a “precursor” of node2vec, is more restricted and thus less powerful in its application.

But note: node similarity in both node2vec and DeepWalk relies on node co-occurrence in random walks.
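To give a feel for what “co-occurrence in random walks” means, here is a sketch of a uniform random walk on our toy social network. DeepWalk uses uniform walks like this one, while node2vec biases the step choice with its return and in-out parameters:

```python
import random

# The toy social network as an adjacency list
graph = {
    "Joanna": ["Peter", "Pierre"],
    "Peter": ["Joanna", "Mary"],
    "Pierre": ["Joanna"],
    "Mary": ["Peter"],
}

def random_walk(graph, start, length):
    # uniform random walk: nodes that often appear together in such
    # walks are considered similar and get similar embeddings
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

print(random_walk(graph, "Joanna", 5))
```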

Figure 3 from node2vec, Grover et al., 2016, showing the co-appearances of characters in Les Misérables

In the figure above from the node2vec paper, the nodes were labelled after clustering and grouping them in the embedding space in an unsupervised manner (no groups were known to the algorithm). A purely homophily-based embedding led to six groups (top), while a purely structural-equivalence-based embedding led to three groups (bottom).

To summarize this chapter: we now know that we want to form embeddings that reflect the information contained in the network. To approach this vague goal we introduced the notion of similarity. We touched on the ideas of homophily and structural equivalence, but we learned that real networks exhibit a complex mix of both, along with other organizing principles.

But how do we obtain those embeddings?

To begin with, we have to choose the size of our embedding space. In node2vec, for example, the Les Misérables network was embedded in 16 dimensions. Then, since we have no starting hypothesis for how the embedding space will look in the end (this is unsupervised: we make no use of labels whatsoever), we initialize the embeddings randomly. We then conduct representation learning, which over a number of iterations leads to a good embedding that reflects the topology of the network.

During representation learning we follow the goal for which we built intuition before: we measure, in some way, the similarity between two nodes in the network and compare it to their similarity in the embedding space. If the similarity in the embedding space does not reflect the similarity in the network, we adjust the embeddings.

I omitted the details of representation learning on purpose, since this article is about building a strong intuition.

Summarized

  1. decide how large the embedding space is
  2. randomly initialize the embeddings for each node/graph/edge
  3. learn the embeddings by repeatedly and incrementally improving them so that they reflect similarity in the network

(The 3rd step is a loop; a code sketch of it follows below.)
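Here is a deliberately simplified sketch of that loop. The network_similarity function is a placeholder (real algorithms derive similarity from random-walk co-occurrence and train with techniques such as negative sampling), but the structure of the three steps is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
nodes = ["Joanna", "Peter", "Pierre", "Mary", "Vincent"]

dim = 16                                        # 1. choose the embedding size
emb = {n: rng.normal(size=dim) for n in nodes}  # 2. random initialization

def network_similarity(a, b):
    # placeholder: 1.0 if the nodes should be similar, else 0.0;
    # a real method would derive this from the graph itself
    similar = {("Joanna", "Peter"), ("Peter", "Joanna")}
    return 1.0 if (a, b) in similar else 0.0

lr = 0.01
for step in range(200):                         # 3. incremental improvement loop
    for a in nodes:
        for b in nodes:
            if a == b:
                continue
            # compare embedding-space similarity to network similarity
            error = emb[a] @ emb[b] - network_similarity(a, b)
            # nudge the embedding to shrink the mismatch
            emb[a] = emb[a] - lr * error * emb[b]
```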

So what? What can we do with the node embeddings?

Once our algorithm has learned the embeddings, we can use the vectorized data to gain insight into the network, using well-known machine learning tools. For example, we can form unsupervised groups within the embedding space with k-means clustering.
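A sketch of that last step with scikit-learn’s k-means; the embeddings here are random stand-ins for the real learned ones:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for learned embeddings: one 16-dimensional vector per node
embeddings = np.random.rand(34, 16)

# Unsupervised grouping directly in the embedding space
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)
print(labels)  # one group id per node
```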

Evidence that embeddings learned this way lead to a faithful representation of the graph is shown in the image below from the DeepWalk paper: the embeddings on the right reflect the groups that are part of the network on the left.

DeepWalk, Perozzi et al., 2014, Figure 1

Summary

You have learned which questions we can ask about networks from a machine learning/data science perspective. You saw that approaches that apply to other common data types, like images or sequences, do not work on graphs. However, you learned that node embeddings, or more generally graph embeddings, can be the first step to solving machine learning problems on graphs. We introduced the notion of similarity that helps us obtain embeddings. And once we have the embeddings, you saw how the original graph data can translate into ML applications.

[1] A. Grover and J. Leskovec, node2vec: Scalable Feature Learning for Networks, 2016

[2] B. Perozzi, R. Al-Rfou and S. Skiena, DeepWalk: Online Learning of Social Representations, 2014

[3] Machine Learning with Graphs, Stanford Online course on YouTube: https://www.youtube.com/watch?v=3IS7UhNMQ3U&list=PLoROMvodv4rPLKxIpqhjhPgdQy7imNkDn&index=4
