Machine Learning on Graphs, Part 2

Node embeddings.

Photo by Alina Grubnyak on Unsplash

In the previous post in this series I presented basic graph statistics that should be collected before starting to develop machine learning models for your graph data. In this second post I will present the first proper machine learning approach: learning to represent discrete objects like graph nodes by continuous vectors, the so-called node embeddings. Similarly to word embeddings, node embeddings can then be used in downstream machine learning applications.

Before presenting algorithms for learning node embeddings, let me elaborate a bit on my primary objective for the next posts.

Motivation. Graphs are ubiquitous. From social networks to traffic hubs, graphs model relationships between objects. In a sense, graphs generalize sequential data to more complex interactions. Not surprisingly, learning from graph data has become a popular topic both in academia and industry. And given the overall dominance of deep learning, the most powerful algorithms for learning from graph data are based on graph neural networks, a special kind of neural network specifically designed to deal with graph data. In academia, where the objective is to publish research papers in highly competitive venues, most new works on graph learning are based on graph neural networks because they improve upon the best known results on benchmark datasets. The focus in most courses on graph machine learning is also on graph neural networks. For example, consider this excellent Stanford course on machine learning for graphs. The course material is of very high quality and I highly recommend it to everyone who wants to go deeper into graph machine learning. However, about 70% of the lectures are about graph neural networks, and more basic techniques like graph kernel machines are omitted. From my experience and observations, this creates the impression that the default approach to learning from graphs is graph neural networks.

On the other hand, it is common knowledge that when approaching a new problem one should start with a simple solution with well-understood properties. And for many practitioners and machine learning enthusiasts, the primary objective when dealing with a new problem is to first try out a few things before investing too much time into an advanced solution. This is particularly true for graph-related machine learning, where we often have to solve a problem for which no explicit graph is given but we can create our own graph using some meta-information about the data. For example, suppose we are given a road traffic dataset for a big city and we want to predict when and where there will be traffic jams. We might want to try out a graph approach by creating nodes corresponding to street locations and connecting them by edges that show how traffic jams propagate from node to node. But of course there is no guarantee that the graph approach is promising, hence we want a proof of concept rather than a full-fledged solution.

And a neural network, even a simple one, is not a simple solution. This is even more true for graph neural networks since graphs do not have a total order of their nodes and therefore exhibit a much more complex structure than sequential or grid structured data like text and images.

Therefore, in the next few blog posts I want to concentrate on techniques for graph machine learning that are not based on graph neural networks. I will present some simple and intuitive approaches to machine learning on graphs.

What should we expect from a simple graph machine learning algorithm?

  • Easy to implement. We want a proof of concept that graph machine learning is a promising direction for our problem and we should be able to implement a solution easily.
  • Efficient. The model should be able to provide first results in reasonable time even without access to expensive computer hardware.
  • Exploits the graph. Assume we collect only the node degrees as node features. This is easy to implement and efficient to train. But such a solution would be largely graph agnostic as it disregards the richer graph structure, which ultimately is what we want to utilize in our model.
  • Interpretable. Even if less accessible to a human than images or text, graphs still offer some intuition about the nature of the data. Ideally, a machine learning model should be able to aid that intuition.

OK, let’s get started with node embeddings, a powerful approach to graph machine learning.

What are node embeddings?

Inspired by the success of algorithms like word2vec and GloVe for learning word embeddings from natural language data, researchers invented algorithms for learning embeddings for graph nodes. For word embeddings, we want to learn continuous vector representations such that words that appear frequently together have similar representations. It is easy to define "appear together" for text sequences: these are words that are close to each other. But graphs are more complex. Consider the example graph in Figure 1. The yellow nodes v1 to v12 are immediate neighbors of u. On the other hand, the red node z can be reached by many more paths from u and appears to be a more central node. So, which are more representative of u: the neighbors v1 to v12, or node z?

Figure 1. Image by author.

Random walks

Random walks are a simple method for traversing a graph. Starting from a given node, we select uniformly at random one of its neighbors, then select uniformly at random a neighbor of that neighbor, and so on. Random walks are easy to implement and intuitive.

The properties of random walks have been studied for decades and are at the heart of several important results in graph theory. They are a very powerful tool because of their capability to capture the graph structure. Highly connected or easily reachable nodes are more likely to be visited and this corresponds to our intuition of importance.

More formally, for a connected graph G let A(G) be its adjacency matrix, where A[u, v] = 1 if and only if there is an edge between nodes u and v. (We assume nodes are integer numbers between 1 and n.) From the adjacency matrix, we define the transition probability matrix as P[u, v] = A[u, v]/d(u), where d(u) is the degree of node u. P[u, v] thus denotes the probability to move to node v if we are currently at node u. Observe that the probabilities in each row of P must sum to 1. Also, note that for undirected graphs the adjacency matrix is symmetric but the transition matrix is in general asymmetric; see the example below.

Figure 2. Unweighted graph. Image by author.
The transition matrix for the unweighted graph. Image by author.
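
As a small sketch of how this looks in code (my own illustration, assuming numpy and a connected graph so that every node has at least one edge), the transition matrix is obtained by normalizing each row of the adjacency matrix by the node degree:

import numpy as np

def transition_matrix(A):
    # Row-normalize the adjacency matrix: P[u, v] = A[u, v] / d(u).
    # Works for weighted adjacency matrices as well, where d(u) is the sum of edge weights.
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1, keepdims=True)  # d(u) for every node u
    return A / degrees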

Given the transition matrix, a stationary distribution π is defined as a probability distribution over the nodes that remains unchanged after one step of the walk, i.e., π P = π, or equivalently π[v] = Σ_u π[u]·P[u, v] for every node v, with Σ_v π[v] = 1.

The stationary distribution gives the probability to be at a given node in a very long random walk, independently of where we started. In fact, the stationary distribution is at the heart of Google’s PageRank algorithm. Essentially, the more likely we are to visit a page while browsing for a long time at random, the higher rank the page will have.
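
To make this concrete, here is a minimal sketch (my own, reusing the transition_matrix helper above) that approximates the stationary distribution by power iteration, i.e., by repeatedly applying one random walk step to a distribution vector:

import numpy as np

def stationary_distribution(P, n_iter=1000):
    # Start from the uniform distribution and repeatedly apply one step of the walk.
    # Converges for connected, non-bipartite graphs.
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        pi = pi @ P
    return pi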

Random walks for weighted graphs.

Many graphs have weighted edges where the weights correspond to the association strength between nodes. For example, in a graph of the world’s airports, the edge weights can correspond to the inverse of the flying time between cities with a direct flight connection. In this case, a random walk samples a random neighbor according to the corresponding edge weights. Consider the weighted graph in Figure 3 below and the corresponding transition matrix. From node 1 it is most likely that we visit node 3 because that edge weight is larger. Note that the edge weight between nodes 1 and 2 only matters when we are at node 1. If we are at node 2, the weight does not matter as the only neighbor is node 1.

Figure 3. Weighted graph. Image by author.
The transition matrix for the weighted graph. Image by author.
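
To give an idea of how this changes the sampling step, here is a small sketch of my own (it assumes a networkx graph whose edges carry a 'weight' attribute; both the attribute name and the use of random.choices are assumptions, not part of the original code):

import random

def sample_weighted_neighbor(G, u, weight='weight'):
    # Pick a neighbor of u with probability proportional to the weight of the connecting edge.
    neighbors = list(G[u])
    weights = [G[u][v].get(weight, 1.0) for v in neighbors]
    return random.choices(neighbors, weights=weights, k=1)[0]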

Node embeddings

Random walks open the door to extending word embedding learning algorithms to graph data. Namely, we can create node sequences by generating random walks and feed those into a model for learning word embeddings. The implementation is simple and intuitive:

import random

def random_walk(G, u, k):
    # Generate a random walk of length k starting from node u.
    curr_node = u
    walk = []
    for _ in range(k):
        # move to a uniformly random neighbor of the current node
        curr_node = random.choice(list(G[curr_node]))
        walk.append(curr_node)
    return u, walk

Once we have generated a random walk, we generate positive and negative node pairs.

def generate_pairs(G, u, walk, nr_neg_pairs):
    positive_pairs = []
    negative_pairs = []
    # nodes visited by the walk form positive pairs with the start node u
    for v in walk:
        positive_pairs.append((u, v))
    # negative pairs: pair u with nodes sampled uniformly at random from the whole graph
    for _ in range(nr_neg_pairs):
        w = random.choice(list(G.nodes()))
        negative_pairs.append((u, w))
    return positive_pairs, negative_pairs

We then feed the positive and negative pairs into a model that learns the embeddings of the nodes. Each node u is initially represented by a random vector vec(u) that is updated whenever we see a positive or negative pair involving u. This is the standard word2vec approach with negative sampling. We train a logistic regression model where the input is the inner product of the node embeddings for positive and negative node pairs. I refer to the implementation for more details on this.
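
To make the training step concrete, here is a rough numpy sketch of logistic regression on the inner product of the two embeddings (this is my own minimal version, not the implementation linked at the end of the post; the learning rate and the in-place updates are illustrative choices):

import numpy as np

def train_on_pairs(emb, positive_pairs, negative_pairs, lr=0.025):
    # emb is a dict mapping each node to its (mutable) numpy embedding vector.
    # Positive pairs should get a high inner product (label 1), negative pairs a low one (label 0).
    for pairs, label in [(positive_pairs, 1.0), (negative_pairs, 0.0)]:
        for u, v in pairs:
            score = 1.0 / (1.0 + np.exp(-np.dot(emb[u], emb[v])))  # sigmoid of the inner product
            grad = score - label  # gradient of the logistic loss w.r.t. the score
            grad_u, grad_v = grad * emb[v], grad * emb[u]
            emb[u] -= lr * grad_u
            emb[v] -= lr * grad_v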

A few practical considerations:

  1. How long should the random walks be? Don’t forget the graph diameter from the previous post. For graphs of small diameter, any pair of nodes is connected by a short path. One should be careful not to use overly long random walks, as they might quickly converge to the stationary distribution and generate similar walks, and thus similar embeddings.
  2. Reweighting the nodes in each sequence. You might be tempted to assign different weights to different node pairs. However, note that random walks already take into account node distances. A node that is further away from our starting node is usually less likely to appear in a random walk. And if it does appear often, then it is an important node connected by many paths to our starting node. You might want to think about assigning different transition probabilities between neighbor nodes. But before doing so you should have a solid understanding of your data. So, unless you really know what you are doing, it is safest to use the standard random walk algorithm.

Structural role node embeddings.

The random walk approach assumes that nodes that are easily reachable from each other are also similar to each other. This is certainly justified for many applications. However, sometimes we are interested in the structural role a node plays in the graph, and not so much in who its neighbors are.

Let us consider a concrete example. Some nodes can be the center of a hub, like a Twitter user with many followers (the red nodes in Figure 4). Alternatively, some nodes might have a low degree but play a bridge role between communities and are thus important for tasks like tracking influence propagation (the green nodes in Figure 4). Using a standard random walk we can capture the communities to which the nodes belong, but not the roles of the nodes within these communities.

We will now discuss how to learn embeddings with respect to the structural roles individual nodes have.

Imagine we have labels for the nodes. Then we generate sequences by running a random walk, but instead of node ids we collect the node labels. In this way we generate long strings, and we want to force nodes with similar label strings to have similar embeddings. The intuition is that nodes with similar structural roles will have similar sequences of node labels. For example, a popular music star will have many followers who are young and interested in music. And the same pattern will hold for pop stars around the world, regardless of whether they are in the USA, India or Russia.
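
A sketch of this sequence generation (it reuses the random_walk function from above and assumes a hypothetical dict labels mapping each node to its label):

def label_walk(G, labels, u, k):
    # Run a plain random walk but record node labels instead of node ids.
    _, walk = random_walk(G, u, k)
    return [labels[v] for v in walk]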

Of course, we now have to generate similar and dissimilar objects in some way. A simple solution is to define a similarity function between the generated sequences and sample node pairs according to this similarity function. Similarity is a very general concept, so let us consider a concrete example that provides more intuition.

Example. Consider Figure 4. Assume the nodes are labeled according to their degree as H (high degree), M (medium degree) and L (low degree). In Figure 4 I encoded the node labels by colors: low degree nodes are blue, medium degree nodes are green, and high degree nodes are red. Assume we collect random walks of length k = 5. A walk generates a sequence like L,H,M,M,M. A natural way to compare such sequences is to compare their n-gram distributions. For n = 2 the grams of the walk L,H,M,M,M are (L,H), (H,M), (M,M), (M,M). The two walks L,H,M,M,M and L,H,L,H,M look similar to each other but are not identical: counting their 2-grams, we see that 2 out of the 4 2-grams are identical.

Figure 4. The colors are node labels. Image by author.
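
One way to turn this comparison into a similarity score (my own sketch; normalizing by the number of n-grams per walk is a convenient choice, not a prescribed one) is to count overlapping n-grams between two label sequences:

from collections import Counter

def ngram_similarity(walk_a, walk_b, n=2):
    # Compare the multisets of n-grams of the two label sequences.
    grams_a = Counter(tuple(walk_a[i:i + n]) for i in range(len(walk_a) - n + 1))
    grams_b = Counter(tuple(walk_b[i:i + n]) for i in range(len(walk_b) - n + 1))
    overlap = sum((grams_a & grams_b).values())
    return overlap / max(sum(grams_a.values()), sum(grams_b.values()))

# For the walks above: ngram_similarity(['L','H','M','M','M'], ['L','H','L','H','M']) returns 0.5,
# reflecting that 2 out of 4 2-grams are identical.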

In the second part of my post on learning embeddings for arbitrary similarity functions, I discuss approaches that apply here as well. Please refer to the techniques described there for ideas on how to generate samples of similar nodes in the setting described above.

How to use node embeddings?

Node embeddings are a natural choice for node classification or regression tasks. For example, suppose your graph is a network of cities and villages where nearby towns are connected by an edge. We want to predict the weather for each town in our network, but it is too expensive to collect historic weather data for every node. By setting up meteorological stations in only a subset of the towns, we create a training dataset. Using the node embeddings as input features, we can then predict the weather conditions for all towns using our favorite machine learning algorithm. And of course, we can augment the embeddings with additional data. For example, collecting data about precipitation is expensive, so we settle for a subset of the towns, but we can afford to collect temperature and atmospheric pressure data for all nodes. We can then combine embeddings that capture geographic proximity between towns with these additional features.
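
For a classification variant of this example, a minimal scikit-learn sketch could look as follows (emb and labels are hypothetical inputs: a dict of node embeddings and a dict of known labels for the measured towns):

import numpy as np
from sklearn.linear_model import LogisticRegression

def classify_nodes(emb, labels):
    # Train on the nodes with known labels and predict a label for all remaining nodes.
    labeled = list(labels)
    X = np.array([emb[u] for u in labeled])
    y = np.array([labels[u] for u in labeled])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    unlabeled = [u for u in emb if u not in labels]
    preds = clf.predict(np.array([emb[u] for u in unlabeled]))
    return dict(zip(unlabeled, preds))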

But what if we want to use embeddings for the classification of subgraphs or even independent graphs? For example, we want to classify users according to their browsing behavior. This generates independent graphs with nodes corresponding to visited websites and edges to hyperlinks between web pages. Can we use the individual node embeddings for each web page to classify these small subgraphs of the global web graph?

From node embeddings to (sub)graph embeddings.

A standard technique is to simply take the average of all node embeddings in the (sub)graph. This is a reasonable choice, but we are likely to lose information this way. Think about it like this: individual embedding dimensions represent different features of the users. For example, the first dimension shows interest in politics, the second in music, the third in sports, the fourth in cinema, etc. By averaging the embeddings we learn the primary interests of the group. But we are likely to miss important details, for example that the group in fact contains two subgroups of users: one that likes sports and fashion, and a second that is primarily interested in politics and cinema.
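
A minimal sketch of this averaging (assuming a dict emb of node embeddings):

import numpy as np

def average_embedding(emb, nodes):
    # Represent the (sub)graph by the mean of its node embeddings.
    return np.mean([emb[u] for u in nodes], axis=0)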

A proposed solution works by creating a dummy node for each subgraph of interest and connecting all nodes in the subgraph to it. When training embeddings, the dummy node captures to some degree the structure of the subgraph. This is essentially again an averaging method, but one that gives more weight to neighborhood nodes that are more likely to be visited by a random walk. A drawback of this approach is that we need to decide, before training the embeddings, which subgraphs are of interest.
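
A sketch of the dummy node construction (my own, assuming a networkx graph; the choice of dummy_id is up to the user):

def add_dummy_node(G, subgraph_nodes, dummy_id):
    # Connect a dedicated dummy node to every node of the subgraph of interest,
    # so that during embedding training it co-occurs with the subgraph's nodes and summarizes them.
    G.add_node(dummy_id)
    for v in subgraph_nodes:
        G.add_edge(dummy_id, v)
    return G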

In the next blog post in this series I will present graph kernels, a powerful yet intuitive approach to machine learning on graphs that builds upon kernel machines like SVMs. There I will also show a simple technique for combining individual node embeddings into graph embeddings. So, stay tuned 🙂

Conclusions.

Finally, let us go back to the beginning of the post and discuss how node embeddings satisfy our requirements for a simple yet powerful machine learning model. Given the publicly available tools for training embeddings, they are easy to implement. These tools are also highly optimized and thus efficient. Random walks fully exploit the graph structure; one can even argue that they are among the most powerful tools for graph analysis as they are naturally biased towards easily reachable nodes. The embedding vectors are not directly interpretable, but they model node similarities that are themselves interpretable and, as discussed above, can be user-defined. So, training node embeddings for your graph and then using them in combination with a well-known ML algorithm is a natural first step when dealing with graph data.


Code.

The code for training node embeddings and using them in a node classification problem can be found here.

The code for the series is here: https://github.com/konstantinkutzkov/ML_on_graphs/


References.

Bryan Perozzi, Rami Al-Rfou, Steven Skiena: DeepWalk: Online Learning of Social Representations. KDD 2014.

Aditya Grover, Jure Leskovec: node2vec: Scalable Feature Learning for Networks. KDD 2016.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, Richard Zemel: Gated Graph Sequence Neural Networks. ICLR 2016.

