Graph Neural Networks with PyG on Node Classification, Link Prediction, and Anomaly Detection

Pytorch Geometric Implementations on major graph problems

Tomonori Masui
Towards Data Science

--

Photo by DeepMind on Unsplash

Graph Neural Networks is a machine learning algorithm designed for graph-structured data such as social graphs, networks in cybersecurity, or molecular representations. It has evolved rapidly over the last few years and is used in many different applications. In this blog post, we will review its code implementations on major graph problems along with all the basics of GNN including its applications and algorithm details.

Applications of Graph Neural Networks

GNN can be used to solve a variety of graph-related machine learning problems:

  • Node Classification
    Predicting the classes or labels of nodes. For example, detecting fraudulent entities in the network in cybersecurity can be a node classification problem.
  • Link Prediction
    Predicting if there are potential linkages (edges) between nodes. For example, a social networking service suggests possible friend connections based on network data.
  • Graph Classification
    Classifying a graph itself into different categories. An example is determining if a chemical compound is toxic or non-toxic by looking at its graph structure.
  • Community Detection
    Partitioning nodes into clusters. An example is finding different communities in a social graph.
  • Anomaly Detection
    Finding outlier nodes in a graph in an unsupervised manner. This approach can be used if you don’t have labels on your target.
GNN Applications (Image by author)

In this blog post, we will review code implementations on node classification, link prediction, and anomaly detection.

Graph Convolution — Intuition

Graph Neural Networks evolved rapidly over the last few years and many variants of it have been invented (you can see this survey for more details). In those GNN variations, Graph Convolutional Networks might be the most popular and basic algorithm. In this section, we will review the high-level introduction to its algorithm.

Graph Convolution is an effective way to extract/summarize node information based on a graph structure. It is a variant of the convolution operation from Convolutional Neural Networks which is typically used for image problems.

In images, pixels are ordered structurally in a grid, and the filters (weight matrices) in the convolution operation slide over the image with a pre-determined stride size. The neighbors of a pixel are determined by the filter size (in the below image, the filter size is 3 x 3 and the eight gray pixels within the blue filter are the neighbors), and weighted pixel values within the filter are aggregated into a single value. The output from this convolution operation has a smaller size than the input image but has a higher level view of the input which is going to be useful for making predictions on image problems such as image classifications.

Image by author

In graphs, nodes are ordered in a nonstructural manner and the neighborhood size differs across nodes. Graph convolution takes the average of the node features of a given node (the red node in the below image) and its neighbors (the gray nodes within the blue circle) to calculate updated node representation values of the node. Through this convolution operation, the node representation captures localized graph information.

Image by author

The image below shows more details on the graph convolution operation. Node features from neighborhood nodes (blues) along with those of the target node (red) are averaged. And then it is multiplied with a weight vector (W) and its output updates the node features of the target node (the updated node values are also called node embeddings).

Graph Convolution Operation (Image by author)

For those who are interested, the node features are normalized using the inverse of the degree matrix and then aggregated in the original paper instead of simple averaging (equation (8) in the paper).

One thing to note in this convolution operation is that the number of graph convolutions determines how many steps away node features are aggregated into each node. In the image below, the first convolution aggregates the blue nodes’ features into the orange node and the second convolution can incorporate the green nodes’ features into the orange node.

The number of convolutions determines how far node features you aggregate (Image by author)

Cora — Graph Benchmark Dataset

In the next few sections, we will review GCN code implementations. Before we dive into them, let us get familiar with the dataset we are going to use. The Cora dataset is a paper citation network data that consists of 2,708 scientific publications. Each node in the graph represents each publication and a pair of nodes is connected with an edge if one paper cites the other.

Through this article, we are using PyG (Pytorch Geometric) to implement GCN which is one of the popular GNN libraries. The Cora dataset can also be loaded using PyG module:

The Cora dataset sourced from Pytorch Geometric is originally from the “Automating the Construction of Internet Portals with Machine Learning” paper.

The node features and the edge information look like below. The node features are 1433 word vectors indicating the absence (0) or the presence (1) of the words in each publication. The edges are represented in adjacency lists.

Node Features and Edge Lists (Image by author)

Each node has one of seven classes which is going to be our model target/label.

Class Distribution (Image by author)

The graph data can be visualized using NetworkX library. The node colors represent the node classes.

Visualization of Cora Dataset (Image by author)

Node Classification

For the node classification problem, we are splitting the nodes into train, valid, and test using the RandomNodeSplit module from PyG (we are replacing the original split masks in the data as it has a too small train set).

Please note the data splits are written into mask attributes in the graph object (see the image below) instead of splitting the graph itself. Those masks are used for training loss calculation and model evaluation, whereas graph convolutions use entire graph data.

Graph object has mask attributes (Image by author)

Baseline Model (MLP) on Node Classification

Before we build GCN, we are training MLP (multi-layer perceptron, i.e. feed-forward neural nets) only using node features to set a baseline performance. The model ignores the node connections (or the graph structure) and tries to classify the node labels only using the word vectors. The model class looks like below. It has two hidden layers (Linear) with ReLU activations followed by an output layer.

We are defining training and evaluation functions with a normal Pytorch train/eval setup.

The test accuracy from this model is 73.2%.

GCN on Node Classification

Next, we are training GCN and comparing its performance to MLP. We are using a very simple model having two graph convolution layers and ReLU activation between them. This setup is the same as the original GCN paper (equation 9).

GCN Node Classification Model Architecture (Image by author)

The test-set accuracy from this model is 88.0%. We achieved around 15% accuracy improvement from MLP.

Link Prediction

Link prediction is trickier than node classification as we need some tweaks to make predictions on edges using node embeddings. The prediction steps are described below:

  1. An encoder creates node embeddings by processing the graph with two convolution layers.
  2. We randomly add negative links to the original graph. This makes the model task binary classifications with the positive links from the original edges and the negative links from the added edges.
  3. A decoder makes link predictions (i.e. binary classifications) on all the edges including the negative links using node embeddings. It calculates a dot product of the node embeddings from pair of nodes on each edge. Then, it aggregates the values across the embedding dimension and creates a single value on every edge that represents the probability of edge existence.
Link Prediction Model Architecture (Image by author)

This model structure is from the original link prediction implementation in Variational Graph Auto-Encoders. The code looks like something below. This is adapted from the code example in PyG repo which is based on the Graph Auto-Encoders implementation.

For this link prediction task, we want to randomly split links/edges into train, valid, and test data. We can use the RandomLinkSplit module from PyG to do that.

The output data looks like something below.

RandomLinkSplit Output (image by author)

There are several things to note about this output data.

First, the split is performed on edge_index such that the training and the validation splits do not include the edges from the validation and the test split (i.e. only have the edges from the training split), and the test split does not include the edges from the test split. This is because edge_index (and x) is used for the encoder to create node embeddings, and this setup ensures that there are no target leaks on the node embeddings when it makes predictions on the validation/test data.

Second, two new attributes (edge_label and edge_label_index) are added to each split data. They are the edge labels and the edge indices corresponding to each split. edge_label_index will be used for the decoder to make predictions and edge_label will be used for model evaluation.

Third, negative links are added to both val_data and test_data with the same number as the positive links (neg_sampling_ratio=1.0). They are added to edge_label and edge_label_index attributes, but not added to edge_index as we do not want to use the negative links on the encoder (or node embedding creation). And also, we are not adding negative links to the training set here (by settingadd_negative_train_samples=False) as we will add them during the training loop in train_link_predictor above. This randomization during training is supposed to make the model more robust.

The image below summarizes how this edge split is performed for the encoder and the decoder (the colored edges are used in each stage).

Summary of Edge Split (image by author)

We can now train and evaluate the model with the following code.

The test AUC of this model is 92.5%.

Anomaly Detection

We are again using Cora dataset for the anomaly detection task, but it is slightly different from the previous one: the one with outliers being synthetically injected. The dataset has two different types of outliers (the outlier definition is from this paper):

  • Structural Outlier
    Densely connected nodes in contrast to sparsely connected regular nodes
  • Contextual Outlier
    Nodes whose attributes are significantly different from their neighboring nodes
Node Outlier Types (Source: https://arxiv.org/pdf/2206.10071.pdf)

For this anomaly detection task, we are using PyGOD library which is a graph outlier detection library built on top of PyG. We can load the outlier-injected Cora dataset via the PyGOD module.

The code below shows the outlier type distribution.

Output: Counter({0: 2570, 1: 68, 2: 68, 3: 2})

If you are interested in how these outliers are injected, you can look at the PyGOD documentation on the outlier generator modules which explains the operation details. Please note that the labels y will only be used for model evaluation and not be used for training labels as we are training an unsupervised model.

To detect those outliers, we are training DOMINANT (Deep Anomaly Detection on Attributed Networks) model from this paper. It is an auto-encoder network with graph convolution layers and its reconstruction errors are going to be the node anomaly scores. The model follows the steps below to make predictions.

  1. The attributed network encoder processes the input graph with three graph convolution layers that create node embeddings.
  2. The structure reconstruction decoder reconstructs the original graph edges (i.e. adjacency matrix) using the learned node embeddings. It calculates a dot product of node embeddings from every possible pair of nodes that creates a probability score on each node pair indicating edge existence.
  3. The attribute reconstruction decoder reconstructs the original node attributes using the obtained node embeddings. It has a graph convolution layer to predict the attribute values.
  4. In the last step, the reconstruction errors from the above two decoders are combined by weighted average on every node and the combined errors are going to be the final errors/losses. These final errors are also the abnormality scores of the nodes.
DOMINANT Model Architecture (Source: Deep Anomaly Detection on Attributed Networks)

The DOMINANT model can be easily implemented with PyGOD as you can see below.

The AUC from this model is 84.1%, whereas its average precision is 20.8%. This difference is most likely due to the target imbalance. As this is an unsupervised model, we might not be able to expect a very accurate model, but you can see in the original paper that it still outperforms any other popular anomaly detection algorithms.

That is it for this article!

If you are interested, the whole code is available in the Google Colab and the GitHub repo below.

References

--

--