
A Beginner’s Guide to Graph Neural Networks Using PyTorch Geometric – Part 1

Getting started with PyTorch Geometric

You have stumbled upon graph neural networks somehow and now you’re interested in using them to solve a problem. Let’s explore how we can model a problem using the different tools at our disposal.

Photo by clark cruz from Pexels

First, we need a problem to solve.

Getting started.

Let’s pick a simple graph dataset like Zachary’s Karate Club. Here, the nodes represent the 34 members of the club and the links represent 78 interactions between pairs of members outside the club. Each node carries one of two labels, corresponding to the two factions the club split into. We can use this information to formulate a node classification task.

We divide the nodes of the graph into train and test sets, use the train set to build a graph neural network model, and then use the model to predict the held-out labels of the nodes in the test set.

Here, we use the PyTorch Geometric (PyG) Python library to model the graph neural network. Alternatively, the Deep Graph Library (DGL) can be used for the same purpose.

PyTorch Geometric is a geometric deep learning library built on top of PyTorch. Several popular graph neural network methods have been implemented using PyG, and you can play around with the code using built-in datasets or create your own dataset. PyG provides a nifty InMemoryDataset class which can be used to create a custom dataset (_Note: InMemoryDataset should only be used for datasets small enough to fit in memory_).
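For example, the karate club graph even ships with PyG as a built-in dataset, which is handy for a quick look before we build our own version from scratch. The snippet below is a minimal sketch, assuming a recent PyG installation:

```python
# The karate club graph is available as a built-in PyG dataset.
from torch_geometric.datasets import KarateClub

dataset = KarateClub()
data = dataset[0]   # a single Data object holding the whole graph
print(data)         # shows node features, edge_index and labels
```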

A simple visualization of Zachary’s Karate Club graph dataset looks as follows:

Visualization of Zachary’s Karate Club graph data (Source: me)

Formulate the problem.

In order to formulate the problem, we need:

  1. The graph itself and the labels for each node
  2. The edge data in coordinate (COO) format, as illustrated right after this list
  3. Embeddings or numerical representations for the nodes
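
For instance, a tiny undirected graph with edges (0, 1) and (1, 2) looks like this in COO format (each undirected edge is stored twice, once per direction):

```python
import torch

# COO format: a [2, num_edges] tensor where row 0 holds source nodes
# and row 1 holds target nodes.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
```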

Note: For the numerical representation of the nodes, we can use graph properties like the degree, or use embedding generation methods such as node2vec, DeepWalk, etc. In this example, I use the node degree as the numerical representation.

Let’s get into the coding part.

Preparations.

The karate club dataset can be loaded directly from the NetworkX library. We retrieve the labels from the graph and create an edge index in COO format. The node degree is used as the embedding/numerical representation for each node (in the case of a directed graph, the in-degree can be used for the same purpose). Since degree values tend to vary widely, we normalize them before using them as input to the GNN model.
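A sketch of this preparation step could look as follows (assuming NetworkX and PyTorch are installed; here the degrees are normalized by simple max scaling, but any standard scaler works just as well):

```python
import networkx as nx
import numpy as np
import torch

G = nx.karate_club_graph()

# Node labels: 0 for the 'Mr. Hi' faction, 1 for the 'Officer' faction
y = torch.tensor([0 if G.nodes[i]['club'] == 'Mr. Hi' else 1 for i in G.nodes],
                 dtype=torch.long)

# Edge index in COO format; both directions are added since the graph is undirected
edges = list(G.edges())
src = [u for u, v in edges] + [v for u, v in edges]
dst = [v for u, v in edges] + [u for u, v in edges]
edge_index = torch.tensor([src, dst], dtype=torch.long)   # shape [2, 156]

# Node degree as the (single) input feature, normalized to [0, 1]
degrees = np.array([G.degree[i] for i in G.nodes], dtype=np.float32)
x = torch.tensor(degrees / degrees.max()).unsqueeze(1)    # shape [34, 1]
```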

With this, we have prepared all the necessary parts to construct the PyTorch Geometric custom dataset.

The Custom Dataset.

The KarateDataset class inherits from the InMemoryDataset class and uses a Data object to collate all the information relating to the karate club dataset. The nodes are then split into train and test sets, and the corresponding train and test masks are created from the split.
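A minimal sketch of such a class, assuming the x, y and edge_index tensors from the preparation step above (the 70/30 split and the random seed are illustrative choices, not necessarily the ones behind the results reported later):

```python
import torch
from sklearn.model_selection import train_test_split
from torch_geometric.data import Data, InMemoryDataset

class KarateDataset(InMemoryDataset):
    def __init__(self, transform=None):
        super().__init__('.', transform, None, None)

        data = Data(x=x, edge_index=edge_index, y=y)
        data.num_classes = torch.tensor([2])   # the two factions

        # Split the node indices and build boolean train/test masks
        train_idx, test_idx = train_test_split(
            list(range(data.num_nodes)), test_size=0.3, random_state=42)
        data.train_mask = torch.zeros(data.num_nodes, dtype=torch.bool)
        data.test_mask = torch.zeros(data.num_nodes, dtype=torch.bool)
        data.train_mask[train_idx] = True
        data.test_mask[test_idx] = True

        self.data, self.slices = self.collate([data])

    # Nothing to download or process: the graph is built in memory
    def _download(self):
        return

    def _process(self):
        return

dataset = KarateDataset()
data = dataset[0]
```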

The data object contains the following variables:

Data(edge_index=[2, 156], num_classes=[1], test_mask=[34], train_mask=[34], x=[34, 1], y=[34])

This custom dataset can now be used with several graph neural network models from the PyTorch Geometric library. Let’s pick a Graph Convolutional Network (GCN) model and use it to predict the held-out labels on the test set.

Note: The PyG library focuses more on node classification tasks, but it can also be used for link prediction.

Graph Convolutional Network.

The GCN model is built with two hidden layers, each containing 16 neurons. Let’s train the model!
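A sketch of such a model, assuming the data object from the custom dataset above and the common two-GCNConv-layer layout with 16 hidden channels (the author’s exact code lives in the repository linked at the end of the post):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(data.num_features, 16)
        self.conv2 = GCNConv(16, int(data.num_classes))

    def forward(self):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, training=self.training)   # dropout only during training
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model, data = Net().to(device), data.to(device)
```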

Train the GCN model.
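
A sketch of the training and evaluation loop, assuming the model and data objects defined above (the optimizer settings and 200 epochs are reasonable defaults, not tuned values):

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

def train():
    model.train()
    optimizer.zero_grad()
    out = model()
    # Only the training nodes contribute to the loss (transductive setting)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

@torch.no_grad()
def test():
    model.eval()
    pred = model().argmax(dim=1)
    return [(pred[mask] == data.y[mask]).float().mean().item()
            for mask in (data.train_mask, data.test_mask)]

for epoch in range(200):
    train()

train_acc, test_acc = test()
print(f'Train Accuracy: {train_acc:.3f}  Test Accuracy: {test_acc:.3f}')
```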

Initial experiments with random hyperparameters gave these results:

Train Accuracy: 0.913 Test Accuracy: 0.727

This is not impressive and we can certainly do better. In my next post, I will discuss how we can use Optuna (a Python library for hyperparameter tuning) to tune the hyperparameters easily and find the best model. The code used in this example was adapted from PyTorch Geometric’s GitHub repository with some modifications (link).

A Summary.

To summarize everything we have done so far:

  1. Generate numerical representations for each node in the graph (node degree in this case).
  2. Construct a PyG custom dataset and split data into train and test.
  3. Use a GNN model like GCN and train the model.
  4. Make predictions on the test set and calculate the accuracy score.

Acknowledgement: Most of the explanations in this post come from concepts I learned and applied during my internship at Orange Labs in Cesson-Sévigné, where I worked with graph embedding methods and graph neural networks applied to knowledge graphs.

In the coming parts, I will explain how we can use graph embeddings as initial node representations for the same node classification task.


Thanks for reading and cheers!

Want to Connect?
Reach me at LinkedIn, Twitter, GitHub or just Buy Me A Coffee!
