
Graph data structures have proven effective at capturing complex non-Euclidean data such as biological and social networks (Euclidean data is something like images, text or plain numerical data).
We have been successful in building predictive models that take graphs as input, and in applying various deep learning techniques to graphs, which in some cases has enhanced the performance of graph machine learning models.
So what can we predict using graphs and how can we build predictive models from such a complex data structure?
In this article, I will explain the formulation of a few graph-related tasks, which are also termed graph downstream tasks. These tasks are usually formulated after we learn the numerical representations of the graph entities.
Graphs and their different types

We use graphs to model real-life interactions or networks and depending upon the context, we can use different types of graph structures such as:
- Undirected graph: Like a social network of friends, where nodes represent individuals and edges represent their friendships. Here, the edges have no direction.
- Directed graph: Here, the edges have a direction. For example, in the network of Instagram followers, the nodes are individual accounts and the edges are follows. I follow Elon Musk, but Elon does not follow me back; the direction of the relationship is thus a property of the edge.
- Single-relational graph: All nodes and edges here represent one type of property, which can be friendship, interaction etc., and the edges can be directed or undirected in nature.
- Multi-relational graph: Here, the nodes and edges can have different properties. A simple example would be a social network that distinguishes different types of friends, like school friends, university friends, etc. A Knowledge Graph (which is very popular these days) is a great example of a multi-relational graph.
Note: This is not an exhaustive list.
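To make the distinctions concrete, here is a minimal sketch of how these graph types might be represented with plain Python data structures. The node names and relation types are made up for illustration:

```python
# Undirected graph: adjacency sets, each friendship stored in both directions.
undirected = {"alice": {"bob"}, "bob": {"alice"}}

# Directed graph: an edge only goes one way (alice follows elon, not vice versa).
directed = {"alice": {"elon"}, "elon": set()}

# Multi-relational graph: each edge carries a relation type,
# stored as a (head, relation, tail) triple.
multi_relational = {
    ("alice", "school_friend", "bob"),
    ("alice", "university_friend", "carol"),
}

# Friendship is symmetric in the undirected graph...
assert "alice" in undirected["bob"]
# ...but following is not symmetric in the directed graph.
assert "alice" not in directed["elon"]
```

Real projects would typically use a graph library (e.g. networkx) rather than raw dictionaries, but the underlying distinctions are the same.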
Usually, the first step in graph machine learning is to capture the network information as arrays of numbers called low-dimensional embeddings, or simply the numerical representations of graph entities.
Now, let’s talk about the graph downstream tasks.
Node Classification

Nodes in a graph can carry labels, and with huge graphs it is often the case that some labels are missing. Node classification is formulated as a binary classification task (in the case of 2 different types of labels) or a multi-class classification task (more than 2 labels).
Here, the input features to the predictive machine learning model are the node embeddings, or simpler numerical representations derived from graph properties such as node degree or graph colouring number.
For this classification task, you can use standard machine learning algorithms or neural network methods like Graph Neural Networks.
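As a toy sketch of the standard machine learning route (the graph, the labels and the choice of scikit-learn's logistic regression are all illustrative assumptions, not a prescribed method), we can use each node's adjacency row as a crude stand-in for its embedding and train on the labelled nodes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy graph: two 4-node cliques {0,1,2,3} and {4,5,6,7} joined by edge (3, 4).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7), (3, 4)]
A = np.zeros((8, 8))
for u, v in edges:
    A[u, v] = A[v, u] = 1

# Adjacency rows as node features; in practice these would be learned
# embeddings (e.g. from node2vec or a GNN).
X = A

# Labels are known only for nodes 0, 1 (class 0) and 4, 5 (class 1).
train_nodes, train_labels = [0, 1, 4, 5], [0, 0, 1, 1]

clf = LogisticRegression().fit(X[train_nodes], train_labels)
pred = clf.predict(X[[2, 3, 6, 7]])  # classify the unlabelled nodes
print(pred)  # nodes 2, 3 fall in class 0; nodes 6, 7 in class 1
```

A Graph Neural Network would instead learn the node representations and the classifier end-to-end, but the task formulation is the same.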
Link Prediction
This is one of the most important tasks in graph machine learning. Link prediction is the task of predicting missing edges or links in a graph, and also of predicting prospective edges (in the case of a dynamic graph, where edges form and disappear over time).
In the context of biological networks like protein-protein interactions, it is difficult to identify all possible interactions using standard lab experiments, so link prediction becomes crucial in this domain.
Depending upon the type of graph, link prediction can be formulated in different ways:
- As a binary classification task: For binary classification we need two labels, for example positive and negative. The edges present in the graph become the positive instances, and we randomly sample node pairs without an edge to create the negative instances. Since an edge consists of two entities (two nodes), we take the element-wise product of the corresponding node embeddings to get one feature vector for every instance, positive or negative. These features and labels are then used to build a predictive model. Graph neural networks can also be used to tackle link prediction as a binary classification problem.
- Scoring-based method: The idea is simple. We learn the embeddings using a scoring function such that pairs connected by an edge score higher than pairs that are not. Given a threshold on the score, we can assess the possibility of a prospective edge between two given nodes. In knowledge graph link prediction, rank-based evaluation methods are commonly employed, and the triples can be modelled as head prediction (?, r, t) and tail prediction (h, r, ?) problems. It all depends on the use case being built on top of the link prediction task.
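The binary classification formulation can be sketched end to end. Below, adjacency rows again stand in for learned node embeddings, the negative pairs are hand-picked rather than randomly sampled, and scikit-learn's logistic regression is an assumed model choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy graph: two 5-node cliques {0..4} and {5..9} joined by edge (4, 5).
n = 10
A = np.zeros((n, n))
for group in ({0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}):
    for u in group:
        for v in group:
            if u != v:
                A[u, v] = 1
A[4, 5] = A[5, 4] = 1

emb = A  # adjacency rows as stand-in node embeddings

def edge_feature(u, v):
    # Element-wise (Hadamard) product gives one feature vector per node pair.
    return emb[u] * emb[v]

# Positive instances: existing edges, holding out (0, 1) for testing.
positives = [(u, v) for u in range(n) for v in range(u + 1, n)
             if A[u, v] and (u, v) != (0, 1)]
# Negative instances: node pairs with no edge between them.
negatives = [(0, 7), (1, 8), (2, 9), (3, 6), (0, 9), (2, 7)]

X = np.array([edge_feature(u, v) for u, v in positives + negatives])
y = [1] * len(positives) + [0] * len(negatives)

clf = LogisticRegression().fit(X, y)
print(clf.predict([edge_feature(0, 1)]))  # held-out true edge: predicted 1
print(clf.predict([edge_feature(1, 7)]))  # cross-clique non-edge: predicted 0
```

The held-out intra-clique pair shares many common neighbours, so its Hadamard feature is dense; the cross-clique pair shares none, which is what the classifier picks up on.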
Community Detection / Graph Clustering

This is an unsupervised clustering task where the objective is to find groups of entities that share a similar property. Several algorithms exist to find such clusters, and machine learning techniques can be used for this purpose as well.
In biological protein-protein networks, this clustering is also known as complex detection: we identify protein complexes, which are important pieces of biological machinery.
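One of many possible approaches is spectral clustering on the adjacency matrix; the sketch below uses scikit-learn for this (an assumed choice; dedicated community detection algorithms such as Louvain or label propagation are also common):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Toy network: two 4-node cliques {0,1,2,3} and {4,5,6,7}
# joined by a single edge (3, 4) -- two obvious communities.
A = np.zeros((8, 8))
for group in ({0, 1, 2, 3}, {4, 5, 6, 7}):
    for u in group:
        for v in group:
            if u != v:
                A[u, v] = 1
A[3, 4] = A[4, 3] = 1

# Treat the adjacency matrix as a precomputed affinity matrix.
labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(A)

print(labels)  # the two cliques end up in different clusters
```

Note that, unlike the supervised tasks above, no labels are provided here; the structure alone determines the grouping.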
Graph Classification
A bigger graph can contain smaller graph clusters belonging to certain classes. Graph classification is the task of predicting the label or class of an entire graph, such as a smaller cluster or community within the bigger graph.
One example dataset for this kind of task is the PROTEINS dataset from TU Dortmund, a collection of graphs where each graph has a label indicating whether or not it is an enzyme. In total there are 1,113 graphs, each with on average 39 nodes and 73 edges. Since there are 2 labels, this is a binary classification task.
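As a minimal sketch of the idea (not how PROTEINS is actually tackled, where richer features or GNN pooling would be used), each graph can be summarised by graph-level features and fed to a standard classifier. Here the toy dataset and the single edge-density feature are illustrative assumptions:

```python
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def density(n_nodes, edges):
    # Fraction of possible node pairs that are actually connected.
    return len(edges) / (n_nodes * (n_nodes - 1) / 2)

def clique(n):
    return n, list(combinations(range(n), 2))

def path(n):
    return n, [(i, i + 1) for i in range(n - 1)]

# Toy dataset: dense clique graphs are class 1, sparse path graphs class 0.
train_graphs = [clique(4), clique(5), path(4), path(5)]
train_labels = [1, 1, 0, 0]

X = [[density(n, e)] for n, e in train_graphs]
clf = LogisticRegression().fit(X, train_labels)

# Classify two unseen graphs from their graph-level feature.
pred = clf.predict([[density(*clique(3))], [density(*path(6))]])
print(pred)  # the clique is labelled 1, the path 0
```

The point is that the unit of prediction is a whole graph, not a node or an edge within it.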
Triple Classification

In the context of knowledge graphs, facts are represented as triples (head, relation, tail). Triple classification boils down to a binary classification task with two labels, positive and negative: positive triples are the ones that are part of the knowledge graph, and negative triples are the ones that are not.
Learning good-quality embeddings for the graph entities is necessary for graph machine learning in general, and this evidently also holds for the triple classification task. The embeddings we generate should score positive triples higher than negative ones; we then define a threshold to separate the two.
As a knowledge graph keeps growing with newly added triples, bad-quality data can be injected into the database. A corrupted triple that doesn't belong in the knowledge graph may be added, and one way of identifying such bad-quality data is triple classification.
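The score-and-threshold idea can be illustrated with a TransE-style scoring function, where a triple scores highly when head + relation lands close to tail. The tiny hand-set embeddings and the threshold below are toy assumptions; in practice the embeddings are learned:

```python
import numpy as np

# Hand-set 2-d embeddings for a toy knowledge graph (illustrative only;
# in practice these would be learned, e.g. with TransE).
entity = {
    "alice": np.array([0.0, 0.0]),
    "bob":   np.array([0.0, 1.0]),
    "acme":  np.array([1.0, 0.0]),
}
relation = {"works_at": np.array([1.0, 0.0])}

def score(h, r, t):
    # TransE-style score: close to 0 when head + relation ≈ tail,
    # increasingly negative otherwise.
    return -np.linalg.norm(entity[h] + relation[r] - entity[t])

THRESHOLD = -0.5  # triples scoring above this are classified as positive

print(score("alice", "works_at", "acme") > THRESHOLD)  # True: a known fact
print(score("bob", "works_at", "acme") > THRESHOLD)    # False: corrupted triple
```

A corrupted triple injected into the database would fall below the threshold and could be flagged for review.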
Which graph downstream tasks you use depends on the domain and the use cases being built to solve that domain's problems. Remember that graphs are excellent tools for modelling complex real-life scenarios in domains like space, biology, telecommunications and the entertainment industry (movie and song recommendations). Hundreds of problems could be formulated in each domain and potentially solved with graph machine learning.
If you have reached this part of the post, thank you for reading and for your attention. I hope you found the post informative. You can reach me on LinkedIn, Twitter or GitHub.