
Machine Learning with Graphs: A Development Workflow Overview

A conceptual overview of where machine learning tasks with graphs take place in the ML life cycle.



Machine learning with graphs supports prediction models in much the same way that well-known unsupervised and semi-supervised techniques are combined with supervised models. In general, there are two ways to deploy machine learning with graphs in the ML workflow. The first is to create a so-called node embedding and pass it into a downstream machine learning task. The second is to perform label and link predictions directly on the graph data structure. Earlier, I wrote an introduction to machine learning with graphs and the tasks it includes. This article complements that post with a concise overview of how these tasks fit into the ML workflow.

Machine learning development workflow

A typical machine learning development workflow consists of the phases shown in the blue boxes of the figure below:

A supervised machine learning workflow enhanced with graph learning techniques during feature engineering and model training. Image by the author.

The machine learning development workflow starts with data retrieval, where (mostly) unstructured raw data is acquired from its sources. In the subsequent data preparation step, the goal is to transform the data into a structured format so that it can be used for model training. After the model is trained, it can be evaluated using various performance and validation methods. During the data preparation phase, a key task called feature engineering has to be done: "Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data." Thus, data scientists have to engineer and decide which features are important for predicting the target label. Proper business domain knowledge of the use case is therefore a prerequisite for the feature engineering task.
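To make these phases concrete, here is a minimal sketch of such a workflow in scikit-learn. The CSV file and column names are hypothetical placeholders, not from the original article:

```python
# A minimal sketch of the workflow phases using scikit-learn.
# The CSV path and column names are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data retrieval: load raw records from a source.
df = pd.read_csv("customers.csv")

# Data preparation / feature engineering: hand-pick and transform
# the columns believed to be predictive of the target label.
features = df[["age", "n_purchases", "days_since_last_visit"]]
labels = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# Model training.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Model evaluation.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

The hand-picked column list in the feature engineering step is exactly the part that the graph-based approach described below replaces.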

Feature engineering with graph data

This is where feature engineering with graph-structured data can help. The underlying assumption is that data points do not exist in isolation; they interact with other records in the data set. Instead of relying on the creativity and domain knowledge of the data scientist to come up with the important features, one can train a machine learning model during this phase that generates features based on the relational properties between the data points (records) in the dataset. These features are called node embeddings, where every node is a data point. Each node embedding is thus a representation of the structural relationships that a data point has with other data points in the data set: the interactions between a node and its neighborhood are captured in the resulting embedding.

There is a wide range of methods to create node embeddings. Many of them aggregate the properties of the local neighborhood nodes and edges into the embedding using a graph neural network (GNN) or a deep learning variant of it such as the graph convolutional network (GCN). In essence, the GNN encodes a node's properties and its relationships with other nodes into a vector in a latent space. These graph learning techniques are a topic I will address in a separate post. The result of this phase is a node embedding that is passed as a feature to a downstream classification or regression task.

Left: In traditional feature engineering, the data scientist hand-picks the features they think are relevant for prediction during feature selection. Right: Feature engineering with graphs has no feature selection step; instead, a graph neural network is trained to determine which node properties, relationships, and graph statistics make up the feature vector, in the form of a node embedding. Image by the author.
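As an illustration of the right-hand path, below is a minimal sketch that generates node embeddings with the random-walk-based node2vec method, using the open-source `node2vec` and `networkx` packages. This is one of the many embedding methods mentioned above (a shallow, random-walk approach rather than a trained GNN encoder), and the toy graph is a stand-in for real records and their relationships:

```python
# A minimal sketch of generating node embeddings with the random-walk
# based node2vec method, using the `node2vec` and `networkx` packages.
# The toy graph stands in for real records and their relationships.
import networkx as nx
from node2vec import Node2Vec

G = nx.karate_club_graph()

# Sample random walks over the graph and fit a skip-gram model on them.
n2v = Node2Vec(G, dimensions=32, walk_length=20, num_walks=100, workers=1)
model = n2v.fit(window=5, min_count=1)

# One embedding vector per node; the library stores node IDs as strings.
embeddings = {node: model.wv[str(node)] for node in G.nodes()}
print(len(embeddings[0]))  # 32-dimensional feature vector for node 0
```

The resulting vectors can then be fed into any downstream classifier, exactly like hand-engineered features.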

Training and predicting on the graph

In the approach above, feature engineering results in a node embedding that is then used as the input feature vector for another downstream machine learning model. However, it is also possible to predict node labels or links between nodes directly from the graph-structured data; these tasks are called node classification and link prediction. The key here is that, instead of hand-picking which relationship structures lead to better predictions, the model is trained to find the most important structural patterns and node properties to assign labels and predict relationships.

Training a graph neural network on node embeddings to predict node labels or edges. The input is a node embedding together with a one-hot encoded vector of the node label; the output is a predicted node label. Image by the author.

In general, node classification is achieved by training a graph neural network in a fully supervised manner, where a loss function is defined on the node embedding (created in the feature engineering step described above) and a one-hot vector that indicates the class to which the node belongs (the label). The model is thus trained on the node embeddings, and the output of the graph neural network is a label for each node. Because the nodes are labeled directly, there is no need to pass the node embedding as a feature vector into a downstream task for classification or regression.
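For illustration, here is a minimal sketch of fully supervised node classification with a two-layer GCN in PyTorch Geometric, trained with a cross-entropy loss against the node labels on the standard Cora benchmark. Note that this common variant learns from the raw node features rather than a precomputed embedding; it is an assumed setup, not the author's code:

```python
# A minimal sketch of supervised node classification with a two-layer
# GCN in PyTorch Geometric, trained on the Cora citation dataset.
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

data = Planetoid(root="/tmp/Cora", name="Cora")[0]

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, n_classes)

    def forward(self, x, edge_index):
        # Each layer aggregates features from a node's local neighborhood.
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

model = GCN(data.num_node_features, 16, int(data.y.max()) + 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    # Cross-entropy between predicted class scores and the node labels.
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# The trained model labels every node directly; no downstream classifier needed.
pred = model(data.x, data.edge_index).argmax(dim=1)
```

The last line makes the point of this section explicit: the label prediction happens on the graph itself, with no separate downstream model.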

Conclusion

First, machine learning with graphs can replace the feature selection task during the feature engineering phase. Where the traditional machine learning workflow relies on the data scientist's insight to select features, ML with graphs trains a graph neural network to output a feature vector, called a node embedding, for every node. This node embedding can then be passed into any downstream classifier. Second, it is also possible to make predictions directly on the graph data structure. In this case, a node embedding is passed into a graph neural network to predict its label.

Sources

Graph Representation Learning – https://www.cs.mcgill.ca/~wlh/grl_book/

Representation Learning on Graphs: Methods and Applications – https://arxiv.org/abs/1709.05584

Discover Feature Engineering – https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/

