fastgraphml: A Low-code framework to accelerate the Graph Machine Learning model development process

Democratizing Graph Machine Learning with Low-code

Sachin Sharma
Towards Data Science


fastgraphml (Image by Author)

Graph Machine Learning (ML) is a rapidly growing area of machine learning. It has attracted a large number of people from domains ranging from social science to chemistry, biology, physics, and e-commerce. The reason is simple: graph-structured data is ubiquitous. Today, many real-world applications rely on Graph ML to run their prediction services. For example, Uber Eats leverages Graph ML to suggest to its users the dishes, restaurants, and cuisines they might like next, Pinterest uses Graph ML to make visual recommendations, and Google DeepMind exploits Graph ML to make traffic predictions. Given the rise of this fascinating field, we are proud to release the fastgraphml package (built on top of PyG), which helps users build Graph ML models with just 3 lines of code. The first release focuses on providing Graph Embeddings (or node embeddings), since they act as a foundation for all Graph ML problems (node classification, link prediction, graph classification). In addition, the framework uses ArangoDB (the next-generation graph data and analytics platform) as a backend from which graphs can be exported directly into the fastgraphml package. In this blog post, we will explore the following:

  • Why do we need graph embeddings?
  • What are graph embeddings and how do they work?
  • What are Homogeneous and Heterogeneous graph embeddings?
  • What are the applications of graph embeddings?
  • How to get started with graph embeddings using fastgraphml?
  • Conclusion

Why do we need Graph Embeddings?

Once a graph has been created by incorporating meaningful relationships (edges) between all the entities (nodes), the next question that comes to mind is how to integrate information about the graph structure (e.g. a node’s global position in the graph or its local neighborhood structure) into a machine learning model. One way to extract structural information (or features) from the graph is to compute graph statistics such as node degrees, clustering coefficients, PageRank, kernel functions, or hand-engineered features that estimate local neighborhood structures. Once these features are extracted, they can be used as input to ML classifiers such as SVM or Logistic Regression.

Traditional ML on Graphs. Node features are extracted by the feature extractor and then used as input to ML algorithms. (Image by Author)
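To make this traditional pipeline concrete, here is a minimal sketch (using NetworkX and scikit-learn purely for illustration, not fastgraphml) that extracts a few hand-engineered structural features and feeds them to a classical classifier:

import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy graph: Zachary's Karate Club, with the club split as node labels
G = nx.karate_club_graph()
y = np.array([0 if G.nodes[n]["club"] == "Mr. Hi" else 1 for n in G.nodes])

# Hand-engineered structural features: degree, clustering coefficient, PageRank
degree = dict(G.degree())
clustering = nx.clustering(G)
pagerank = nx.pagerank(G)
X = np.array([[degree[n], clustering[n], pagerank[n]] for n in G.nodes])

# The extracted features are then used as input to a classical ML classifier
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))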

However, the above approaches are limited in the sense that they are algorithm-dependent and discover only specific predictive signals present inside graphs. For example, the clustering-coefficient feature extractor measures the degree to which nodes in a graph tend to cluster together (helpful for community detection and link prediction problems). Similarly, PageRank detects the importance of a node in a graph. These methods lack end-to-end learning of graph features, i.e. features cannot be learned with the help of a loss function during the training process. In addition, designing these feature extractors can be a time-consuming and expensive process (as it requires manual feature engineering).

On the other hand, Graph Embeddings (aka Graph Representation Learning) look for a more generalized set of features to represent graphs instead of capturing specific ones. The main difference between hand-tuned feature extractors and graph embeddings is how they handle the challenge of representing graphs for machine learning. The former tackles it as a pre-processing step, while the latter uses a data-driven methodology (part of the machine learning training process) to learn embeddings that summarize graph topology. Thus, learning embeddings via a data-driven approach saves plenty of time compared to trial-and-error manual feature engineering.

What are Graph (Node) Embeddings and how do they work?

One way to extract graph features through an end-to-end process is to adopt representation learning methods that encode the structural information about the graph into a d-dimensional Euclidean space (aka vector space or embedding space). The key idea behind graph representation learning is to learn a mapping function that embeds nodes, or entire (sub)graphs, from the non-Euclidean domain as points in a low-dimensional vector space (the embedding space). The resulting d-dimensional vectors are known as Graph (Node) Embeddings.

Mapping Process in Graph Representation Learning (image credit: Stanford-cs224w)

Working: The aim is to optimize this mapping so that nodes that are nearby in the original network also remain close to each other in the embedding space (vector space), while unconnected nodes are pushed apart. By doing this, the graph embeddings preserve the geometric relationships and semantics of the original network for downstream tasks (e.g. link prediction, node/graph classification, community detection, and node clustering).
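As a purely illustrative example of this optimization (a shallow-embedding sketch, not the GraphSAGE or metapath2vec models used later in this post), the following PyTorch snippet trains a plain embedding table so that connected node pairs get a high dot-product score while randomly sampled pairs get a low one:

import torch
import torch.nn as nn

num_nodes, dim = 34, 16
# toy edge list as a 2 x num_edges tensor of (source, target) pairs
edge_index = torch.tensor([[0, 0, 1, 2, 3], [1, 2, 2, 3, 0]])

emb = nn.Embedding(num_nodes, dim)   # the mapping function: node id -> d-dimensional vector
opt = torch.optim.Adam(emb.parameters(), lr=0.01)

for _ in range(100):
    src, dst = edge_index
    neg = torch.randint(0, num_nodes, dst.shape)      # random nodes act as "unconnected" pairs
    pos_score = (emb(src) * emb(dst)).sum(dim=-1)     # pull connected nodes together
    neg_score = (emb(src) * emb(neg)).sum(dim=-1)     # push random pairs apart
    loss = (-torch.log(torch.sigmoid(pos_score)).mean()
            - torch.log(torch.sigmoid(-neg_score)).mean())
    opt.zero_grad()
    loss.backward()
    opt.step()

node_embeddings = emb.weight.detach()  # one d-dimensional embedding per node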

Let’s understand the working more intuitively with an interesting example from the graph structure of the Zachary Karate Club social network. In this graph, nodes represent persons and an edge exists between two persons if they are friends. The coloring in the graph represents different communities. Figure A) shows the Zachary Karate Club social network and B) illustrates the 2D visualization of node embeddings created from the Karate graph. If you analyze both diagrams, you will find that the mapping of nodes from the graph structure (a non-Euclidean or irregular domain) to the embedding space (figure B) is done in such a manner that the distances between nodes in the embedding space mirror closeness in the original graph (preserving the structure of the node’s neighborhood). For example, the communities marked violet and green are in close proximity in the karate graph, whereas the violet and sea-green communities are far away from each other. The same pattern can also be seen in figure B.

image credit: graph representation learning

Homogeneous Graph Embeddings

Homogeneous graphs are (undirected) graphs that are made up of only one type of node and one type of link (relation). An example is the Amazon product graph (ogbn-products), in which nodes represent Amazon products (e.g. toys, books, electronics, Home & Kitchen, etc.) and edges correspond to products that are bought together. The embeddings generated from this type of graph are known as Homogeneous graph embeddings.

2D visualization of Amazon product graph (node) embeddings using GraphSAGE. Image by Author.
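For reference, such a homogeneous graph corresponds to a single PyG Data object with one node feature matrix and one edge index (a toy sketch, not the actual ogbn-products loader):

import torch
from torch_geometric.data import Data

x = torch.randn(4, 3)                              # 4 products, 3-dimensional features each
y = torch.tensor([0, 1, 0, 2])                     # product category labels
edge_index = torch.tensor([[0, 1, 2, 3],           # "bought together" edges in COO format
                           [1, 0, 3, 2]])

data = Data(x=x, edge_index=edge_index, y=y)       # one node type, one edge type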

Heterogeneous Graph Embeddings

Heterogeneous graphs are graphs that can contain different types of nodes and links (or relations). For example, a bibliographic graph (aka academic graph) can be depicted as a heterogeneous graph composed of four types of nodes (i.e. author, paper, conference, and term) and three types of relations (i.e. author-writes-paper, paper-contains-term, and conference-publishes-paper).

An academic graph with 4 types of nodes and 3 types of edges (image credit: Heterogeneous graphs survey by Xiao Wang et al.)
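In PyG, such a graph maps onto a HeteroData object with a separate feature store per node type and a separate edge index per relation (a toy sketch mirroring the academic graph above, not the actual DBLP loader):

import torch
from torch_geometric.data import HeteroData

data = HeteroData()
# one feature matrix per node type
data['author'].x = torch.randn(3, 8)
data['paper'].x = torch.randn(5, 8)
data['term'].x = torch.randn(4, 8)
data['conference'].x = torch.randn(2, 8)
# one edge index per (source_type, relation, target_type) triple
data['author', 'writes', 'paper'].edge_index = torch.tensor([[0, 1, 2], [0, 2, 4]])
data['paper', 'contains', 'term'].edge_index = torch.tensor([[0, 1], [1, 3]])
data['conference', 'publishes', 'paper'].edge_index = torch.tensor([[0, 1], [2, 4]])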

The embeddings generated from this type of graph are known as Heterogeneous graph embeddings. The following figure illustrates a 2D visualization of author node embeddings from the DBLP dataset (a heterogeneous academic graph). The author nodes belong to one of four research areas: database, data mining, artificial intelligence, and information retrieval.

2D visualization of author node embeddings in the DBLP dataset using metapath2vec. Image by Author.

Applications of Graph Embeddings

Once graph (node) embeddings are generated, they can be used for various downstream machine learning tasks, such as:

  • Feature input for downstream ML tasks (e.g. community detection, node classification, and link prediction)
  • Building a KNN/cosine-similarity graph from the embeddings, which can then be used to make recommendations (e.g. product recommendations)
  • Visual exploration of the data by reducing embeddings to 2 or 3 dimensions with UMAP or t-SNE (e.g. for clustering)
  • Proximity search, i.e. given any input node in the graph, locating nodes that are close to it (using both structural and semantic information)
  • Drug discovery
  • Fraud detection
  • Dataset comparisons
  • Transfer learning

Graph embeddings with fastgraphml

fastgraphml: Given an input graph, it generates Graph Embeddings using a low-code framework built on top of PyG. The package supports training on both GPU- and CPU-enabled machines. Training jobs on GPUs result in much faster execution and better performance when handling large graphs compared to CPUs. In addition, the framework provides tight integration with ArangoDB, which is a scalable, fully managed graph database, document store, and search engine in one place. Once Graph Embeddings are generated, they can be used for various downstream machine learning tasks like Node Classification, Link Prediction, Visualization, Community Detection, Similarity Search, Recommendation, etc.

Generating graph embeddings through fastgraphml feels like a breeze.

Installation

Required Dependencies

  1. PyTorch 1.12.* is required.
  • Install the build that matches your CUDA version (from the previous PyTorch versions page): PyTorch
  • To find your installed CUDA version, run nvidia-smi in your terminal.
  2. PyG
  3. FAISS
  • Note: FAISS-CPU requires numba==0.53.0

Install fastgraphml

pip install fastgraphml

Once fastgraphml is installed, we can start generating Graph Embeddings either from graphs present inside ArangoDB or directly from a PyG graph data object.

Let’s look into different use cases:

Generate Graph Embeddings using graphs stored inside ArangoDB:

Homogeneous Graphs

from fastgraphml.graph_embeddings import SAGE, GAT
from fastgraphml.graph_embeddings import downstream_tasks
from fastgraphml import Datasets
from arango import ArangoClient

# Initialize the ArangoDB client.
client = ArangoClient("http://127.0.0.1:8529")
db = client.db('_system', username='root', password='')

# Loading Amazon Computer Products dataset into ArangoDB
Datasets(db).load("AMAZON_COMPUTER_PRODUCTS")

# Optionally use arangodb graph
# arango_graph = db.graph('product_graph')

# metadata information of arango_graph
metagraph = {
    "vertexCollections": {
        # map the "features" attribute of the collection to x (node features)
        # and the "label" attribute to y (node labels)
        "Computer_Products": {"x": "features", "y": "label"},
    },
    "edgeCollections": {
        "bought_together": {},
    },
}

# generating graph embeddings with 3 lines of code
model = SAGE(db, 'product_graph', metagraph, embedding_size=256)  # define graph embedding model
model._train(epochs=6, lr=0.0001)  # train
embeddings = model.get_embeddings()  # get embeddings
Homogeneous Graph Detected ........ 

{'Nodes': 13471, 'Node_feature_size': 767, 'Number_of_classes': 10, 'Edges': 491722, 'Edge_feature_fize': None, 'Graph Directionality': 'Undirected', 'Average node degree': '36.50', 'Number of training nodes': 10777, 'Number of val nodes': 1347, 'Number of test nodes': 1347, 'Has isolated nodes': False}
Training started .........
Epoch: 001, Train_Loss: 1.3626, Val: 0.7996, Test: 0.8048
Val Acc increased (0.00000 --> 0.79955). Saving model ...
Epoch: 002, Train_Loss: 1.2654, Val: 0.8211, Test: 0.8233
Val Acc increased (0.79955 --> 0.82108). Saving model ...
Epoch: 003, Train_Loss: 1.1866, Val: 0.8300, Test: 0.8315
Val Acc increased (0.82108 --> 0.82999). Saving model ...
Epoch: 004, Train_Loss: 1.0630, Val: 0.8293, Test: 0.8344
Epoch: 005, Train_Loss: 1.0818, Val: 0.8352, Test: 0.8382
Val Acc increased (0.82999 --> 0.83519). Saving model ...

In addition, the library provides various low-code helper methods to carry out a number of downstream tasks such as visualization, similarity search (recommendation), and link prediction (to be added soon).

Downstream Task 1: Graph Embedding Visualisation

This method helps visualize the generated Graph Embeddings by reducing them to 2 dimensions using UMAP.

class_names = {0: 'Desktops', 1: 'Data Storage', 2: 'Laptops', 3: 'Monitors', 4: 'Computer Components',
               5: 'Video Projectors', 6: 'Routers', 7: 'Tablets', 8: 'Networking Products', 9: 'Webcams'}

# embedding visualization
# model.G accesses the PyG data object
downstream_tasks.visualize_embeddings(model.G, embeddings, class_mapping=class_names, emb_percent=0.1)

2D visualization of Amazon computer products (graph) node embeddings. Image by Author.
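The helper call above handles this for us. Purely for intuition, a hand-rolled equivalent using the umap-learn and matplotlib packages (an assumption for illustration, not necessarily what the helper does internally) might look like this:

import umap
import matplotlib.pyplot as plt

# reduce the 256-dimensional embeddings to 2D
# (reuses `embeddings` and the labels in model.G.y from the snippets above)
emb_2d = umap.UMAP(n_components=2).fit_transform(embeddings)

labels = model.G.y.cpu().numpy()
plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=labels, s=2, cmap='tab10')
plt.title('Amazon computer products: node embeddings (UMAP)')
plt.show()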

Downstream Task 2: Scalable Similarity Search (recommendation) with Faiss

Faiss is a library developed by Facebook that performs similarity search in sets of vectors of any size, up to ones that possibly do not fit in RAM. We support two types of search for now (a short raw-Faiss sketch of both index types follows the list):

  1. exact search: For precise similarity search but at the cost of scalability.
  2. approx search: For scalable similarity search but at the cost of some precision loss.
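For intuition, here is a small raw-Faiss sketch (not the fastgraphml helper itself) contrasting the two index types: IndexFlatL2 scans every vector exactly, while IndexIVFFlat trades some precision for speed by probing only a few clusters.

import faiss
import numpy as np

d = 256                                                  # embedding dimension
xb = np.random.random((10000, d)).astype('float32')     # database vectors (e.g. node embeddings)
xq = xb[:5]                                              # query vectors

# 1. Exact search: brute-force L2 distance over all vectors
index_exact = faiss.IndexFlatL2(d)
index_exact.add(xb)
D, I = index_exact.search(xq, 10)                        # distances and ids of top-10 neighbors

# 2. Approximate search: inverted-file index over 100 clusters
quantizer = faiss.IndexFlatL2(d)
index_approx = faiss.IndexIVFFlat(quantizer, d, 100)
index_approx.train(xb)                                   # IVF indexes must be trained before adding vectors
index_approx.add(xb)
index_approx.nprobe = 10                                 # clusters visited at query time
D, I = index_approx.search(xq, 10)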

Let’s perform a recommendation: given any computer product, which other products might be bought together with it?

# returns top 50 similar products (ids) along with similarity distance
distance, nbors = downstream_tasks.similarity_search(embeddings, top_k_nbors=50)

# Let's pick a random computer product for which we want a recommendation
# model.G.y accesses the PyG node labels
class_names[model.G.y[5].item()]
'Data Storage'

# recommend a computer product that can be bought together with the 'Data Storage' product
class_names[model.G.y[nbors[5][40]].item()]
'Networking Products'

Node Classification with Graph Embeddings

In the real world, most datasets are unlabeled (or have few labels) and are often imbalanced. Class imbalance and sparse labels make supervised learning challenging: sparse labels can lead to more false negatives, and imbalanced datasets can result in models with more false positives. Hence, training GNNs with unsupervised objectives and using their latent representations (node embeddings) downstream can provide promising results. Once graph (node) embeddings are generated using unsupervised learning, they can be used as feature inputs to machine learning classification models to perform node classification.

The code below shows that even a small amount of labeled training data (10%) can generalize well to unseen (test) data. This is in contrast with the SAGE class above, where we used 80% of the data for training to evaluate the performance of the generated graph embeddings.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Dataset Splitting
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, model.G.y, train_size=0.1, test_size=None, stratify=model.G.y, random_state=42)

# Training with Logistic Regression
clf = LogisticRegression(max_iter=1000, solver="lbfgs")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# evaluate accuracy on the test set
accuracy_score(y_test, y_pred)
0.8094688221709007

Store Homogeneous Graph Embeddings in ArangoDB

fastgraphml also provides a helper method store_embeddings to store generated graph embeddings inside ArangoDB.

Note: If nearest_nbors_search=True, the store_embeddings method saves generated Graph Embeddings in ArangoDB along with top_k nearest neighbors (node ids with similar embeddings) and their corresponding similarity scores (i.e. cosine distance).

model.graph_util.store_embeddings(embeddings, collection_name='computer_products_embeddings',
                                  batch_size=100, class_mapping=class_names, nearest_nbors_search=True)

Heterogeneous Graphs

from fastgraphml.graph_embeddings import METAPATH2VEC
from fastgraphml.graph_embeddings import downstream_tasks
from arango import ArangoClient
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# Initialize the ArangoDB client.
client = ArangoClient("http://127.0.0.1:8529")
db = client.db('_system', username='root')

# Loading ArangoDB Graph
arango_graph = db.graph("DBLP")

# metadata information of arango_graph (we assume the DBLP graph already exists in ArangoDB)
metagraph = {
    "vertexCollections": {
        "author": {"x": "x", "y": "y"},
        "paper": {"x": "x"},
        "term": {"x": "x"},
        "conference": {}
    },
    "edgeCollections": {
        "to": {},
    },
}

# APCPA metapath (author-paper-conference-paper-author)
metapaths = [('author', 'to', 'paper'),
             ('paper', 'to', 'conference'),
             ('conference', 'to', 'paper'),
             ('paper', 'to', 'author'), ]

# generating graph embeddings with 3 lines of code
model = METAPATH2VEC(db, arango_graph, metagraph, metapaths, key_node='author',
                     embedding_size=128, walk_length=5, context_size=6, walks_per_node=10,
                     num_negative_samples=10, sparse=True)  # define model
model._train(epochs=15, lr=0.03)  # train
embeddings = model.get_embeddings()  # get embeddings
Heterogeneous Graph Detected .......... 

{'Nodes': 26128, 'Edges': 239566, 'node_types': ['author', 'paper', 'term', 'conference'], 'edge_types': [('author', 'to', 'paper'), ('paper', 'to', 'author'), ('paper', 'to', 'term'), ('paper', 'to', 'conference'), ('term', 'to', 'paper'), ('conference', 'to', 'paper')], 'Graph Directionality': 'Directed', 'Has isolated nodes': True, 'node stats': {'Number of author nodes': 4057, 'Number of train author nodes': 3245, 'Number of val author nodes': 406, 'Number of test author nodes': 406, 'Number of classes in author nodes': 4, 'number of paper nodes': 14328, 'number of term nodes': 7723, 'number of conference nodes': 20}}
Training started .........
Epoch: 001, Train_Loss: 8.7637, Val: 0.3399, Test: 0.3842
Val Acc increased (0.00000 --> 0.33990). Saving model ...
Epoch: 002, Train_Loss: 6.0169, Val: 0.5000, Test: 0.5369
Val Acc increased (0.33990 --> 0.50000). Saving model ...
Epoch: 003, Train_Loss: 4.9843, Val: 0.6749, Test: 0.6650
Val Acc increased (0.50000 --> 0.67488). Saving model ...
Epoch: 004, Train_Loss: 4.3761, Val: 0.7980, Test: 0.7956
Val Acc increased (0.67488 --> 0.79803). Saving model ...
Epoch: 005, Train_Loss: 3.4619, Val: 0.8719, Test: 0.8522
Val Acc increased (0.79803 --> 0.87192). Saving model ...
Epoch: 006, Train_Loss: 2.9975, Val: 0.8867, Test: 0.8695
Val Acc increased (0.87192 --> 0.88670). Saving model ...
Epoch: 007, Train_Loss: 2.4220, Val: 0.9163, Test: 0.8818
Val Acc increased (0.88670 --> 0.91626). Saving model ...
Epoch: 008, Train_Loss: 2.0990, Val: 0.9187, Test: 0.8867
Val Acc increased (0.91626 --> 0.91872). Saving model ...
Epoch: 009, Train_Loss: 1.8748, Val: 0.9163, Test: 0.8793
Epoch: 010, Train_Loss: 1.6358, Val: 0.9089, Test: 0.9015
Epoch: 011, Train_Loss: 1.6156, Val: 0.9138, Test: 0.9089
Epoch: 012, Train_Loss: 1.4696, Val: 0.9261, Test: 0.9089
Val Acc increased (0.91872 --> 0.92611). Saving model ...
Epoch: 013, Train_Loss: 1.2789, Val: 0.9163, Test: 0.8892
Epoch: 014, Train_Loss: 1.2143, Val: 0.9187, Test: 0.9138
# Metapath2Vec computes embeddings only for node types that are present in the metapath
embeddings
{'author': array([[ 0.3469685 , -0.73929137, -0.3658532 , ..., -0.07065899,
0.01433279, -0.00440213],
[-0.18779977, 0.0759825 , -0.38714892, ..., -0.13985269,
-0.7717297 , -0.55180293],
[-0.27399492, -0.1742627 , 0.01303964, ..., 0.08810424,
-0.4031429 , 0.20701364],
...,
[ 0.1659177 , 0.11282699, -0.14390166, ..., 0.17577603,
-0.28433827, 0.16120055],
[ 0.01443969, 0.1374461 , 0.5134789 , ..., 0.33182082,
0.2584621 , 0.00462335],
[ 0.22391362, -0.50708103, 0.34233156, ..., 0.03449561,
0.16480075, 0.39390147]], dtype=float32),
'conference': array([[-2.1632937e-01, -5.3228494e-02, 1.5947707e-01, ...,
5.1428860e-01, 8.5533451e-04, -3.4591302e-01],
[ 6.9806822e-02, 5.0240862e-01, -2.3461170e-01, ...,
-4.9915221e-01, 1.5349187e-01, -1.8434562e-01],
[-5.0854170e-01, -9.7937733e-02, -1.0179291e+00, ...,
-1.8171304e-01, 6.6947944e-02, -3.5466689e-01],
...,
[ 6.2907688e-02, -8.9021228e-02, 3.4244403e-02, ...,
-1.6124582e-02, 5.2124184e-01, 3.5454047e-01],
[-1.1044691e+00, 3.7697849e-01, -3.7053806e-01, ...,
-2.4933312e-02, 7.9877669e-01, 3.4990273e-02],
[-8.0069518e-01, 6.9776934e-01, -5.1909280e-01, ...,
-2.3521569e-03, -7.8969456e-02, 9.5190950e-02]], dtype=float32),
'paper': array([[-1.6482981 , 0.7253625 , 1.0436039 , ..., 1.4693767 ,
-1.5437169 , -0.0564078 ],
[-0.22423816, 0.34060782, -0.09682338, ..., -0.3744318 ,
-0.4454421 , -1.3889308 ],
[-0.00360703, -1.0357887 , -0.6753541 , ..., -0.6235587 ,
-0.2809864 , -0.6067877 ],
...,
[-0.08141378, 1.0001668 , 0.57556117, ..., -1.494264 ,
-0.13634554, 1.0170926 ],
[-1.0099323 , 0.67756116, 0.5964136 , ..., -0.6101154 ,
-1.1644614 , 0.04493611],
[ 0.09980668, 0.178698 , 0.52335536, ..., -1.1220363 ,
-1.161221 , 0.35191363]], dtype=float32)}

Similarity Search/Recommendation

# returns top 10 similar authors (ids) along with similarity distance
distance, nbors = downstream_tasks.similarity_search(embeddings['author'], top_k_nbors=10)
# recommend similar authors based on common research areas and common conferences
nbors
array([[   0,  670,   14, ..., 1132,  230, 2585],
[ 1, 14, 1132, ..., 404, 22, 1730],
[ 2, 3718, 1201, ..., 3784, 3848, 3820],
...,
[4054, 1795, 2235, ..., 1389, 4012, 3991],
[4055, 3104, 2803, ..., 2530, 1364, 3900],
[4056, 3979, 2630, ..., 4013, 4006, 3991]])

Node Classification with Graph Embeddings

# Dataset Splitting
X_train, X_test, y_train, y_test = train_test_split(
    embeddings['author'], model.G['author'].y.cpu().numpy(), train_size=0.1, test_size=None,
    stratify=model.G['author'].y.cpu().numpy(), random_state=42)

# Training with Logistic Regression
clf = LogisticRegression(max_iter=1000, solver="lbfgs")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# evaluate accuracy on the test set
accuracy_score(y_test, y_pred)
0.9173055859802848

Store Heterogeneous Graph Embeddings in ArangoDB

model.graph_util.store_embeddings(embeddings['author'], collection_name='author_embeddings', node_type='author')

Generate Graph Embeddings using PyG graphs:

from fastgraphml.graph_embeddings import SAGE, GAT
from fastgraphml.graph_embeddings import downstream_tasks
from torch_geometric.datasets import Planetoid

# load pyg dataset
dataset = Planetoid(root='/tmp/Cora', name='Cora')
data = dataset[0]

# generating graph embeddings with 3 lines of code
model = SAGE(pyg_graph=data, embedding_size=64) # define graph embedding model
model._train(epochs=10) # train
embeddings = model.get_embeddings() # get embeddings

Conclusion

In this article, we learned about the traditional ways of doing machine learning on graphs, i.e. graph feature extractors such as node degree and clustering coefficients, and their limitations. Then, we looked into the advantages of graph embeddings over manual graph feature engineering. Further, I shed some light on what graph embeddings are: a summary of a graph that encodes its structural and semantic information in a d-dimensional vector space. After that, we saw a very high-level, intuitive explanation of Homogeneous and Heterogeneous graph embeddings. In the last section, we talked about how to get started with fastgraphml, a low-code package to build graph machine learning models quickly. The fastgraphml package provides plenty of functionality: building a graph ML model with 3 lines of code, generating graph embeddings for Homogeneous and Heterogeneous graphs, visualizing graph embeddings, similarity search with graph embeddings, exporting graphs from ArangoDB and saving graph embeddings back to ArangoDB, and generating graph embeddings directly from PyG data objects. The ArangoGraphML team is also building a GraphML platform (enterprise-ready, graph-powered machine learning as a cloud service), which is currently open as a beta program.

Any feedback and feature requests for the next release are welcome!

Acknowledgments

I would like to thank the whole ML team of ArangoDB for providing me with valuable feedback while I was writing this package.

Want to connect with me: LinkedIn
