KGCNs: Machine Learning over Knowledge Graphs with TensorFlow

James Fletcher
Towards Data Science
6 min read · Mar 6, 2019


This project introduces a novel model: the Knowledge Graph Convolutional Network (KGCN), free to use from the GitHub repo under the Apache licence. It’s written in Python and can be installed via pip from PyPI.

The principal idea of this work is to forge a bridge between knowledge graphs, automated logical reasoning, and machine learning, using TypeDB as the knowledge graph.

Summary

A KGCN can be used to create vector representations (embeddings) of any labelled set of TypeDB Things via supervised learning.

  1. A KGCN can be trained directly for the classification or regression of Things stored in TypeDB.
  2. Future work will include building embeddings via unsupervised learning.

What is it used for?

Often, data doesn’t fit well into a tabular format. There are many benefits to storing complex and interrelated data in a knowledge graph, not least that the context of each datapoint can be stored in full.

However, many existing machine learning techniques rely upon the existence of an input vector for each example. Creating such a vector to represent a node in a knowledge graph is non-trivial.

In order to make use of the wealth of existing ideas, tools and pipelines in machine learning, we need a method of building these vectors. In this way we can leverage contextual information from a knowledge graph for machine learning.

This is what a KGCN can achieve. Given an example node in a knowledge graph, it examines the nodes in the vicinity of that example: its context. From this context it determines a vector representation (an embedding) for that example.

There are two broad learning tasks a KGCN is suitable for:

  1. Supervised learning from a knowledge graph for prediction e.g. multi-class classification (implemented), regression, link prediction
  2. Unsupervised creation of Knowledge Graph Embeddings, e.g. for clustering and node comparison tasks

In order to build a useful representation, a KGCN needs to perform some learning. To do that it needs a function to optimise. Revisiting the broad tasks we can perform, we have different cases to configure the learning:

  1. In the supervised case, we can optimise for the exact task we want to perform. In this case, embeddings are interim tensors in a learning pipeline
  2. To build unsupervised embeddings as the output, we optimise to minimise a similarity metric across the graph

Methodology

The rationale behind this project is described here, along with a video of the presentation. The principles of the implementation are based on GraphSAGE, from the Stanford SNAP group, heavily adapted to work over a knowledge graph. Instead of working on a typical property graph, a KGCN learns from contextual data stored in a typed hypergraph, TypeDB. Additionally, it learns from facts deduced by TypeDB’s automated logical reasoner. From this point onwards, some familiarity with TypeDB (see its docs) is assumed.

Now we introduce the key components and how they interact.

KGCN

A KGCN is responsible for deriving embeddings for a set of Things (and thereby directly learning to classify them). We start by querying TypeDB to find a set of labelled examples. Following that, we gather data about the context of each example Thing. We do this by considering their neighbours, and their neighbours’ neighbours, recursively, up to K hops away.
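The recursive gathering step can be sketched in plain Python. This is an illustrative stand-in only: the actual implementation queries TypeDB via its client, whereas here `get_neighbours` and the adjacency dict are hypothetical toy substitutes.

```python
# Illustrative sketch: recursively gather a Thing's neighbourhood up to K hops.
# `get_neighbours` is a hypothetical stand-in for the KGCN's TypeDB queries.

def get_neighbours(graph, thing):
    """Return the immediate neighbours of `thing` from a plain adjacency dict."""
    return graph.get(thing, [])

def gather_context(graph, thing, k):
    """Build a nested context: each node paired with its neighbours' contexts."""
    if k == 0:
        return (thing, [])
    return (thing, [gather_context(graph, n, k - 1)
                    for n in get_neighbours(graph, thing)])

# Toy knowledge graph as an adjacency dict
graph = {"person-1": ["employment-1", "name-1"],
         "employment-1": ["company-1"]}

context = gather_context(graph, "person-1", 2)
```

The nested tuples returned here play the role of the neighbourhood data that the real pipeline compiles into arrays.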

We retrieve the data concerning this neighbourhood from TypeDB (diagram above). This information includes the type hierarchy, roles, and attribute values of each neighbouring Thing encountered, as well as any inferred neighbours (represented above by dotted lines). This data is compiled into arrays to be ingested by a neural network.

Via operations Aggregate and Combine, a single vector representation is built for a Thing. This process can be chained recursively over K hops of neighbouring Things. This builds a representation for a Thing of interest that contains information extracted from a wide context.

In supervised learning, these embeddings are directly optimised to perform the task at hand. For multi-class classification this is achieved by passing the embeddings to a single subsequent dense layer, determining loss via softmax cross entropy (against the example Things’ labels), and then optimising to minimise that loss.
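The classification head described above can be sketched in numpy. The real pipeline does this in TensorFlow with learned weights; the weights and embeddings below are fixed, hypothetical values for illustration.

```python
import numpy as np

# Minimal numpy sketch of the classification head:
# embeddings -> dense layer -> softmax cross-entropy loss.

def dense(x, w, b):
    return x @ w + b

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and one-hot labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1).mean()

embeddings = np.array([[0.5, -1.0], [2.0, 0.3]])    # 2 examples, embed dim 2
w = np.array([[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]])   # embed dim 2 -> 3 classes
b = np.zeros(3)
labels = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # one-hot class labels

loss = softmax_cross_entropy(dense(embeddings, w, b), labels)
```

During training, an optimiser would adjust `w`, `b`, and the embedding network to push this loss towards zero.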

A KGCN object brings together a number of sub-components: a Context Builder, a Neighbour Finder, an Encoder, and an Embedder.

The input pipeline components are less interesting, so we’ll skip to the fun stuff. You can read about the rest in the KGCN readme.

Embedder

To create embeddings, we build a network in TensorFlow that successively aggregates and combines features from the K hops until a ‘summary’ representation remains — an embedding (diagram below).

To create the pipeline, the Embedder chains Aggregate and Combine operations for the K hops of neighbours considered, e.g. for the 2-hop case this means Aggregate-Combine-Aggregate-Combine.
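The chaining can be sketched as a recursion over the nested neighbourhood. Here `aggregate` and `combine` are simplified placeholders (mean and sum) standing in for the learned operations described in the following sections; the structure, not the arithmetic, is the point.

```python
import numpy as np

# Sketch of the K=2 chain: neighbour embeddings are Aggregated into a summary
# vector, which is Combined with the node's own vector. Applied recursively,
# this yields Aggregate-Combine-Aggregate-Combine for 2 hops.

def aggregate(neighbour_vecs):
    return np.mean(neighbour_vecs, axis=0)   # placeholder for the Aggregator

def combine(self_vec, neighbour_summary):
    return self_vec + neighbour_summary      # placeholder for the Combiner

def embed(node_vec, neighbours):
    """`neighbours` is a list of (vec, their_neighbours) pairs, nested K deep."""
    if not neighbours:
        return node_vec
    summaries = [embed(v, ns) for v, ns in neighbours]
    return combine(node_vec, aggregate(summaries))

thing = np.array([1.0, 1.0])
context = [(np.array([2.0, 0.0]), []), (np.array([0.0, 2.0]), [])]
embedding = embed(thing, context)
```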

The diagram above shows how this chaining works in the case of supervised classification.

The Embedder is responsible for chaining the sub-components Aggregator and Combiner, explained below.

Aggregator

An Aggregator (pictured below) takes in a vector representation of a sub-sample of a Thing’s neighbours. It produces one vector that is representative of all of those inputs. It must do this in a way that is order agnostic since the neighbours are unordered. To achieve this we use one densely connected layer, and maxpool the outputs (maxpool is order-agnostic).
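A minimal numpy sketch of this idea, with fixed random weights in place of learned ones: a dense layer applied per neighbour, then an element-wise max-pool across neighbours. Because max-pooling commutes with any reordering of its inputs, shuffling the neighbours leaves the output unchanged.

```python
import numpy as np

# Sketch of the Aggregator: dense layer per neighbour, then maxpool over
# neighbours. Maxpool is order-agnostic, as the neighbour set is unordered.

def aggregator(neighbour_vecs, w, b):
    activated = np.maximum(neighbour_vecs @ w + b, 0.0)  # dense layer + ReLU
    return activated.max(axis=0)                         # maxpool over neighbours

rng = np.random.default_rng(0)
neighbours = rng.normal(size=(4, 3))           # 4 neighbours, feature dim 3
w, b = rng.normal(size=(3, 5)), np.zeros(5)    # dense layer: 3 -> 5

out = aggregator(neighbours, w, b)
shuffled = aggregator(neighbours[::-1], w, b)  # same neighbours, new order
```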

Combiner

Once we have Aggregated the neighbours of a Thing into a single vector representation, we need to combine this with the vector representation of that Thing itself. A Combiner achieves this by concatenating the two vectors and then reducing the dimensionality using a single densely connected layer.
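In numpy terms, the Combiner is a concatenation followed by one dense layer. The weights and activation below are illustrative placeholders, not the values the network would learn.

```python
import numpy as np

# Sketch of the Combiner: concatenate the Thing's own vector with the
# aggregated neighbour vector, then reduce dimensionality with a dense layer.

def combiner(self_vec, neighbour_vec, w, b):
    concat = np.concatenate([self_vec, neighbour_vec])  # dim d1 + d2
    return np.tanh(concat @ w + b)                      # back to output dim

self_vec = np.ones(4)        # the Thing's own representation
neighbour_vec = np.zeros(4)  # the Aggregator's output
w = np.full((8, 4), 0.1)     # dense layer: (4 + 4) -> 4
b = np.zeros(4)

combined = combiner(self_vec, neighbour_vec, w, b)
```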

Supervised KGCN Classifier

A Supervised KGCN Classifier is responsible for orchestrating the actual learning. It takes in a KGCN instance and, as any learner making use of a KGCN must, provides:

  • Methods for train/evaluation/prediction
  • A pipeline from embedding tensors to predictions
  • A loss function that takes in predictions and labels
  • An optimiser
  • The backpropagation training loop
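The division of labour in that list can be sketched as a skeleton class. All names here are illustrative, not the library’s actual API: the KGCN supplies the embedding function, and the classifier wires it into a prediction pipeline, a loss, and a training loop.

```python
# Hypothetical skeleton of a supervised learner orchestrating a KGCN.

class SupervisedClassifier:
    def __init__(self, embed_fn, predict_fn, loss_fn, step_fn):
        self.embed_fn = embed_fn      # KGCN: examples -> embeddings
        self.predict_fn = predict_fn  # embeddings -> class predictions
        self.loss_fn = loss_fn        # (predictions, labels) -> scalar loss
        self.step_fn = step_fn        # optimiser update for one step

    def train(self, examples, labels, epochs):
        losses = []
        for _ in range(epochs):
            predictions = self.predict_fn(self.embed_fn(examples))
            loss = self.loss_fn(predictions, labels)
            self.step_fn(loss)        # backpropagation would happen here
            losses.append(loss)
        return losses

# Wiring it up with trivial stand-ins for the real components:
clf = SupervisedClassifier(
    embed_fn=lambda xs: xs,
    predict_fn=lambda es: es,
    loss_fn=lambda preds, lbls: sum(abs(p - l) for p, l in zip(preds, lbls)),
    step_fn=lambda loss: None,
)
losses = clf.train([1.0, 2.0], [1.0, 2.0], epochs=3)
```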

It must be the class that provides these behaviours, since a KGCN is not coupled to any particular learning task. This class, therefore, provides all of the specialisations required for a supervised learning framework.

Below is a slightly simplified UML activity diagram of the program flow.

Build with KGCNs

To start building with KGCNs, take a look at the readme’s quickstart, ensure that you have all of the requirements, and follow the sample usage instructions.

This will get you on your way to building a multi-class classifier for your own knowledge graph! There’s also an example in the repository, with real data, that should fill in any gaps the template usage misses.

If you like what we’re up to, and you use or are interested in KGCNs, there are several things you can do:

  • Submit an issue for any problems you encounter installing or using KGCNs
  • Star the repo if you’re inclined to help us raise the profile of this work :)
  • Ask questions, propose ideas or have a conversation with us on the Vaticle Discord channel

This post has been written for the third kglib pre-release, using TypeDB commit 20750ca0a46b4bc252ad81edccdfd8d8b7c46caa, and may subsequently fall out of line with the repo. Check there for the latest!


Principal Scientist at Vaticle. Researching the intelligent systems of the future, enabled by Knowledge Graphs.