Towards Deep Relational Learning

What is Relational Machine Learning?

A dive into fundamentals of learning representations beyond feature vectors

Gustav Šír
Towards Data Science
12 min read · Feb 8, 2022


Relational learning aims at learning from structured data with complex internal and/or external relationships. (Image from pixabay)

From AI to ML

All intelligent life forms instinctively model their surrounding environment in order to actively navigate through it with their actions. In Artificial Intelligence (AI) research, we then try to understand and automate this interesting ability of living systems with machine learning (ML) at the core.

  • Generally speaking, deriving mathematical models of complex systems is at the core of any scientific discipline. Researchers have always tried to come up with equations governing the behavior of their systems of interest, ranging from physics and biology to economics.

Machine learning then instantiates the scientific method of searching for a mathematical hypothesis (model) that best fits the observed data. However, thanks to the advances in computing, it allows this process to be further automated into a search through large, prefabricated hypothesis spaces in a heavily data-driven fashion. This is particularly useful in the modeling of complex systems for which the structure of the underlying hypothesis space is too complex, or even unknown, but large amounts of data are available.

  • To be fair, a similar approach called “model identification” has also been a traditional part of control theory, where the parameters of differential equations describing a system’s dynamics are estimated from input-output data measured on the, often also complex, system in scope. With an accurate model, optimal control actions can then be derived to steer the system’s evolution towards desirable target measures, much in the spirit of what is now called AI (with less math and more hype).

Origins of Feature Vectors

While the approaches to the problem of mathematical modeling of complex systems evolved in various, largely independent, ways, one aspect remained almost universal — the data representation. Indeed, while the mathematical forms of the hypotheses and models have traditionally varied wildly, from analytical expressions and differential equations used in the control theory, all the way to decision trees and neural networks used now in ML, the input-output observations have traditionally been limited to the form of numeric vectors.

This only seems natural. Since the advent of computers, we have become very accustomed to turning any property of interest into a number, ranging from physical measurements, such as force or voltage, all the way to color, mood, or preference of ketchup over mustard.

Given a number of such input (X) and output (Y) variables that can be measured on a system of interest, which can be anything from a power plant to a human, each such measurement then reduces to a particular vector of numbers, commonly referred to as a feature vector.

But there is also another good reason why feature vectors are highly appealing. Treating each data sample (measurement) as an independent point in an n-dimensional space allows us to directly adopt the standard machinery of linear algebra, time-proven by centuries of engineering in other domains.

Classic machine learning with i.i.d. feature vectors (n-dimensional points) is “just” multivariate statistics... Image by the author.

Thanks to this representation of samples, assumed to be independently and identically drawn (i.i.d.) from some joint distribution P(X,Y), machine learning research could also build directly upon known statistical results, such as probabilistic concentration bounds (e.g., Hoeffding’s inequality), to come up with the standard ML theory of “probably approximately correct” (PAC) learning. Consequently, much of classic machine learning, at least when studied properly, falls under multivariate statistics.

As a result, any classic ML method now expects input data in the form of a table, where each column corresponds to a feature X or target variable Y, and each row corresponds to a single example measurement. The most general task is then to estimate the joint probability distribution P(X,Y) that generated the observed data or, more commonly in supervised ML, to estimate just the conditional P(Y|X). These, again, are tasks that have long been studied in statistics.
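For illustration, this is roughly what the classic setting boils down to in practice — a minimal sketch (assuming the scikit-learn library and made-up numbers), fitting a model to a single table of i.i.d. feature vectors to estimate the conditional P(Y|X):

import numpy as np
from sklearn.linear_model import LogisticRegression

# rows = i.i.d. samples, columns = features X; one target value Y per row
X = np.array([[0.5, 1.2], [1.1, 0.3], [0.9, 2.0], [1.8, 0.7]])
y = np.array([0, 1, 0, 1])

model = LogisticRegression().fit(X, y)      # search a prefabricated hypothesis space
print(model.predict_proba([[1.0, 1.0]]))    # an estimate of P(Y | X = [1.0, 1.0])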

The Need for Relational Representations

We have now become so used to pre-processing our data into such a table (or tensor) of numbers, expected as the input format by almost any ML library, that it might be hard to even imagine that this is not the all-encompassing data representation. However, just look around to see what actual real-world data looks like. It is not stored in numeric vectors/tensors, but in the interlinked structures of internet pages, social networks, knowledge graphs, and biological, chemical, and engineering databases, etc. These are inherently relational data that are naturally stored in their structured form of graphs, hypergraphs, and relational databases.

A large part of the real-world data is stored in relational databases. (Image from pixabay.)

But wait, can’t we just turn these structures into the feature vectors, and everything goes back to normal?

Well, people certainly did that for the aforementioned (convenience) reasons, and up until quite recently, this was by far the primary way to do ML with these data structures — an approach often referred to as propositionalization. For instance, one could calculate various statistics upon the structures, such as counting nodes, edges, or subgraphs of a graph (and also utilize various kernel methods operating upon these).
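As a concrete illustration, here is a minimal sketch of such propositionalization (assuming the networkx library and a toy graph), hand-crafting a fixed-size feature vector out of simple graph statistics:

import networkx as nx

def propositionalize(G: nx.Graph) -> list:
    # hand-crafted graph-level features: counts of simple (sub)structures
    return [
        G.number_of_nodes(),
        G.number_of_edges(),
        sum(nx.triangles(G).values()) // 3,           # number of triangle subgraphs
        max((d for _, d in G.degree()), default=0),   # maximum node degree
    ]

G = nx.cycle_graph(5)                # a toy input graph
features = propositionalize(G)       # -> [5, 5, 0, 2], ready for any classic ML model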

And from a practical perspective, there’s nothing wrong with crafting features from relational structures, but it is good to realize that there is a common cognitive bias underlying such an approach:

“If all you have is a hammer, everything looks like a nail.”

So, can we skip this phase of feature vector construction from relational data? If you’re now thinking “deep learning to the rescue”, it is important to realize that, up until very recently, all the classic deep learning methods were also restricted to representations in the form of fixed-size numeric vectors (or tensors). The idea behind deep learning is “only” to skip manual construction of “high-level” representation vectors from “low-level” (input) representation vectors, but you still need the latter to begin with!

And, from a theoretical viewpoint, there is a deep problem with turning relational data into the vector representation, since there is no way of mapping an unbounded relational structure into any fixed-size representation without (unwanted) loss of information during this preprocessing (propositionalization) step.

Moreover, even if we restrict ourselves to fixed-size structures, designing a suitable representation in the form of a numeric vector or tensor is still deeply problematic. Take, for instance, even just graph data, which is a special form of relational data. If there were a definite way to map a graph into the standard learning form of a (fixed-size) numeric vector (or tensor), it would trivially solve the graph isomorphism problem.

  • since to test whether two graphs are isomorphic, it would suffice to turn them into such vectors and compare these for equality instead. Of course, we further assume that creating such vectors would be efficient (i.e., computable in polynomial time).

Hence, there is an inherent need for a fundamentally different learning representation formalism for data with such irregular topologies, outside the classic space of fixed-size numeric vectors (or tensors).
So, can we actually learn with the relational data representations, coming in the form of various networks, (knowledge) graphs, and relational databases?

Intermezzo on GNNs. Of course, by now you have probably heard about Graph Neural Networks (GNNs), recently proposed to tackle graph-structured data — and don’t worry, we will get to these in a follow-up article! For now, just note that GNNs are one specific way of dealing with one form of the relational representation (a graph), rooted in one particular (very good) heuristic for the graph isomorphism problem (Weisfeiler-Lehman).
Now let’s continue with the broader perspective.

Relational Machine Learning

Much of the recent deep learning research has then been about discovering models and learning representations capturing data in various forms of sets and graphs. However, it is only rarely acknowledged that these structured learning representations have long been studied (as a special case) in Relational Machine Learning.

A relation, as you might recall, is a subset of a Cartesian product defined over some sets of objects. Every set is thus simply a degenerate case of some (unary) relation. Every graph can then be seen as an instantiation of a binary relation over the same set of objects (nodes). A relation of arity higher than two then corresponds to a classic relational table, which can also be viewed as a hypergraph. Add multiple such relations (tables) over the objects, and you have a relational database.
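To make this hierarchy tangible, here is a minimal sketch (plain Python, with hypothetical object and relation names), viewing each of the structures as a relation, i.e. a set of tuples:

# a unary relation ~ a set of objects
person = {("anna",), ("bob",), ("carl",)}
# a binary relation over the same objects ~ a graph (edges)
friend = {("anna", "bob"), ("bob", "carl")}
# a ternary relation ~ a classic relational table (a hypergraph)
bought = {("anna", "book", 2022), ("carl", "bike", 2021)}
# multiple named relations over the objects ~ a relational database
database = {"person": person, "friend": friend, "bought": bought}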

Much of the real-world data is then stored in such relational databases, which you have certainly encountered before. Now imagine that your learning samples are not prepared nicely as rows in a single table, but spread across multiple interlinked tables of the database, where different samples consist of different types and numbers of objects, with each object being characterized by a different set of attributes. This situation is actually far from uncommon in practice, but how on earth are you going to fit something like that into your favorite SVM/XGBoost/NN model?

While these data representations inherently fall outside the standard vector (tensor) formalism, there is actually another representation formalism that covers all these formats very naturally. It is relational logic.

Indeed, relational logic¹ is the lingua franca of all structured (relational) representations. In practice, many of the standard formats (e.g. ERM & SQL) designed for the structured data, ranging from sets to databases², follow directly from relational logic (and relational algebra).

And while you are probably already familiar with the relational logic/algebra formalism from your CS 101, it is quite likely that you have never heard of it in the context of machine learning. However, apart from being a great data manipulation and representation formalism, relational logic can also be used to directly tackle complex relational machine learning scenarios, just like the one outlined above.

Learning with Logic

Largely out of the spotlight of the machine learning mainstream, there has been a community of Inductive Logic Programming (ILP), concerned with learning interpretable models from data with complex relational structures.

As outlined, ILP exploits the expressiveness of the relational logic formalism to capture these data structures (including relational databases and more). However, interestingly, relational logic is here also used to represent the models themselves. In ILP, these take the form of logical theories⁴, i.e. sets of logical rules formed from the logical relations used.

Moreover, ILP introduced the fundamental concept of background knowledge which can be, thanks to the logic-based representation, elegantly incorporated as a relational inductive bias directly into the models.

For decades [3], this rather unorthodox relational ML approach was then the premier venue for learning with data samples that do not lend themselves to the standard form of i.i.d. feature vectors. This allowed ILP to explore some very general learning problems of manipulating structured data representations, involving variously attributed objects participating in relationships, actions, and events, beyond the scope of standard statistical ML.

Example. For illustration, let’s see how the recently explored graph-structured learning problems can be approached with relational logic. To represent a graph, we simply define a binary ‘edge’ relation, with a set of instantiations edge(x,y) for all adjacent nodes x,y in the graph. Additionally, we may also use other (unary) relations to assign various attributes to sets of nodes, such as ‘red(x)’, etc.

An example of a labeled graph structure encoded in relational logic (left), and the two possible inferences of the query “path(d, a)” through the (learned) recursive model of a path (right). Image by the author (source).

The models, i.e. the logical rules, then commonly express relational patterns to be searched for within the data. This covers all sorts of things, from finding characteristic substructures in molecules to paths in a network. Thanks to the high expressiveness, declarative nature, and inherent use of recursion in relational logic, the learned models are then often very compact and elegant. For instance, a (learned) model perfectly capturing paths in a graph, such as your subway connection from X to Y, will commonly look like

path(X,Y) <= edge(X,Y).
path(X,Y) <= edge(X,Z), path(Z,Y).

which can be easily learned with a basic ILP system from but a few examples.
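To see what evaluating such a declarative model amounts to, here is a minimal sketch (plain Python, with a hypothetical graph), encoding the graph as the binary ‘edge’ relation described above and forward-chaining the two path rules until no new facts can be derived:

# hypothetical graph, encoded as the binary 'edge' relation (a set of ground facts)
edges = {("a", "b"), ("b", "c"), ("c", "d"), ("b", "d")}

def derive_paths(edges):
    # rule 1: path(X,Y) <= edge(X,Y) — every edge is a path
    paths = set(edges)
    changed = True
    while changed:
        changed = False
        # rule 2: path(X,Y) <= edge(X,Z), path(Z,Y) — extend paths by one edge
        for (x, z) in edges:
            for (z2, y) in list(paths):
                if z == z2 and (x, y) not in paths:
                    paths.add((x, y))
                    changed = True
    return paths

print(("a", "d") in derive_paths(edges))   # True: the query path(a,d) succeeds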

Interestingly, this is in direct contrast to, e.g., tackling the same problem with a Differentiable Neural Computer — one of DeepMind’s recent highlights — which required a lot of examples and additional hacking (e.g., pruning out invalid path predictions) to tackle the task with an “inappropriate” tensor (propositional) representation (emulating a differentiable memory).

Statistical Relational Learning

While substantially more expressive in representation, learning with logic alone is not well suited for dealing with noise and uncertainty.⁵ To tackle this issue, many methods arose to merge the expressiveness of relational logic, adopted from ILP, with probabilistic modeling, adopted from classic statistical learning, under the notion of Statistical Relational Learning (SRL)⁶, which covers the learning of models from complex data that exhibit both uncertainty and a rich relational structure. In particular, SRL has extended ILP with techniques inspired by the non-logical learning world, such as kernel-based methods and graphical models.

Generally, there have been two major streams of approaches in SRL — probabilistic logic programming and lifted modeling, which will serve as a foundation for our further exploration of the deep relational learning concept.

Lifted models

As opposed to standard (aka “ground”) machine learning models, lifted models do not specify a particular computational structure, but rather a template from which the standard models are unfolded as part of the inference (evaluation) process, given the varying context of the relational input data (and, possibly, also the background knowledge).

For instance, the (arguably) most popular lifted model — a Markov Logic Network (MLN) [7] — may be seen as such a template for classic Markov networks. For prediction and learning, an MLN is combined with a particular set of relational facts describing the input data (e.g., a database), and unfolds into a classic Markov network. Let’s take a closer look at that.

Example. Such an MLN template may express a general prior that “friends of smokers tend to be smokers” and that “smoking may lead to cancer”. The learning data may then describe a social network of people with a subset of smokers labeled as such. The lifted modeling paradigm of MLNs then allows us to infer the smoking probabilities of all the other people based on their social relationships, as if modeled by a regular Markov network, yet systematically generalizing over social networks of diverse structures and sizes!

A Markov Logic Network template, encoding an intuition about smoking habits, unfolded, given two people {a,b}, into a standard Markov network with shared weights. Image by the author (source, inspired by [7]).
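To illustrate the unfolding itself, here is a minimal sketch (plain Python, not an actual MLN implementation, with a made-up weight), grounding the “friends of smokers tend to be smokers” template rule over a hypothetical domain of two people {a, b}, with all resulting ground formulas tied to the single template weight:

from itertools import product

people = ["a", "b"]          # the domain (constants) extracted from the input data
w = 1.1                      # a single learnable weight attached to the template rule

# ground the lifted rule friends(X,Y) & smokes(X) => smokes(Y) for all pairs of people
ground_formulas = [
    (w, f"friends({x},{y}) & smokes({x}) => smokes({y})")
    for x, y in product(people, repeat=2) if x != y
]
for weight, formula in ground_formulas:
    print(weight, formula)
# 2 ground formulas for {a,b}, 90 for ten people, ... all sharing the one weight;
# together with the ground atoms, they define a standard Markov network.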

Importantly, this also allows the lifted models to capture the inherent symmetries in learning problems⁸, such as the regularity of the friendship relation across all the different people in the network, by tying their parameters.

This parameter sharing can then significantly reduce the number of weights that have to be learned, and allow the lifted models to convey a highly compressed representation of the problem, since all the regular relational patterns (symmetries) are parameterized jointly by the single template. This in turn allows for better generalization.

  • Additionally, exploiting the symmetries with this lifted modeling approach can also significantly speed up the evaluation itself, which is commonly known as “lifted inference” in SRL.

Many interesting concepts have been proposed within the SRL community; however, with the increased expressiveness, they often also inherited the problematic computational complexity of ILP. Also, being focused mostly on probabilistic (graphical) models, the developed systems commonly lack the efficiency, robustness, and deep representation learning abilities of the neural networks we have become so used to enjoying.

Consequently, their usage in real-life applications is far from where we see deep learning these days. The neural models, on the other hand, have still been largely limited to the fixed-size tensor (propositional) representations which, as explained in this article, cannot correctly capture the unbounded, dynamic, and irregular nature of the structured learning representations, for which the relational logic formalism is the natural choice.

In the next article, we will then delve into the history of “Neural-Symbolic Integration”, aimed at combining symbolic logic with neural networks. This will provide some further background for our path towards the desired unification of modern structured deep learning models, such as Graph Neural Networks, with relational logic into an expressive “Deep Relational Learning”.

1. Often also referred to as predicate or first-order logic (which additionally also introduces logical function symbols that will not be needed here).

2. Relational logic is not even limited to relational databases, but further allows us to cover all the rich knowledge bases and fancy deductive ontologies.

[3] Cropper, A., Dumančić, S., Evans, R. et al. Inductive logic programming at 30. Mach Learn (2021). https://doi.org/10.1007/s10994-021-06089-1

4. These form the basis of logic programming in languages like Datalog and Prolog, which makes this ML approach conceptually close to the field of program synthesis (due to the high expressiveness of the learned models).

5. The uncertainty in relational learning naturally arises from the data on many levels, from the classic uncertainty about the values of attributes of an object to uncertainty about its type(s), membership within a relationship, and the overall numbers of objects and relations in scope.

6. Some other terms for this research domain include multi-relational learning/data-mining or probabilistic logic learning. Structured prediction models can also be seen as an instance of SRL.

[7] Richardson, Matthew, and Pedro Domingos. “Markov logic networks.” Machine learning 62.1–2 (2006): 107–136.

8. Later also extrapolated (by Pedro Domingos, again) to neural networks in:
Gens, Robert, and Pedro Domingos. “Deep symmetry networks”. Advances in neural information processing systems 27 (2014): 2537–2545.
