Towards Deep Relational Learning

What is Neural-Symbolic Integration?

A closer look into the history of combining symbolic AI with deep learning

Gustav Šír
Towards Data Science
14 min read · Feb 14, 2022

Neural-Symbolic Integration aims primarily at capturing symbolic and logical reasoning with neural networks. (Image from pixabay)

For almost a decade now, deep learning has been the driving force behind most of the progress, success, and hype in the AI landscape. It has taken over the field so rapidly that many now commonly treat the two domains as equivalent.

Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning

And while the current success and adoption of deep learning largely overshadowed the preceding techniques, these still have some interesting capabilities to offer. In this article, we will look into some of the original symbolic AI principles and how they can be combined with deep learning to leverage the benefits of both of these seemingly unrelated (or even contradictory) approaches to learning and AI.

A Glimpse into the History of AI

Historically, the two encompassing streams of symbolic and sub-symbolic stances to AI evolved in a largely separate manner, with each camp focusing on selected narrow problems of their own. Originally, researchers favored the discrete, symbolic approaches towards AI, targeting problems ranging from knowledge representation, reasoning, and planning to automated theorem proving.

While the particular techniques in symbolic AI varied greatly, the field was largely based on mathematical logic, which was seen as the proper (“neat”) representation formalism for most of the underlying concepts of symbol manipulation. With this formalism in mind, people used to design large knowledge bases, expert and production rule systems, and specialized programming languages for AI.

These symbolic logic representations have then also been commonly used in the machine learning (ML) sub-domain, particularly in the form of Inductive Logic Programming (discussed in the previous article), which introduced the powerful ability to incorporate background knowledge into learning models and algorithms.

The main advantages of this logic-based approach to ML have been its transparency to humans, its deductive reasoning, the easy inclusion of expert knowledge, and structured generalization from small data.

However, there have also been some major disadvantages, including computational complexity and the inability to handle the noise, numerical values, and uncertainty of real-world problems. Due to these limitations, most of the symbolic AI approaches remained in their elegant theoretical forms and never saw larger practical adoption in applications (as compared to what we see today).

Meanwhile, with the progress in computing power and amounts of available data, another approach to AI has begun to gain momentum. Statistical machine learning, originally targeting “narrow” problems, such as regression and classification, has begun to penetrate the AI field.

This only escalated with the arrival of the deep learning (DL) era, with which the field became completely dominated by sub-symbolic, continuous, distributed representations, seemingly ending the story of symbolic AI.

The Rise of Deep Learning

The concept of neural networks (as they were called before the deep learning “rebranding”) has actually been around, with various ups and downs, for a few decades already. It dates all the way back to 1943 and the introduction of the first computational neuron [1]. Stacking these on top of each other into layers became quite popular already in the 1980s and ’90s. However, at that time they were still mostly losing the competition against more established, and theoretically better substantiated, learning models such as SVMs.

The true resurgence of neural networks then began with their rapid empirical success in increasing accuracy on speech recognition tasks in 2010 [2], launching what is now mostly recognized as the modern deep learning era. Shortly afterward, neural networks started to demonstrate the same success in computer vision, too.

Facing the undeniable effectiveness of neural networks on these standard benchmarks for machine perception, researchers slowly (and sometimes reluctantly) started abandoning their advanced feature extraction pipelines designed for the SVMs to adopt the new practice of neural architecture crafting instead.

With this paradigm shift, many variants of the neural networks from the ’80s and ’90s have been rediscovered or newly introduced. Benefiting from the substantial increase in the parallel processing power of modern GPUs, and the ever-increasing amount of available data, deep learning has steadily paved its way to completely dominating (perceptual) ML.

The typical (symmetric) pattern of shared weights in a convolutional neural network ingesting tensor samples. Image by the author.

Intermezzo on CNNs. One of the most successful neural network architectures has been the Convolutional Neural Network (CNN) [3]⁴ (tracing back to 1982’s Neocognitron [5]). The distinguishing features introduced with CNNs were the use of shared weights and the idea of pooling.

The shared weights induced by the application of convolutional filters introduce equivariance w.r.t. the respective transformation (the shifting) of the filter, while incorporating an aggregation function on top via pooling extends this further into invariance w.r.t. that transformation. This technique has since proved extremely useful in various tasks involving translation (shift) symmetries.
(We will further explore these concepts in more detail in a follow-up article.)
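To make these two properties concrete, here is a minimal numpy sketch (an illustration of the principle only, not code from any cited system): a single shared filter produces a feature map that shifts along with the input pattern (equivariance, up to the input boundary), and a global max-pooling on top then yields the very same output regardless of the shift (invariance).

```python
import numpy as np

def conv1d(x, w):
    # "valid" cross-correlation of input x with a single shared filter w
    return np.array([x[i:i + len(w)] @ w for i in range(len(x) - len(w) + 1)])

x = np.array([0., 1., 3., 1., 0., 0., 0., 0.])  # a pattern in the input
w = np.array([1., 2., 1.])                      # the shared weights (one filter)

shifted = np.roll(x, 2)                         # translate the pattern by 2 steps

fm, fm_shifted = conv1d(x, w), conv1d(shifted, w)
print(fm)          # [5. 8. 5. 1. 0. 0.]
print(fm_shifted)  # [0. 1. 5. 8. 5. 1.] -> same response, shifted (equivariance)
print(fm.max(), fm_shifted.max())  # 8.0 8.0 -> pooled output is shift-invariant
```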

Differentiable Programming

Driven heavily by the empirical success, DL then largely moved away from the original biologically (brain-)inspired models of perceptual intelligence towards a “whatever works in practice” kind of engineering approach. In essence, the concept evolved into a very generic methodology of using gradient descent to optimize the parameters of almost arbitrary nested functions, leading many to rebrand the field yet again as differentiable programming. This view then made even more space for all sorts of new algorithms, tricks, and tweaks that have been introduced under various catchy names for the underlying functional blocks (still consisting mostly of various combinations of basic linear algebra operations).
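As a small hedged illustration of this view (a sketch using PyTorch for the automatic differentiation), note that nothing below needs to be interpreted as a neural network; it is just an arbitrary nested, differentiable “program” whose parameters get fitted by gradient descent:

```python
import torch

# parameters of an arbitrary nested, differentiable "program"
a = torch.tensor([1.5], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

def program(x):
    # an arbitrary composition of differentiable functions, not a classic "layer"
    return torch.tanh(a * x) + b * x ** 2

xs = torch.linspace(-1, 1, 50)
target = torch.tanh(2.0 * xs) + 0.5 * xs ** 2   # data generated with a=2.0, b=0.5

optimizer = torch.optim.Adam([a, b], lr=0.05)
for _ in range(500):
    optimizer.zero_grad()
    loss = ((program(xs) - target) ** 2).mean()
    loss.backward()   # the chain rule, propagated through the whole composition
    optimizer.step()

print(a.item(), b.item())   # should approach 2.0 and 0.5
```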

But with this evolution, we have also started seeing renewed interest from the deep learning community in the structured, program-like, symbolic representations, albeit in a perhaps unintended manner. Indeed, modern neural architectures are no longer about stacking more and more fully-connected layers over bigger and bigger datasets; instead, an increasing amount of prior knowledge, in some form of structural bias, is being incorporated into the differentiable programs in order to tackle more complex tasks requiring higher levels of abstraction.⁷

Exactly how much of this prior knowledge to include in the models has then been a subject of many heated debates between researchers in AI.⁸

Neural-Symbolic Integration

While slowly saturating benchmarks at super-human perception levels, DL has been expanding further into the most varied domains that were originally considered of a rather symbolic nature — starting with game playing and language modeling, and progressing to programming, algorithmic reasoning, and even theorem proving.

While it might now seem that, with enough tweaking, large neural networks are on their way to solving every AI problem at hand, it is good to point out that the concept of a neural network itself lacks many of the capabilities deemed elementary to AI systems, such as capturing relations and compositional structure, symbolic reasoning with abstract concepts, robustness, and transparency [9] (see the posts from Gary Marcus on the topic here on Medium!).

These abilities are, on the other hand, naturally provided by the logic-based methods. Attempts at an efficient integration of these two complementary AI streams have thus become of great interest to researchers.¹⁰

It has now been argued by many that a combination of deep learning with the high-level reasoning capabilities present in the symbolic, logic-based approaches is necessary to progress towards more general AI systems [9,11,12].

While the interest in the symbolic aspects of AI from the mainstream (deep learning) community is quite new, there has actually been a long stream of research focusing on the very topic within a rather small community called Neural-Symbolic Integration (NSI) for learning and reasoning [12].

NSI has traditionally focused on emulating logic reasoning within neural networks, providing various perspectives into the correspondence between symbolic and sub-symbolic representations and computing. Historically, the community targeted mostly the analysis of this correspondence and of theoretical model expressiveness, rather than practical learning applications (which is probably why it has been marginalized by mainstream research).

However, given the aforementioned recent evolution of the neural/deep learning concept, the NSI field is now gaining more momentum than ever.

From Symbols to Neurons

Perhaps surprisingly, the correspondence between the neural and logical calculus has been well established throughout history, due to the discussed dominance of symbolic AI in the early days.

Looking again — a bit closer — at the first proposal of a computational neuron from the 1943 paper “A logical calculus of the ideas immanent in nervous activity” by McCulloch and Pitts [1], we can see that it was actually designed to emulate logic gates over input (binary-valued) propositions. The idea was based on the, now commonly exemplified, fact that the logical connectives of conjunction and disjunction can be easily encoded by binary threshold units with weights — i.e., the perceptron, for which an elegant learning algorithm was introduced shortly afterwards.

The logical AND and OR functions can be represented simply with a single thresholding neuron (perceptron). (Image from [13] by the author’s student Martin Krutsky)
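For concreteness, the figure’s idea can be sketched in a few lines of Python (an illustration of the principle; the weights and thresholds below are just one possible choice):

```python
def threshold_unit(inputs, weights, bias):
    # the McCulloch-Pitts neuron: fire iff the weighted input exceeds the threshold
    return int(sum(w * x for w, x in zip(weights, inputs)) + bias > 0)

AND = lambda x, y: threshold_unit((x, y), weights=(1, 1), bias=-1.5)
OR  = lambda x, y: threshold_unit((x, y), weights=(1, 1), bias=-0.5)

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, "AND:", AND(x, y), "OR:", OR(x, y))
```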

The inability of this linear perceptron to compute the logical XOR function was then seen as an important limitation, which had a profound negative impact on the field. It was known that this could be easily resolved by stacking multiples of these perceptrons on top of each other, making it possible to represent even more complicated nested logical functions. However, an efficient learning algorithm for the weights of such “neural networks” was missing at the time, leading the majority of researchers (and funding) to abandon this idea of connectionism in favor of the symbolic, and other statistical, methods.¹⁴

The XOR function, which can be thought of as a compound of OR and NAND, is only representable by stacking the neurons into a neural network with a hidden layer. (Image from [13] by the author’s student Martin Krutsky)
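Extending the previous sketch, the XOR function can then be composed exactly as in the figure, by stacking an AND unit on top of a hidden layer computing OR and NAND:

```python
def threshold_unit(inputs, weights, bias):
    return int(sum(w * x for w, x in zip(weights, inputs)) + bias > 0)

OR   = lambda x, y: threshold_unit((x, y), weights=(1, 1), bias=-0.5)
NAND = lambda x, y: threshold_unit((x, y), weights=(-1, -1), bias=1.5)
AND  = lambda x, y: threshold_unit((x, y), weights=(1, 1), bias=-1.5)

# XOR(x, y) = AND(OR(x, y), NAND(x, y)) -- a network with one hidden layer
XOR = lambda x, y: AND(OR(x, y), NAND(x, y))

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, "XOR:", XOR(x, y))   # 0, 1, 1, 0
```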

It wasn’t until the 1980s that the chain rule for differentiating nested functions was introduced as the backpropagation method for calculating gradients in such neural networks, which could, in turn, be trained by gradient descent methods. For that, however, researchers had to replace the originally used binary threshold units with differentiable activation functions, such as the sigmoids, which started opening a gap between the neural networks and their crisp logical interpretations.
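To illustrate (a hedged numpy sketch, not any particular historical implementation): with sigmoids in place of the hard thresholds, the chain rule yields gradients for the whole stack, and the XOR network from above can now be learned from data rather than hand-designed, at the price of soft, non-crisp truth values:

```python
import numpy as np
rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # the XOR targets

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))      # differentiable activation
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)     # hidden layer (4 units)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)     # output layer

for _ in range(5000):
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backpropagation: the chain rule applied through both sigmoid layers
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(0)

print(out.round(2).ravel())   # should approach [0, 1, 1, 0] (XOR may still get
                              # stuck in a local optimum for an unlucky init)
```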

From Logic to Deep Learning

These old-school parallels between individual neurons and logical connectives might seem outlandish in the modern context of deep learning.

However, interestingly, even the modern idea of deep learning was not originally described as bound to neural networks only, but rather universally as covering “methods modeling hierarchical composition of useful concepts” that are reused in different inference paths from the input samples to the target variables [15].

And while these concepts are commonly instantiated by the computation of hidden neurons/layers in deep learning, such hierarchical abstractions are generally very common to human thinking and logical reasoning, too.

  • In fact, logic is the very science of deducing useful concepts from simpler premises in a hierarchical (nested) fashion. When constructing a proof in logic, auxiliary lemmas are often created to reduce the complexity of the theory in scope in a very similar manner.

Thus, while the hierarchical levels of abstraction are typically represented by the hidden layers of neural networks, they may also be thought of as “complicated propositional formulae re-using many sub-formulae” (quoting the abstract of “Learning Deep Architectures for AI” by Y. Bengio [15]).

And from the NSI perspective, this is not just some metaphor: a number of actual systems reflecting this paradigm continued to explicitly demonstrate the correspondence between the hierarchical structure of logical inference and classic neural networks throughout the ’90s, such as the popular Knowledge-Based Artificial Neural Network (KBANN) [16].

Sample of the KBANN method: (a) a propositional rule set; (b) the rules viewed as an AND-OR dependency graph; (c) each proposition is represented as a unit (extra units are also added to represent disjunctive definitions, e.g., b), and their weights and biases are set so that they implement AND or OR gates, e.g., the weights b->a and c->a are set to 4 and the bias (threshold) of unit a to -6; (d) low-weighted links are added between layers as a basis for future learning (e.g., an antecedent can be added to a rule by increasing one of these weights). Image with original caption from [17] by Richard Maclin.
  • KBANN is a hybrid learning system built on top of connectionist learning techniques that maps, in the presented spirit, problem-specific “domain theories”, represented by propositional logic programs, into feed-forward neural networks, and then refines this reformulated knowledge using backpropagation.
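To illustrate the translation step, here is a hedged reconstruction following the weight-setting scheme from the caption above, applied to a small hypothetical rule set in the figure’s spirit (the actual KBANN further uses sigmoid units and the added low-weighted links to refine the network by backpropagation):

```python
# units follow the caption's scheme: antecedent weights of 4, with the bias
# (threshold) set so the unit acts as an AND, or an OR, over its inputs
def make_and(n): return [4.0] * n, -(4.0 * n - 2.0)  # fires iff all n inputs fire
def make_or(n):  return [4.0] * n, -2.0              # fires iff any input fires

def fire(inputs, weights, bias):
    return int(sum(w * x for w, x in zip(weights, inputs)) + bias > 0)

# hypothetical domain theory:  a :- b, c.    b :- d.    b :- e, f.
def network(d, e, f, c):
    b1 = fire([d], *make_and(1))       # rule unit for  b :- d
    b2 = fire([e, f], *make_and(2))    # rule unit for  b :- e, f
    b = fire([b1, b2], *make_or(2))    # extra OR unit for b's disjunctive definitions
    a = fire([b, c], *make_and(2))     # head unit for  a :- b, c  (weights 4, bias -6)
    return a

print(network(d=1, e=0, f=0, c=1))   # 1: a is derived via b :- d
print(network(d=0, e=1, f=0, c=1))   # 0: the rule b :- e, f needs both e and f
```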

This idea has also been later extended by providing corresponding algorithms for symbolic knowledge extraction back from the learned network, completing what is known in the NSI community as the “neural-symbolic learning cycle”.

The concept of a neural-symbolic learning cycle. Firstly, data (D) are used to create or learn (P) a symbolic model (S). This is then translated (R) into a neural network (N), upon which structure (T) and weight (W) learning refinements are performed, and a symbolic model is extracted (E) back. The cycle can then continue to iteratively improve the models. Image by the author’s colleague Martin Svatos from [18], where the idea was experimentally evaluated.

However, as imagined by Bengio, such a direct neural-symbolic correspondence was insurmountably limited to the aforementioned propositional logic setting. Lacking the ability to model complex real-life problems involving abstract knowledge with relational logic representations (explained in our previous article), the research in propositional neural-symbolic integration remained a small niche.

Problems with Relational-Level NSI

While the aforementioned correspondence between propositional logic formulae and neural networks has been very direct, transferring the same principle to the relational setting has been a major challenge that NSI researchers have traditionally struggled with. The issue is that, in the propositional setting, only the (binary) values of the existing input propositions change, while the structure of the logical program remains fixed.

This is easy to think of as a boolean circuit (neural network) sitting on top of a propositional interpretation (feature vector). However, relational program input interpretations can no longer be thought of as independent values over a fixed (finite) number of propositions, but rather as an unbound set of related facts that are true in the given world (a “least Herbrand model”). Consequently, the structure of the logical inference on top of this representation can no longer be represented by a fixed boolean circuit, either.

  • Hence, the core obstacle is, again, in encoding the possibly infinite variations of relational inference structures into some finite, fixed structure of a neural network.¹⁹
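The contrast can be made concrete with a toy sketch (hypothetical example data): every propositional sample assigns values to the same fixed set of propositions, while each relational sample is a differently-sized, differently-structured set of ground facts:

```python
# propositional: always the same 3 propositions; only their (binary) values vary,
# so a fixed circuit/network over a 3-dimensional input vector covers every sample
prop_sample_1 = {"rainy": 1, "cold": 0, "windy": 1}
prop_sample_2 = {"rainy": 0, "cold": 1, "windy": 1}

# relational: each sample (a Herbrand interpretation) is an unbound set of facts
# over varying objects -- there is no fixed input vector, and the structure of the
# logical inference (e.g., chaining parent -> grandparent) varies along with it
rel_sample_a = {("parent", "ann", "bob"), ("parent", "bob", "eve")}
rel_sample_b = {("parent", "tom", "ann"), ("parent", "ann", "bob"),
                ("parent", "bob", "eve"), ("parent", "eve", "ian")}
```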

This inability of classic neural networks to capture logic reasoning beyond propositional expressiveness was then often referred to as their propositional fixation, a term coined by John McCarthy [20].²¹

Since then, a number of theoretical ideas have been proposed in the NSI community to deal with the unbound nature of relational logic; however, these were far more exotic²² than the knowledge-based approach to NN modeling outlined above, and never really saw widespread adoption in practice.²³

From a more practical perspective, a number of successful NSI works then utilized various forms of propositionalisation (and “tensorization”) to turn the relational problems into convenient numeric representations to begin with [24]. However, there is a principled issue with such approaches based on fixed-size numeric vector (or tensor) representations: these are inherently insufficient to capture the unbound structures of relational logic reasoning. Consequently, all such methods are merely approximations of the true underlying relational semantics.

  • Note that this issue is intimately related to the problem of relational learning discussed in the last article. It is not surprising then that, to capture relational logic expressiveness with neural networks, a lot of NSI researchers turned to similar techniques that ML practitioners used to deal with relational data representations (e.g., propositionalization, as sketched below).
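As a hedged toy sketch of such propositionalization (reusing the hypothetical relational samples from the sketch above), a few hand-picked existential features flatten each fact set into a fixed-size vector, necessarily discarding most of the relational structure along the way:

```python
rel_sample_a = {("parent", "ann", "bob"), ("parent", "bob", "eve")}
rel_sample_b = {("parent", "tom", "ann"), ("parent", "ann", "bob"),
                ("parent", "bob", "eve"), ("parent", "eve", "ian")}

def propositionalize(facts):
    pairs = {(x, y) for (_, x, y) in facts}
    people = {p for pair in pairs for p in pair}
    # an existential feature: is there any grandparent chain in the sample?
    grandparent = any((x, y) in pairs and (y, z) in pairs
                      for x in people for y in people for z in people)
    # a fixed vector of 3 hand-picked features, whatever the sample's size
    return [len(pairs), len(people), int(grandparent)]

print(propositionalize(rel_sample_a))   # [2, 3, 1]
print(propositionalize(rel_sample_b))   # [4, 5, 1]
```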

Dynamic Neural Computation

However, in the meantime, a new stream of neural architectures based on dynamic computational graphs became popular in modern deep learning to tackle structured data in the (non-propositional) form of various sequences, sets, and trees. Most recently, an extension to arbitrary (irregular) graphs then became extremely popular as Graph Neural Networks (GNNs).

These dynamic models finally make it possible to skip the preprocessing step of turning the relational representations, such as interpretations of a relational logic program, into the fixed-size vector (tensor) format. They do so by effectively reflecting the variations in the input data structures in the structure of the neural model itself, constrained by some shared parameterization (symmetry) scheme reflecting the respective model prior.
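A minimal numpy sketch of this principle (a generic message-passing illustration, not any particular published GNN): the same two shared weight matrices get unrolled over whatever graph arrives, so the model’s computation graph copies the structure of each input sample, while a permutation-invariant readout still produces a fixed-size output:

```python
import numpy as np
rng = np.random.default_rng(0)

DIM = 4
W_self, W_msg = rng.normal(size=(DIM, DIM)), rng.normal(size=(DIM, DIM))  # shared

def gnn_layer(feats, edges):
    # one message-passing step: each node aggregates its in-neighbors' messages
    new = {}
    for v in feats:
        msgs = sum((feats[u] @ W_msg for (u, t) in edges if t == v), np.zeros(DIM))
        new[v] = np.tanh(feats[v] @ W_self + msgs)
    return new

# two input graphs of different sizes and shapes, processed by the same parameters
g1 = ([("a", "b"), ("b", "c")], {v: rng.normal(size=DIM) for v in "abc"})
g2 = ([("x", "y"), ("y", "x"), ("y", "z"), ("z", "w")],
      {v: rng.normal(size=DIM) for v in "xyzw"})

for edges, feats in (g1, g2):
    h = gnn_layer(gnn_layer(feats, edges), edges)   # 2 unrolled layers
    readout = np.mean(list(h.values()), axis=0)     # permutation-invariant pooling
    print(readout.round(2))                         # fixed-size output either way
```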

And while the particular dynamic deep learning models introduced so far, such as the GNNs, are still just somewhat specific graph-propagation heuristics rather than universal (logic) reasoners, the paradigm of dynamic neural computation finally opens the door to properly reflecting relational logic reasoning in neural networks, in the classic spirit of the propositional NSI (e.g., KBANN) outlined above.²⁵

In the next article, we will then explore how the sought-after relational NSI can actually be implemented with such a dynamic neural modeling approach. Particularly, we will show how to make neural networks learn directly with relational logic representations (beyond graphs and GNNs), ultimately benefiting both the symbolic and deep learning approaches to ML and AI.

[1] McCulloch, Warren S., and Walter Pitts. “A logical calculus of the ideas immanent in nervous activity.” The bulletin of mathematical biophysics 5.4 (1943): 115–133.

[2] Hinton, Geoffrey, et al. “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.” IEEE Signal processing magazine 29.6 (2012): 82–97.

[3] LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278–2324.

4. We put emphasis here on CNNs, as their principles will be further important in the follow-up article(s).

[5] Fukushima, Kunihiko, and Sei Miyake. “Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition.” Competition and cooperation in neural nets. Springer, Berlin, Heidelberg, 1982. 267–285.

[6] Weng, Juyang, Narendra Ahuja, and Thomas S. Huang. “Cresceptron: a self-organizing neural network which grows adaptively.” [Proceedings 1992] IJCNN International Joint Conference on Neural Networks. Vol. 1. IEEE, 1992.

7. Note the similarity to the use of background knowledge in the Inductive Logic Programming approach to Relational ML here.

8. While many sworn deep learning proponents would argue, based on the precedent of deep learning dominating feature engineering, that there should be no such prior bias and that all the knowledge should be learned directly from raw data, it is perhaps instructive to look at CNNs, which encode exactly such a structural bias in the form of translation invariance.

[9] Marcus, Gary. “The next decade in ai: four steps towards robust artificial intelligence.” arXiv preprint arXiv:2002.06177 (2020).

10. However, the black-box nature of classic neural models, whose learning abilities are mostly confirmed empirically rather than analytically, renders a direct integration with the symbolic systems (which could provide the missing capabilities) rather complicated.

[11] De Raedt, Luc, et al. “From statistical relational to neuro-symbolic artificial intelligence.” arXiv preprint arXiv:2003.08316 (2020).

[12] Besold, Tarek R., et al. “Neural-symbolic learning and reasoning: A survey and interpretation.” arXiv preprint arXiv:1711.03902 (2017).

[13] Krutský, Martin. Exploring symmetries in deep learning. BSc thesis. Czech Technical University in Prague, 2021.

14. Interestingly, we note that the simple logical XOR function is actually still challenging to learn properly even in modern-day deep learning, which we will discuss in the follow-up article.

[15] Bengio, Yoshua. Learning deep architectures for AI. Now Publishers Inc, 2009.

[16] Towell, Geoffrey G., and Jude W. Shavlik. “Knowledge-based artificial neural networks.” Artificial intelligence 70.1–2 (1994): 119–165.

[17] Maclin, Richard F. Learning from instruction and experience: Methods for incorporating procedural domain theories into knowledge-based neural networks. University of Wisconsin-Madison Department of Computer Sciences, 1995.

[18] Svatos, Martin, Gustav Sourek, and Filip Zelezny. “Revisiting Neural-Symbolic Learning Cycle.” Neural-Symbolic Integration workshop @ IJCAI, 2019.

19. Note the similarity to the propositional and relational machine learning we discussed in the last article.

[20] McCarthy, John. “Epistemological challenges for connectionism.” Behavioral and Brain Sciences 11.1 (1988): 44–44.

21. However, to be fair, such is the case with any standard learning model, such as SVMs or tree ensembles, which are essentially propositional, too.

22. For instance, one prominent idea was to encode the (possibly infinite) interpretation structures of a logic program by (vectors of) real numbers and represent the relational inference as a (black-box) mapping between these, based on the universal approximation theorem. However, this assumes the unbound relational information to be hidden in the unbound decimal fractions of the underlying real numbers, which is naturally completely impractical for any gradient-based learning.

23. We note that this was the state at the time; the situation has changed quite considerably in recent years, with a number of modern NSI approaches now dealing with the problem quite properly.

[24] França, Manoel VM, Gerson Zaverucha, and Artur S. d’Avila Garcez. “Fast relational learning using bottom clause propositionalization with artificial neural networks.” Machine learning 94.1 (2014): 81–104.

25. Concurrently, the GNNs were also recently recognized as a promising future direction by the NSI community, too:
Lamb, Luis C., et al. “Graph neural networks meet neural-symbolic computing: A survey and perspective.” arXiv preprint arXiv:2003.00330 (2020).
