Deep Learning Approaches to Understanding Human Reasoning
For a doctor who is using Deep Learning to find out whether a patient has multiple sclerosis, a bare yes-or-no answer from the model is not good enough. For a safety-critical application such as autonomous cars, it is not enough to predict the crash. There is an urgent need to make machine learning models reason about their assertions and articulate that reasoning to humans. The Visual Question Answering work by Devi Parikh and Dhruv Batra [17] and the work on understanding visual relationships by Fei-Fei Li's team [16] are a few leads towards achieving this, but there is a long way to go in terms of learning reasoning structures. So in this blog, we are going to talk about how to integrate reasoning into CNNs and Knowledge Graphs.
For a long time, reasoning was understood to be a collection of deductions and inductions. The study of abstract symbolic logic allowed these concepts to be canonicalized, as described by John Venn [1] in 1881. It's like those IQ tests that we take: A implies B, B implies C, so A implies C, and so on. Think of it as a bunch of logical equations.
But this idea of fixed inductive/deductive reasoning was dismantled in 1975 by L.A. Zadeh [2], who described the concept of approximate reasoning. He also introduced the term linguistic variable (age = young, very young, quite young, old, quite old, very old) as opposed to numeric variables (age = 21, 15, 19, 57, 42, 72), which forms the bedrock of fuzzy logic via words [3]. This is a standardization that takes care of the fuzziness, or ambiguity, in reasoning.
For example, in our day-to-day language, we don't say "I am talking to a 21-year-old male of height 173 cm"; we say "I am talking to a tall young guy". Fuzzy logic, therefore, takes the vagueness of the argument into consideration when constructing reasoning models.
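To make the idea of a linguistic variable concrete, here is a minimal sketch in Python: an age belongs to the fuzzy sets "young" and "old" to a degree between 0 and 1, instead of being crisply one or the other. The breakpoints (25, 45, 40, 65) are my own illustrative choices, not values from Zadeh's paper.

```python
# A minimal sketch of Zadeh-style linguistic variables: "young" and "old" are
# fuzzy sets over age, and an age belongs to each to some degree in [0, 1].
# The breakpoints below are illustrative assumptions, not from the source.

def young(age):
    """Membership of `age` in the fuzzy set 'young'."""
    if age <= 25:
        return 1.0
    if age >= 45:
        return 0.0
    return (45 - age) / 20.0  # linear ramp from 1 at age 25 down to 0 at 45

def old(age):
    """Membership of `age` in the fuzzy set 'old'."""
    if age <= 40:
        return 0.0
    if age >= 65:
        return 1.0
    return (age - 40) / 25.0  # linear ramp from 0 at age 40 up to 1 at 65

# Fuzzy logic replaces crisp true/false with degrees, so a 21-year-old maps
# naturally onto "young" rather than onto an exact number.
for a in (21, 35, 57, 72):
    print(a, "young:", round(young(a), 2), "old:", round(old(a), 2))
```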
In spite of incorporating fuzziness, this still couldn't capture the essence of human reasoning. One explanation is that, apart from simple deductions like "A is not B, B is C, so A is not C", there is an overwhelming element of implicit reasoning in the human rationale. Within a flash, humans can deduce things without going through a sequence of steps. Sometimes it is instinctive too. If you have a pet dog, then you know what it does when you snatch a toy from its mouth.
Humans display a phenomenal ability to abstract and improve explicit forms of reasoning (One Shot, Differentiable Memory) over time, which means human reasoning is not concocted in purely statistical form. A Statistical Learning [4] based language model is an example of implicit learning, where we do not use any rules, propositions or fuzzy logic; we instead allow temporal models to learn long-range dependencies [5] [6]. You can imagine this as the autocomplete feature on phones.
You can either train reasoning structures to predict the most logical phrase or let the statistical methods predict a probabilistically suitable completion phrase.
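As a toy illustration of the purely statistical route, here is a bigram "autocomplete" built only from co-occurrence counts. The tiny corpus is made up for this example; no rules, propositions or fuzzy logic are involved.

```python
# A minimal sketch of statistical language learning: predict the next word
# purely from bigram counts. The corpus here is an invented toy example.
from collections import Counter, defaultdict

corpus = ("i am talking to a tall young guy . "
          "i am talking to a young guy . "
          "she is a young doctor .").split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def complete(word):
    """Return the statistically most likely next word after `word`."""
    counts = bigrams.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(complete("am"))      # 'talking'
print(complete("young"))   # 'guy' (seen twice) beats 'doctor' (seen once)
```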
These kinds of models fail on rare occurrences of words or images, because they forget such information due to its rarity. They also fail to generalize a concept. For instance, if we see one type of cow, we are able to generalize our learning to all other types of cows. If we hear an utterance once, we are able to recognize its variants across different accents, dialects and prosody.
One-shot learning [7] paves the way to learn a rare event based on our ability to make use of past knowledge, no matter how unrelated it is. If a person has only ever seen squares and triangles from birth (like the infamous cat experiment [8]) and is then exposed to a deer for the first time, that person will not just remember it as an image, but also sub-consciously store its similarity w.r.t. squares and triangles. For one-shot learning, a memory bank becomes imperative. Memory has to interact with the core model to make it learn efficiently and reason faster.
I know you might be struggling with this term, One Shot. So here's a simple example where we use ImageNet for one-shot learning. Think of the 1000 classes of ImageNet (monkey, human, car, etc.) as judges in a reality show, each giving a score based on how likely the image is to be a monkey, a human, and so on.
Let's assume there is a 1001st class for which the model is not trained. If I take two items from this class, then no judge would give a confident score, but if we look at the 1000-dimensional score vector for both items, they may be similar. For example, a Galapagos lizard may get more upvotes from the crocodile and lizard judges than from any other class judge. The judges are bound to give similar scores to images of the Galapagos lizard, even though it is not in the class list and there isn't a single image of it in the training data. This feature-similarity based clubbing is the simplest form of one-shot learning. The sketch below illustrates the idea.
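Here is a rough sketch of the judging analogy using a pretrained ImageNet classifier from torchvision. The image file names are hypothetical; two images of an unseen class should receive similar 1000-dimensional score vectors, which we can compare with cosine similarity.

```python
# A minimal sketch of the "judges" analogy: run images of a class the network
# was never trained on through an ImageNet classifier and compare their
# 1000-dimensional class-score vectors. File names below are hypothetical.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# The 1000 ImageNet "judges". Newer torchvision versions prefer
# weights="IMAGENET1K_V1" instead of pretrained=True.
model = models.resnet18(pretrained=True)
model.eval()

def class_scores(path):
    """Softmax over the 1000 ImageNet classes for one image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return F.softmax(model(x), dim=1).squeeze(0)

a = class_scores("galapagos_lizard_1.jpg")  # two images of an unseen
b = class_scores("galapagos_lizard_2.jpg")  # 1001st class
c = class_scores("random_car.jpg")          # an unrelated image

# Images of the same unseen class should get similar votes from the judges.
print("same class similarity:     ", F.cosine_similarity(a, b, dim=0).item())
print("different class similarity:", F.cosine_similarity(a, c, dim=0).item())
```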
Recent work on Memory-Augmented Neural Networks by Santoro et al. [9] automates the interaction with memory via differentiable memory operations, inspired by Neural Turing Machines [10].
So the network learns to decide which feature vectors it considers useful enough to store in a differentiable memory block, along with a class it has never seen. This representation keeps evolving. It gives the neural network the ability to learn "how to learn quickly", which is why they call it meta-learning. So it starts behaving more like humans; our ability to relate the past with the present is fantastic. For instance: "Even if I have never seen this weird alien creature, I can still say that it looks more like a baboon, or a gorilla with the horns of a cow."
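The sketch below shows the core mechanic of a differentiable memory read, content-based addressing, in isolation. It is not Santoro et al.'s full model; the memory size and the key-strength value are illustrative assumptions.

```python
# A minimal sketch of a differentiable memory read: content-based addressing
# attends over memory rows by cosine similarity, so the read is a soft,
# fully differentiable weighted sum rather than a hard lookup.
import torch
import torch.nn.functional as F

N, D = 128, 64                  # memory slots x feature width (illustrative)
memory = torch.randn(N, D)      # rows written by the controller earlier
key = torch.randn(D)            # query emitted by the network at this step

# Cosine similarity between the key and every memory row, sharpened and
# normalized into attention weights (every operation is differentiable).
sims = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)
weights = F.softmax(10.0 * sims, dim=0)   # 10.0 is an assumed "key strength"

read_vector = weights @ memory  # soft lookup: a blend of the relevant rows
print(read_vector.shape)        # torch.Size([64])
```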
The key takeaways from this discussion are that:
1. Purely explicit reasoning based on fuzzy logic fails to capture the essence of human reasoning.
2. Implicit models like traditional one-shot learning are incapable of generalizing or learning from rare events by themselves. They need to be augmented with memory.
This augmented memory structure can either be in the form of an LSTM, as presented by Cho and Sutskever [5] [6], or it can be a dynamic lookup table, as in Santoro's recent work [9]. This dynamic lookup can further be enriched with external knowledge graphs [11], as proposed by Sungjin Ahn from Bengio's lab. This is called the Neural Knowledge Language Model.
Let's say you want to learn how to complete an incomplete sentence. I can do this via a simple sequence-to-sequence model, but it will not do well with rare named entities; it would have rarely heard "Crazymuse" before. But if we learn to fetch named entities from knowledge graphs, by identifying the topic or relation and also deciding whether to fetch from the LSTM or from the knowledge graph, then we can make it complete the sentence even with rare named entities. This is a really awesome way to combine the power of rich knowledge graphs and neural networks. Thanks to the Reddit ML group and its "What are you reading" thread, I got to read a curated set of papers. A rough sketch of the gating idea follows.
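This is a highly simplified sketch of the gating idea, not the exact Neural Knowledge Language Model from [11]: a learned gate mixes the ordinary vocabulary softmax with a distribution over knowledge-graph facts about the current topic. All sizes and module names here are illustrative assumptions.

```python
# A minimal sketch (not the exact model in [11]) of letting a language model
# decide, at each step, whether the next token comes from its vocabulary or
# from entities fetched out of a knowledge graph for the current topic.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, vocab_size, num_facts = 256, 10_000, 50   # illustrative sizes

h = torch.randn(1, hidden_size)                  # LSTM hidden state at this step
fact_keys = torch.randn(num_facts, hidden_size)  # embeddings of KG facts on the topic

vocab_head = nn.Linear(hidden_size, vocab_size)  # ordinary word prediction
gate_head = nn.Linear(hidden_size, 1)            # "copy from the KG?" decision

p_vocab = F.softmax(vocab_head(h), dim=1)        # distribution over common words
p_fact = F.softmax(h @ fact_keys.t(), dim=1)     # distribution over KG entities
g = torch.sigmoid(gate_head(h))                  # probability of fetching from the KG

# Rare named entities like "Crazymuse" can get probability mass through p_fact
# even if the LSTM alone has barely ever seen them.
p_next = torch.cat([(1 - g) * p_vocab, g * p_fact], dim=1)
print(p_next.shape)   # (1, vocab_size + num_facts)
```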
What we just learnt opens up a host of possibilities in terms of reasoning and inference, because the knowledge representation (subject, predicate, object) allows us to perform more complex reasoning tasks, similar to explicit fuzzy logic, alongside implicit statistical learning.
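A tiny example of what the triple representation buys us: explicit rules can be chained over (subject, predicate, object) facts, exactly the kind of "A is B, B is C, so A is C" deduction from earlier. The facts and the rule here are made up for illustration.

```python
# A minimal sketch of explicit reasoning over (subject, predicate, object)
# triples: a hand-written transitive rule chained over a toy fact set.
triples = {
    ("galapagos_lizard", "is_a", "reptile"),
    ("reptile", "is_a", "animal"),
    ("reptile", "has", "scales"),
}

def infer_is_a(entity, facts):
    """Transitively follow 'is_a' edges: a simple explicit deduction."""
    found, frontier = set(), {entity}
    while frontier:
        nxt = {o for (s, p, o) in facts if p == "is_a" and s in frontier}
        frontier = nxt - found
        found |= nxt
    return found

# "It is a reptile, a reptile is an animal, so it is an animal."
print(infer_is_a("galapagos_lizard", triples))   # {'reptile', 'animal'}
```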
This ability to learn retrieval from Knowledge Graphs along with attention mechanism [12] [13] can lead towards explainable models.
The availability of question-answering datasets such as SQuAD [14] [15] has helped make significant strides towards inferable language models. Recent works in Visual Question Answering [16] [17] [18] use datasets such as Visual Genome [19], CLEVR [20] and VRD [21] in order to translate an image to an ontology and learn visual relations for improved scene understanding and inference.
But again, despite the advances in question answering based on context and scene understanding, there are a few limitations:
1. The use of LSTMs for memory-based models and learning attention shifts over visual relationships [16] definitely improves the understanding of context and the ability to generalize. But there is a lot more to be done in terms of learning and improving canonical forms of reasoning.
2. The rigidity of a structured convolutional neural network pipeline makes the model opaque to human interpretation. It may be suitable for basic classification and domain-specific generative tasks, but it is not designed for reasoning. If only we could instead directly learn richer multi-modal entity representations in knowledge graphs and ontologies, as proposed in "Never Ending Learning" by Tom Mitchell [24], then we could learn cross-domain reasoning structures and force the model to articulate its understanding of entity relationships better.
I dream of a day when machines will learn to reason. A day when we could ask a machine, "Why do you think this patient has multiple sclerosis?", and it could find words to articulate its reasoning. I am aware of Naftali Tishby's work on the Information Bottleneck principle and Mitchell's [24] work on Never Ending Learning. But what's missing is actively learning abstractions over the basic reasoning structures provided by fuzzy logic. You can drive this by learning an optimal policy via rewards, by some kind of validation based on one-shot learning principles, or by some semi-supervised graph-based approach. But whatever the driving factor, the model needs to learn to improve its reasoning. The model needs to learn to associate this reasoning engine with rich feature representations coming from sounds or images, which may even catalyze a loop of "improve representation, improve reasoning, improve representation, improve reasoning", just like policy iteration. And most importantly, the model should be able to articulate its abstractions to humans and say, "Hey humans, I think cats are cute, because their eyes resemble those of babies and are full of life, unlike your monotonous routine."
Till then, let’s keep training models and keep dreaming of the day the model is up and running. Because dreams are becoming reality faster than you can imagine!
About Me
Jaley is a YouTuber and creator at Edyoda (www.edyoda.com). He has been a senior data scientist at Harman in the past and is super-curious about the structure of human reasoning.
Show your support for independent, well-researched publications by giving claps.
My mini-course related to Knowledge Graphs
Course Link: Knowledge Graphs and Deep Learning
References
[1] John Venn. Symbolic logic. 1881.
[4] Eugene Charniak. Statistical language learning. MIT press, 1996.
[11] Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. A neural knowledge language model. arXiv preprint arXiv:1608.00318, 2016.
[12] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.