Neural Networks: Error-Prediction Layers

Anthony Repetto
Towards Data Science
6 min read · Jul 24, 2017


Jeff Hawkins, waaay back in 2004, wrote “On Intelligence” — about a peculiar finding in human neuroscience which has yet to be utilized by Deep Learning. It deserves a closer look.

Humans, dolphins, and monkeys have brains unlike other creatures: in our frontal lobes, we have numerous adjacent stacks of multiple types of neurons — like a breakfast table covered in plates, each with its own pile of pancakes, toppings interspersed. And, these stacks of neurons function in a peculiar way:

  • The lowest layer is given the task of predicting the next moment; it gets a dopamine rush, and strengthens its connections, whenever its prediction comes true.
  • The next layer is given the task of predicting when the lower layer is wrong; it gets a rush, and reinforces its learning, every time it correctly anticipates the lower layer’s mistakes.
  • Further layers are likewise given the task of predicting when the layer beneath them is in error; they learn to be error-detectors.
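The layering above can be sketched in a toy program (entirely my own construction, not from Hawkins’ book): layer 0 predicts the next symbol in a sequence, and layer 1 never sees the sequence itself — it only learns to predict where layer 0 will be wrong.

```python
# Toy sketch of stacked error-prediction (a hypothetical illustration):
# a repeating pattern with a deliberate "glitch" every 4th step.
sequence = [0, 1, 0, 9] * 5

# Layer 0: a naive predictor that assumes the 2-step cycle 0, 1 repeats forever.
def layer0_predict(t):
    return t % 2

# Where layer 0 was actually wrong:
errors = [int(layer0_predict(t) != sequence[t]) for t in range(len(sequence))]

# Layer 1: predicts layer 0's errors. It only sees the error signal, and the
# rule it must learn is simpler: "layer 0 fails whenever t % 4 == 3".
def layer1_predict_error(t):
    return int(t % 4 == 3)

hits = sum(layer1_predict_error(t) == errors[t] for t in range(len(sequence)))
print(hits, len(sequence))  # layer 1 anticipates every one of layer 0's misses
```

The point of the toy: the error signal often has a simpler structure than the raw input, which is what makes a dedicated error-predictor worth training.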

Humans have six of these layers in our ‘pancake stacks’, and we have a ‘breakfast table’ with millions of plates, to form our higher reasoning and self-reflective intelligence. This model is completely different from what artificial neural networks do today.

But, Deep Neural Networks work so well…

Yes, they do. And, so do GANs, and LSTM networks. Those methods quickly settle upon good heuristics for compressing data into features. Yet, our brains do more than that.

Let’s compare a Recurrent Neural Network with the human brain. An RNN receives the current state of the environment (its ‘input’ is the pixels on the screen) and generates an action within that environment (the ‘output’ is in the space of possible actions, while the ‘input’ was in the space of possible pixel arrangements). However, the human frontal lobe has a lowest-layer ‘pancake’, which takes the present state of the system (the ‘input’) and tries to predict the future state of the system (the ‘output’ is over the same space as the ‘input’, and the loss function is the difference between the two). The human brain isn’t trying to ‘choose the best move in the game’, the way an RNN would! We’re actually trying to predict what happens next.
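The key structural difference — input and output living in the same space, with the loss being the gap between prediction and reality — can be made concrete with a tiny next-state predictor (a minimal sketch under my own assumptions: linear dynamics, plain SGD):

```python
import numpy as np

# The predictor W maps state -> predicted next state, so its input and output
# share the same space, and the loss is just the prediction error.
rng = np.random.default_rng(0)
state_dim = 4

# Hidden dynamics we are trying to anticipate: next_state = A @ state.
A = rng.normal(size=(state_dim, state_dim)) * 0.3

W = np.zeros((state_dim, state_dim))
lr = 0.1
for _ in range(2000):
    s = rng.normal(size=state_dim)
    target = A @ s                 # ground-truth next state
    pred = W @ s                   # predicted next state
    err = pred - target            # gradient signal for 0.5 * ||err||^2
    W -= lr * np.outer(err, s)     # SGD step

final_error = np.abs(W - A).max()
print(final_error)                 # W converges toward the true dynamics A
```

No action space, no reward shaping — the “task” is nothing but matching the next moment, which is the lowest pancake’s whole job.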

Additionally, an RNN may have many ‘layers’ of neurons, but it represents only a single ‘layer’ of functionality. Pixels in → Action out. Our own ‘pancake’ stacks actually operate like stacks of multiple neural networks. Each ‘pancake’ takes State in → State Prediction out. Higher-layer ‘pancakes’ function as error-detectors because the state that they observe is the map of differences between their lower layer’s ground truth and its prediction. Higher layers only see the pixels where the lower layer made the wrong prediction.

If we wanted a neural network to behave like a frontal lobe, it would need to map Pixel inputs → Pixel Prediction output, and that output would be compared to the ground truth pixels in the following moment, to see where the predictions were wrong. That would be the lowest ‘pancake’. Then, the pixels that were mis-predicted would be the inputs for the next ‘pancake’, and that ‘pancake’ tries to predict where the next moment’s errors will be. That second ‘pancake’ would be a new neural network.
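That handoff — the first pancake’s mis-predicted pixels becoming the second pancake’s input — amounts to computing an error mask, as in this toy (frame sizes and the “nothing changes” predictor are my own placeholder choices):

```python
import numpy as np

# Pancake 1 predicts the next frame of pixels; the pixels it gets wrong
# become the sparse input handed to pancake 2.
rng = np.random.default_rng(1)
frame_t = rng.integers(0, 2, size=(8, 8))   # current binary frame
frame_t1 = frame_t.copy()
frame_t1[3, 3] ^= 1                          # exactly one pixel flips

# Pancake 1's (deliberately naive) prediction: "nothing changes".
prediction = frame_t

# Error mask: only the pixels pancake 1 mis-predicted.
error_mask = (prediction != frame_t1).astype(int)

print(error_mask.sum())  # pancake 2 receives this sparse mask as its input
```

Notice how little survives the comparison: the second network trains on a far sparser, more structured signal than raw pixels.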

We would need six of these neural networks, stacked on top of each other — and to make things more complicated, the higher-layer ‘pancakes’ would also receive signals from multiple feature-detector neurons many layers down the stack. Our sixth ‘pancake’ would receive the errors of the fifth ‘pancake’ as input, as well as some feature-detector signals from the first, second, third, and fourth pancake! That is unlike the existing artificial neural networks, and I think that the difference is important.

How Does It Help?

Currently, researchers expect a single neural network to get things right every time. That’s not how the brain works. In our six ‘pancakes’, the first ‘pancake’ neural network gets things wrong often. If we trained an artificial neural network to be like the lowest ‘pancake’, we would need to slow the learning rate early, and halt long before over-fitting. Our network would still get many answers wrong.

Then, we would need a second ‘pancake’, a second neural network, that receives the lower layer’s mistakes. That layer would also halt very early. It would likely still be mistaken about when the lowest layer will make a mistake. Only when many of these ‘pancakes’ are stacked together, does the error-rate drop significantly.

For number-minded folks: current NLP networks are in error about 4% of the time. Meanwhile, suppose that a ‘frontal lobe’ with six ‘pancakes’ had a lowest ‘pancake’ that was wrong 50% of the time. By itself, that ‘pancake’ would be much worse than our current networks. Yet, its erroneous instances are passed along to a second ‘pancake’. That pancake only looks for the 50% that were incorrect, and we can suppose it corrects 50% of those errors. So far, those two ‘pancakes’, taken together, are correct 75% of the time. With six of those ‘pancakes’, each one correcting just half the remaining errors, the combined accuracy is 98.4375%! So, the stacks of error-detectors can quickly outperform an end-to-end network, even when each error-detector is ‘faulty’.
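The arithmetic is easy to verify: if each of six stacked error-predictors corrects half of whatever errors remain, the residual error halves six times.

```python
# Checking the article's arithmetic: six stacked error-predictors,
# each correcting half of the errors that remain.
error = 1.0
for _ in range(6):
    error *= 0.5          # each pancake halves the remaining error

accuracy = 1 - error
print(accuracy)           # 0.984375, i.e. 98.4375%
```

After two pancakes the loop gives 1 − 0.25 = 75%, matching the two-pancake figure in the text.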

Predicting Errors by Growing Pancakes

Humans have a greater capacity for reason and reflection, and we also have more ‘pancakes’ on our plates! Dolphins have four, and apes and monkeys have fewer. I expect that, if a machine had more ‘pancakes’ than us, each ‘pancake’ attempting to predict the errors of the ‘pancake’ beneath it, that machine would be more capable than us. This opens up a new direction for machine intelligence that learns as it goes along.

Learn-as-you-go: The machine begins with a single deep neural network, which is given the task of predicting the next moment. When its success rate rises above some threshold, add a new deep neural network on top. That new network would receive the lower network’s mis-predictions, and would be tasked with predicting where the next errors will occur. When that network’s success rate rises above a threshold, add a new deep neural network. Continue this process, to successively improve the combination of networks.
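The control flow of that recipe is simple enough to sketch directly (the class, threshold, and training stub below are all my own hypothetical stand-ins, not a real training loop):

```python
# Learn-as-you-go growth loop: train the newest "pancake" until it clears a
# success threshold, then freeze it and grow a new error-predictor on top.
THRESHOLD = 0.9
MAX_STACK = 6

class StubNetwork:
    """Stand-in for a deep net; its success rate climbs as it 'trains'."""
    def __init__(self, level):
        self.level = level
        self.success_rate = 0.0

    def train_step(self):
        self.success_rate = min(1.0, self.success_rate + 0.05)

stack = [StubNetwork(level=0)]
while len(stack) < MAX_STACK:
    top = stack[-1]
    while top.success_rate < THRESHOLD:
        top.train_step()                          # only the newest pancake trains
    stack.append(StubNetwork(level=len(stack)))   # grow a new error-predictor

print(len(stack))  # the stack grew, one pancake at a time
```

Because earlier pancakes are frozen once they pass the threshold, growth never disturbs what was already learned — the stack only gets taller.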

With this ‘pancake’ paradigm, the network responds to new information by growing another deep neural network ‘pancake’ on top of all the old networks. Learning doesn’t ever stop. The stack of ‘pancakes’ just gets taller and taller. This concept becomes even more important, when combined with the Mixture of Experts variety of neural network.

Mixture of Experts

In a Mixture of Experts neural network, neurons are ‘clustered’ like raspberries, into bundles of dense connections. And, like raspberry jam, there are a few long-range connections that ‘glue’ all the raspberries together. Currently, Mixture of Experts models stop there. When an input enters at the bottom of the jam-pile, it activates just a few of the ‘raspberries’, and each of those ‘raspberries’ performs a little bit of feature-detection. Those features activate a few of the ‘raspberries’ that are higher in the jam-pile, where higher-level features are detected.

Moving up the jam-pile, these expert raspberries are able to discover features in diverse inputs; for each subset of inputs, a different set of experts went to work. The Mixture of Experts network behaves like the union of many sparse networks, where each raspberry is the intersection of some of those networks.
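The routing that picks which raspberries fire can be sketched as a top-k gate (a bare-bones toy with made-up sizes, not a full MoE layer — real implementations add load-balancing and train the gate end-to-end):

```python
import numpy as np

# Sparse expert routing: a gate scores every "raspberry" (expert),
# and only the top-k fire for a given input.
rng = np.random.default_rng(2)
n_experts, d, k = 8, 16, 2

gate_w = rng.normal(size=(n_experts, d))            # gating weights
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

def moe_forward(x):
    scores = gate_w @ x
    top_k = np.argsort(scores)[-k:]                  # only k experts activate
    weights = np.exp(scores[top_k])
    weights /= weights.sum()                         # softmax over chosen experts
    out = sum(w * (experts[i] @ x) for i, w in zip(top_k, weights))
    return out, top_k

x1, x2 = rng.normal(size=d), rng.normal(size=d)
out1, active1 = moe_forward(x1)
out2, active2 = moe_forward(x2)
print(sorted(active1), sorted(active2))  # different inputs can wake different experts
```

Each input touches only k of the n_experts raspberries, which is exactly the “union of many sparse networks” behavior described above.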

(In a crowded Venn diagram of sparse networks, each network has feature-detectors which overlap a little bit with any other network; because there are so many networks together, the combined diagram allows most sparse networks to share most of their feature-detectors with someone else. For example, sparse network #1 might utilize feature-detector ‘raspberries’ A, B, and C. Meanwhile, sparse network #2 uses A, D, and E. Network #3 needs the ‘raspberries’ that detect features C, D, and E. So, you can combine these three sparse networks into a Mixture of Experts that detects all of the features: A, B, C, D, and E. Each sparse network overlaps each other sparse network in just a few places. But, taken together, all the experts have overlap.)

All the Toppings

Back to the ‘pancakes’ that compose our frontal lobes. For it to match our own brains, the pancake metaphor needs to be even more elaborate: our neurons exhibit links across the stacks, from one plate to another. This is like a breakfast table covered in plates, with six ‘pancakes’ stacked on each plate… and raspberry jam smeared across all the stacks, dripping down their sides, and touching each plate to every other plate! Our brains are messy.

If we want an artificial neural network to learn and grow like ours, it will need multiple ‘pancakes’ of error-predicting deep NNs, where each of those deep NNs is composed of a layered Mixture of Experts. Each ‘pancake’ network receives as input the errors of the ‘pancake’ below it, as well as some of the features that were detected by the ‘pancakes’ adjacent to it. That’s wildly different from current deep NN architectures. And, it’s worth a try.
