Sketching a Proof of Convergence for Covariance-Learning in Neural Networks

Anthony Repetto
Towards Data Science
10 min read · Jun 9, 2017


(I am providing a lengthy, somewhat technical explanation that elaborates on my writings regarding covariance as a network-wide cost function that allows a network to train newly inserted neurons, especially for a Mixture of Experts neural network. This work is intended to facilitate development of machine intelligence which can learn from new experiences as they happen.)

First, some background:

Neural Networks are Powerful.

Neural networks are diagnosing cancers and translating between hundreds of languages. Complex classification tasks are now possible using deep neural networks. As we rely upon neural networks for increasingly sophisticated activities, we can expect these networks to expand to incredible depth. These networks have immense potential value. Yet, there are challenges that arise from training very deep neural networks…

Over-fitting is a Problem.

When a neural network is trained to classify data, it eventually learns to sort perfectly; but it will only sort your training data perfectly, because it memorizes the entire set. When given new data, these over-fit models perform poorly. Moving from a random network (which performs a random classification) to a memorized-data network (which performs an exact classification, i.e. it is ‘over-fit’) is a big leap, and good-generalization networks (which perform well on new data) sit at points just short of that memorization. Researchers halt their training once the classifier begins to worsen on new data (the practice called ‘early stopping’). That’s a kludge, to work around the fact that minimizing the ‘loss-function’ is not what we really want. We actually want good generalization!

Training-Time is Money.

A deeper neural network takes much longer to train, and often requires much larger data sets to learn well. You can use GPUs (or TPUs) to reduce time and cost, but that is only a one-off improvement. Fundamentally, training will balloon in cost as networks become deeper. Moore’s Law is dead, so the argument that ‘improvements in computer hardware will continue to make deep neural networks cheaper’ falls flat.

And, Learning needs to happen Everywhere.

Neural networks store learning across the depth of the network: simple, local features (like the location of an edge) are usually distinguished at lower layers, while higher layers classify more abstract features (e.g. ‘animal’, ‘tool’, ‘building’). Unfortunately, neural networks are trained using back-propagation from the output layer: they learn from the top, down. So, the features that must be learned at low layers are only taught by signals that have diffused down through many layers. This is the ‘vanishing gradient’ problem. Residuals and skip connections are fashionable work-arounds, but they do not eliminate top-down learning. (Residuals essentially amplify the gradient, while skip-connections effectively flatten the depth of the network; both still learn from the top layer, down.)

Covariance is the Cure

Specifically, negative covariance (and its cousin, negative correlation) is the solution. Negative covariance occurs when two signals tend to occur in opposition: if one is ‘on’, the other is ‘off’, and vice versa. For a neural network, measure the covariance between each neuron’s activity and correct classification: negative covariance is registered if a neuron is ‘on’ when the network misclassifies data, while it is ‘off’ when the network classifies correctly. (The mirror-image condition, ‘on’ when correct and ‘off’ when wrong, carries the same information with the sign flipped, so it identifies error-detectors just as well.) Neurons registering negative covariance are essentially error-detectors. They tell us: ‘if this neuron is firing, then we are likely to get the wrong answer’. We should listen to those neurons, to find which lessons have not been learned.
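
A minimal sketch of that measurement in NumPy (the function name, array shapes, and the toy example are my own illustration, not from the original article):

```python
import numpy as np

def neuron_covariances(activations, correct):
    """Covariance of each neuron's activation with classification success.

    activations: (n_examples, n_neurons) array of one layer's activations
    correct:     (n_examples,) array, 1.0 if that example was classified
                 correctly, 0.0 if it was misclassified
    """
    a = activations - activations.mean(axis=0)  # center each neuron's activity
    c = correct - correct.mean()                # center the success indicator
    return a.T @ c / len(correct)               # one covariance per neuron

# Toy example: a neuron that fires mostly on misclassified examples
# registers negative covariance; it is an 'error-detector'.
acts = np.array([[0.9], [0.8], [0.1], [0.2]])   # 'on' for examples 0 and 1...
corr = np.array([0.0, 0.0, 1.0, 1.0])           # ...which were misclassified
print(neuron_covariances(acts, corr))           # -> [-0.175], negative
```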

Covariance Eliminates Top-Down Learning

Neural networks today all suffer from top-down learning — their ‘loss-function’ or ‘cost-function’ is the error found at the output layer. (The loss function asks: “Did the neural network get the correct answer?”) Meanwhile, covariance can find errors at any layer, not just the top-most output layer! Negative covariance, being an error-detector, points specifically to the neurons ‘responsible’ for errors.

Said another way: if neural networks were businesses, the ‘loss-function’ is a memo that travels through many layers of management, before getting down to the employees. The ‘loss-function’ takes the result “Our business lost money!” and tells upper-management to “Do Something Differently”. Those managers then send a message to all the middle managers, insisting that they “Do Something Differently”. By the time that message gets to all the employees, what they should change is unclear. The memo was vague, to begin with, and became diluted as it traveled down the hierarchy.

Continuing the business metaphor: negative covariance is like a targeted memo, following a personal performance review. Covariance asks of each neuron/employee: “Did this employee do the work that caused our product to fail?” The entire network is under review, and the message “Do Something Differently” is only sent to the neurons/employees who DO need to do something differently!

So, while the loss-function is only defined at the output layer, covariance is defined at every neuron in the network. With covariance, learning happens wherever it’s needed, irrespective of layer.

Covariance Saves Time and Money

Training a very deep neural network adds time, because the ‘business memo’ becomes diluted as it travels through additional layers. In contrast, covariance targets its ‘memos’, and can give prescriptions to each neuron, regardless of depth. With covariance, extra layers don’t dilute the message, and so they don’t lengthen training time.

Covariance saves time on re-training. Traditional training creates a neural network that is static; if new data is to be included in the network’s training, the network is often re-trained from scratch. Suppose the new data requires a change to some low-level features: the network would need to send numerous ‘memos’ down the hierarchy before meaningful change would occur at the lower layers. Negative covariance avoids this problem by immediately identifying the specific neurons responsible for errors, without needing to travel down a hierarchy.

As neural networks become deeper, they cannot be re-trained quickly using top-down learning. Frequent updates of deep networks become prohibitively expensive, and new data might arrive by the time that they are done training on the old data! Negative covariance eliminates the lag and cost of re-training, allowing incredibly deep neural networks to be updated with each new batch of data. For many applications, that capability is the difference between a toy model and a usable product.

Covariance Pushes Feature-Detection to the Lowest Layers, to Delay Over-fitting

This point takes a little effort to make: by applying back-propagation at neurons which exhibit negative covariance, the network is ‘pushed’ downwards, detecting features at the lowest layers possible. When features are identified at an earlier stage, it is harder for the network to stumble into a memorization. A proof of convergence requires that perfect accuracy is eventually reached, but this accuracy is less likely to be from memorization when the network relies on early feature-detection.

To show how all of this happens:

Imagine a neural network where the last layer is simply a copy of the layer below it; each neuron is wired, one-to-one and at full strength, to the neuron beneath it. The two layers’ responses are identical. So, the last layer is effectively irrelevant, with all the real computation occurring at lower layers. We can say that the classification problem has been ‘pushed down’ to the lower layers.
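
A toy sketch of such a ‘copy’ layer (the arrays and sizes here are my own illustration):

```python
import numpy as np

hidden = np.array([0.1, 0.9, 0.3])   # activations of the layer below
W_last = np.eye(3)                   # one-to-one wiring at full strength
output = W_last @ hidden             # the 'copy' layer

assert np.allclose(output, hidden)   # the last layer adds nothing:
                                     # computation was 'pushed down'
```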

If we make a neural network with more layers than it needs, we would hope that training would result in the ‘pushing down’ of feature-detection, eliminating the extra layers. This is because extra layers allow the network to classify with greater specificity — we need enough layers to specify our classification, but too much specificity lets the network memorize (‘over-fit’) the training data.

So, if the classification task is ‘pushed down’ by our training algorithm, then the network is still able to attain accuracy for all the training data (that being the definition of ‘layers it needs’). Yet, it would not have the depth necessary to perfectly memorize the data; it would have only enough layers to learn effective classification. This property can be broadened to apply to all feature-detection at all layers: if the network is capturing those features at the lowest layer possible, then there is only enough depth to allow classification, not enough depth to allow perfect memorization. A network that ‘pushes’ feature-detection down to the lowest possible layers will delay over-fitting — and that is what I argue covariance achieves.

So, how does negative covariance ‘push down’ feature-detection?

If a neuron registers negative covariance, it is saying “I fire when the network is making a mistake”. To be able to identify mistakes, that neuron must be receiving data which are informative of a mistake-in-progress. That is to say: information that is reaching that neuron is involved in the mistake. (If this were NOT so, the neuron would be unable to detect errors, and so, it would not be registering negative covariance — the statement is tautological!) By back-propagating from that neuron, the neurons beneath it are discouraged, which is a discouragement of the information that led to errors. Any features which were active during mistakes would be discouraged, and as a result, the classification begins to rely upon features which were correct, because they are the only ones not discouraged.

This ‘discouragement’ applies at each layer; back-propagation discourages any feature detected at a lower layer that results in negative covariance at a higher layer. (This is exactly what the ‘loss-function’ does, at the output layer; the ‘loss-function’ is really just negative covariance restricted to the topmost layer. By generalizing this activity, negative covariance lets us do this at all layers.) A higher-layer neuron that was classifying correctly may have some of its inputs discouraged, because those inputs also travel to a neuron with negative covariance. That means the higher-layer neuron, by this discouragement, begins to listen more closely to the input neurons which were themselves classifying correctly. Taken to the extreme, the input neurons would be performing the correct classification, and transmitting that to the higher layer. At the topmost layer, this circumstance is equivalent to output neurons that fire when their lower layer input is a single, correctly-classifying neuron. That is what we defined as a ‘pushed down’ feature-detection!
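
One way this discouragement might look in code, under my own simplifying assumptions (a single layer, a boolean mask of misclassified examples, and a step scaled by the strength of each neuron’s negative covariance; the function and its names are hypothetical, not the article’s algorithm):

```python
import numpy as np

def discourage_error_detectors(W, inputs, covariances, mistakes, lr=0.01):
    """Weaken the connections feeding each error-detector.

    W:           (n_in, n_out) weights into one layer
    inputs:      (n_examples, n_in) lower-layer activations
    covariances: (n_out,) per-neuron covariance with correct classification
    mistakes:    (n_examples,) boolean mask of misclassified examples
    """
    W = W.copy()
    if not mistakes.any():                       # nothing to discourage
        return W
    drive = inputs[mistakes].mean(axis=0)        # inputs active during errors
    for j in np.where(covariances < 0)[0]:       # the error-detectors
        # push neuron j's incoming weights away from the inputs that drove
        # it during mistakes, scaled by how strongly it predicts errors
        W[:, j] -= lr * (-covariances[j]) * drive
    return W
```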

So, a network that is trained using back-propagation from co-varying neurons will tend to discourage the activity of any layer’s neurons which are involved in mistakes. The neurons that remain active are those which are already correctly classifying features. The layers above those correct-classifiers are effectively redundant. The effect of this is: correct classification is ‘pushed down’ to the lower layer, and merely copied to the higher layer (unless the lower layer cannot completely classify on its own, in which case the higher-layer neurons receive some combination of activities from the lower layer: an example of when ‘more layers are needed’).

A Sketch of the Proof of Convergence:

Without a proof of convergence, a training algorithm might jump around without ever attaining perfect accuracy on the training data. (It could attain perfect accuracy, but we wouldn’t have mathematical assurance!) I’ll introduce a few tricks, to wrangle convergence out of negative covariance:

Half-Learning

After running all the training data through our neural network, we separate the correctly-classified inputs from the misclassified ones, and measure the negative covariance between these two groups, for each neuron. We then apply gradient descent by back-propagation on each neuron, according to its degree of negative covariance. Suppose that, after applying that ‘learning’ from gradient descent, we find that the number of misclassifications has increased! In that case, cut the rate of ‘learning’ in half, and try again. Continue cutting the learning-rate in half until the number of errors does not increase. Once you have the same or fewer misclassifications, you measure negative covariance again… I’ll call this process of cutting the learning-rate in half the ‘half-learning’ of the network. It guarantees that the number of errors does not increase.
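
A minimal sketch of that loop, as I read it (the function and its parameters are my own naming; `count_errors` stands in for a full pass over the training data):

```python
def half_learning_step(weights, update, count_errors, lr=1.0, min_lr=1e-12):
    """Apply one covariance-driven update, halving the learning rate until
    the number of misclassifications does not increase.

    weights:      current weights (one array, for simplicity)
    update:       the descent direction computed from per-neuron negative
                  covariance (same shape as weights)
    count_errors: function mapping weights -> number of misclassified
                  training examples
    """
    baseline = count_errors(weights)
    while lr > min_lr:
        trial = weights - lr * update
        if count_errors(trial) <= baseline:   # same or fewer errors:
            return trial, lr                  # accept the step
        lr *= 0.5                             # otherwise, 'half-learn'
    return weights, 0.0                       # no acceptable step found
```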

Stopping-Point of Half-Learning

When will half-learning halt? When even the slightest learning-rate would cause the number of errors to increase. Could this occur when the number of errors is positive? Answering that requires some close observation:

When there are still errors, some neurons exhibit negative covariance. That is, they tend to be ‘on’ during a misclassification and ‘off’ during a correct classification, or vice versa. If the only neuron with negative covariance were one of the output neurons (the neuron that fired incorrectly, causing a misclassification), then the situation would be equivalent to the traditional ‘loss-function’ with a misclassification at the output layer. That case is already proven to converge in the existing literature.

Now, suppose that there is a misclassification (which necessarily includes negative covariance at one of the output neurons) that also shows negative covariance at one other neuron, in a deeper layer. If ‘half-learning’ can find a small change to the co-varying neuron that eliminates the error, without creating a new error, then the network still converges.

There exists a ‘half-learning’ rate which does eliminate the existing error, without creating a new error. This is because the misclassifier on the output layer must depend upon the erroneous neuron and its inputs more than the correct-classifiers do. (If the correct-classifiers depended upon the erroneous neuron more heavily than the misclassifier did, then they would necessarily also misclassify inputs. Correct-classifiers on the output layer must draw upon inputs which are predominantly correct, by definition!) Because the misclassifying neuron on the output layer depends upon the lower-layer erroneous neuron more than the other output neurons do, a change to that erroneous neuron will have a greater impact on the misclassifier than it does on the correct neurons… The misclassification can be eliminated, using some small ‘half-learning’ rate, without changing the correct neurons too much!
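
In rough symbols (my notation, not the author’s), the claim is a sensitivity gap:

```latex
% m = the misclassifying output neuron, c = any correct output neuron,
% w = the weights feeding the erroneous lower-layer neuron.
\left|\frac{\partial m}{\partial w}\right| > \left|\frac{\partial c}{\partial w}\right|
\quad\Longrightarrow\quad
\exists\, \eta > 0 :\;
w \mapsto w - \eta\,\frac{\partial m}{\partial w}
\;\text{ removes the error at } m \text{ without creating one at any } c.
```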

So, we have demonstrated convergence for the case of a single neuron at the output layer with negative covariance (this is just the traditional ‘loss-function’, with convergence proven in the existing literature), as well as convergence in the case where an additional neuron at some lower layer exhibits negative covariance (the ‘erroneous neuron’ in the above paragraph). By induction, I offer that additional neurons at any layer which exhibit negative covariance are similarly constrained, and are similarly convergent as a result. Each erroneous neuron must be more strongly associated with a misclassification at the output layer than it is with correctly-classifying neurons, and so a small change to those erroneous neurons will affect the misclassifier more than it affects the correct classifiers. This guarantees that errors will not increase, and that they can decrease to zero.

Is that all?

I do not intend to imply that covariance allows a network to always converge faster than existing methods. Instead, covariance enables training of inserted clusters of neurons, providing faster re-training times. You can grow a new cluster of neurons anywhere inside the existing network, and train it on new data, while the rest of the network’s weights are ‘frozen’. I suggest planting those clusters at the error-detecting neurons within a Mixture of Experts neural network. Each cluster essentially acts as a ‘specialized expert’ that extracts features from the error-detectors, to help correct mistakes! The network adapts to new data, without forgetting older lessons (that forgetfulness is the bane of re-training deep networks!)… I hope that negative covariance will facilitate the development of exceptionally deep networks that grow and adapt to new information quickly, without losing core insights that they learned in the past.
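
A sketch of what such an inserted cluster might look like, under my own assumptions (a tiny two-layer ReLU cluster that reads the same inputs as an error-detector and learns to emit a correction signal; the class and its names are hypothetical):

```python
import numpy as np

class ExpertCluster:
    """A small, trainable cluster grafted beside an error-detecting neuron.

    It reads the same inputs that feed the error-detector and learns to emit
    a correction signal, while the host network's weights stay frozen.
    """

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.W_out = rng.normal(0.0, 0.1, n_hidden)

    def forward(self, x):
        h = np.maximum(0.0, x @ self.W_in)   # ReLU hidden layer
        return h @ self.W_out                # scalar correction signal

    def train_step(self, x, target, lr=0.01):
        # Gradient step on the cluster alone; the frozen host network is
        # never touched, so its old lessons are preserved.
        h = np.maximum(0.0, x @ self.W_in)
        err = (h @ self.W_out) - target      # squared-error gradient signal
        g_out = err * h
        g_in = np.outer(x, err * self.W_out * (h > 0.0))
        self.W_out -= lr * g_out
        self.W_in -= lr * g_in
```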
