Representation Power of Neural Networks

ASHISH RANA
Towards Data Science
7 min read · Nov 6, 2018


We are aware of neural networks and their countless achievements across domains, from data science to computer vision. It is well known that they are good at solving complex tasks that involve generalization, and, mathematically speaking, they are very good at approximating the behavior of any complex function. Let’s understand this concept of approximation visually, instead of through the forward and backward propagation view in which the prediction error is minimized. I assume you know the basics of forward and backward propagation, which aim to approximate the behavior of a function with the help of gradients and error propagated across the network. Here we develop an understanding of the approximating capabilities of a neural network through an alternative, visual explanation that involves only basic mathematics and graphical analysis.

Mathematically, we will look at the representation power of a given neural network for the function to be approximated.

Representation power relates to the ability of a neural network to assign the proper label to a particular instance and to create well-defined, accurate decision boundaries for that class. In this article we explore a visual approach for learning more about the approximating behavior of a neural network, which is directly related to its representation power.

The Journey

It all begins with the MP (McCulloch-Pitts) neuron, a very simplified model of a neuron built on a very simple idea: if the sum of all inputs is larger than a threshold, the neuron fires; otherwise it does not. A very primitive beginning indeed. To check its representation power, let’s look at a geometric interpretation: first a 2-D analysis with 2 inputs approximating the OR function, then a 3-D analysis with 3 inputs.

For separation in a 2-D coordinate system a line is required. The neuron fires for points to the right of the line; hence, a separation boundary is created.
For separation in a 3-D coordinate system a plane is required. The neuron fires for all points above the plane.
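To make the OR example above concrete, here is a minimal Python sketch of a McCulloch-Pitts neuron. The function name and the threshold value are illustrative choices, not something prescribed by the original figures.

```python
# A minimal sketch of a McCulloch-Pitts neuron (names and values are illustrative).
# It fires (outputs 1) when the sum of its binary inputs reaches a threshold.

def mp_neuron(inputs, threshold):
    return 1 if sum(inputs) >= threshold else 0

# OR over two inputs: fire when at least one input is 1, i.e. threshold = 1.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", mp_neuron([x1, x2], threshold=1))
# 0 0 -> 0 | 0 1 -> 1 | 1 0 -> 1 | 1 1 -> 1
```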

Hence, the McCulloch-Pitts neuron can be used to represent any Boolean function that is linearly separable. Note also the strict demarcation rather than a gradual decision boundary: anything slightly above the separating boundary is labelled 1 and anything just below it 0; the neuron triggers with step-function-like behavior. More flexibility is achieved with perceptrons, which attach a weight to each input, but the strict demarcation still exists. Both mechanisms fail for functions that are not linearly separable. A very simple example is XOR: imagine drawing a single separating line in a 2-D plane for this function. There is no such line! And, very much like XOR, most real data is not linearly separable.

Hence, more advanced computational models like today’s neural networks are required to create separating boundaries for such functions. Consider a basic illustration with one hidden layer and some predefined weights that replicates the XOR function.

Conditions for implementing XOR: w1 < w0, w2 ≥ w0, w3 ≥ w0, w4 < w0

Remember: any Boolean function of n inputs can be represented by a network of perceptrons containing one hidden layer with 2^n perceptrons and one output layer containing a single perceptron. This is a sufficient condition, not a necessary one.
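Here is a hedged sketch of that 2^n construction for XOR, in Python. One hidden perceptron is dedicated to each input pattern, and the output perceptron fires only for the patterns where XOR is 1. The helper names and the specific weights are illustrative; they realize the spirit of the conditions above rather than the exact w0..w4 values of the figure.

```python
import itertools

def perceptron(x, w, b):
    # Fires when the weighted sum of inputs reaches the threshold (bias form).
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

def pattern_detector(pattern):
    # Hidden perceptron that fires only for one specific binary input pattern:
    # weight +1 for bits that should be 1, -1 for bits that should be 0,
    # and a bias equal to minus the number of ones in the pattern.
    w = [1 if bit == 1 else -1 for bit in pattern]
    b = -sum(pattern)
    return w, b

# One hidden perceptron per input pattern (2^n of them), as in the sufficiency claim.
n = 2
patterns = list(itertools.product((0, 1), repeat=n))
hidden = [pattern_detector(p) for p in patterns]

def xor_network(x):
    h = [perceptron(x, w, b) for w, b in hidden]
    # Output perceptron: weight 1 for patterns where XOR is 1, weight 0 otherwise;
    # it fires when the weighted sum reaches 1.
    out_w = [1 if sum(p) % 2 == 1 else 0 for p in patterns]
    return perceptron(h, out_w, -1)

for x in patterns:
    print(x, "->", xor_network(x))
# (0, 0) -> 0, (0, 1) -> 1, (1, 0) -> 1, (1, 1) -> 0
```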

That concludes our analysis of single-hidden-layer networks with step-function-like approximations, which are limited by their harsh judging criterion, equivalent to a step function. Let’s dive into multilayer deep networks with sigmoidal non-linear activation functions.

Story Today

The representation power of sigmoidal neurons is much higher: a network of sigmoidal neurons with a single hidden layer can be used to approximate any continuous function to any desired precision.

Mathematically, there is a guarantee that for any continuous function f(x): ℝⁿ → ℝᵐ we can always find a neural network with hidden layer(s) whose output g(x) satisfies |g(x) − f(x)| < Θ for any tolerance Θ > 0.

The above claim is a big one, as it means we can approximate any such function with a suitable neural network. In mathematical terms, the universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of ℝⁿ, under mild assumptions on the activation function. The theorem thus states that simple neural networks can represent a wide variety of functions when given appropriate parameters; however, it does not touch upon the algorithmic convergence of those parameters. That convergence part is related to the forward and backward propagation algorithm. Let’s work through an intuitive explanation of the theorem stated above, which acts as the fundamental basis for the learning behavior of neural networks.
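For reference, one common formal statement of the theorem is sketched below for a scalar-valued f (the vector-valued case follows component-wise). The symbols N, v_i, w_i, and b_i are introduced here purely for illustration; the tolerance is written as Θ to match the inequality above.

```latex
% One common form of the universal approximation theorem (a sketch, not the only statement):
% for a continuous f on a compact set K \subset \mathbb{R}^n, a non-constant, bounded,
% continuous activation \sigma, and any tolerance \Theta > 0, there exist a finite N and
% parameters v_i, w_i, b_i such that
g(x) = \sum_{i=1}^{N} v_i \, \sigma\!\left(w_i^{\top} x + b_i\right),
\qquad \sup_{x \in K} \left| \, g(x) - f(x) \, \right| < \Theta .
```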

Geometric interpretation of function approximation: the classical mathematical approach used in numerical approximation.

End Game: Towers of Sigmoids

Continuing the above conversation about approximating functions with neural networks, just look at the diagram below and decide for yourself: a function can be approximated by superimposing several “tower” functions. This process forms a shape equivalent to the given function, with some minor approximation error. The above interpretation of the universal approximation theorem tells us that the more towers we use for the approximation, the better the approximation becomes. Hence, the parameters of the sigmoidal activations are tuned with the aim of creating such approximation towers. Under this interpretation there is, in theory, no limit on the accuracy of neural networks.

Clearly, the more towers, the better the approximation and the smaller the approximation error.

Let’s dive deeper into this construction. All these “tower” functions are similar and differ only in their heights and positions on the x-axis. Now we have to look at how these towers are created from sigmoid activation functions.

Our aim is to figure out the black-box “Tower Maker” that performs the tower construction.

A typical logistic sigmoid activation function takes the form σ(x) = 1 / (1 + e^(−(w·x + b))), where w represents the weight and b the bias.

As w increases, the function becomes steeper and more step-function-like. More positive values of b move the curve to the left of the original curve.

Hence, by changing these values we can create different versions of the sigmoid, which we can superimpose on each other to obtain tower-like structures. To create a tower in a 2-D coordinate system, subtract two such curves with different values of the bias.

The left curve has the more positive value of the bias b. Hence, with multiple such towers the random curve above can be approximated or represented.
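A minimal numpy sketch of this “Tower Maker” idea follows; the specific values of w and b are illustrative, not values taken from the article’s figures.

```python
import numpy as np

def sigmoid(x, w, b):
    # Logistic sigmoid with weight w (steepness) and bias b (horizontal shift).
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

# Two steep sigmoids that differ only in bias; their difference is a "tower"
# (a bump) between the two step locations.
x = np.linspace(-10, 10, 1000)
left_edge  = sigmoid(x, w=10, b=40)   # more positive bias: step sits further left
right_edge = sigmoid(x, w=10, b=20)
tower = left_edge - right_edge        # ~1 between the two steps, ~0 elsewhere
```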

We can extend this operation to the hidden layers of a neural network and construct a network that simulates this subtraction of curves. Hence, a neural network can represent any such function, with the values of its weights and biases determined by running the forward and backward propagation algorithm until a convergence criterion is met.

Now, the random curve of the function illustrated above can be approximated by superimposing many such towers.
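Building on the bump above, here is a hedged sketch of the “more towers, better approximation” idea, using sin(x) purely as an example target; the tower count, width, and steepness are arbitrary choices.

```python
import numpy as np

def tower(x, center, width, height, steepness=50.0):
    # One bump built from the difference of two steep sigmoids (see sketch above).
    left  = 1.0 / (1.0 + np.exp(-steepness * (x - (center - width / 2))))
    right = 1.0 / (1.0 + np.exp(-steepness * (x - (center + width / 2))))
    return height * (left - right)

# Approximate a target curve by placing one tower per small interval,
# each scaled to the function's value at its centre.
x = np.linspace(0, 2 * np.pi, 2000)
target = np.sin(x)

n_towers = 50
width = (x[-1] - x[0]) / n_towers
centers = x[0] + width * (np.arange(n_towers) + 0.5)
approx = sum(tower(x, c, width, np.sin(c)) for c in centers)

print("max |error|:", np.max(np.abs(approx - target)))  # shrinks as n_towers grows
```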

Case Study

Consider a scenario with more than one input. Suppose we are trying to decide whether we will find oil at a particular location on the ocean bed, and suppose we base our decision on two factors: salinity (x1) and pressure (x2). We are given some data, and it seems that y (oil | no-oil) is a complex function of x1 and x2. We want a neural network to approximate this function.

The illustration above plots this scenario. Clearly, we need 3-D towers that approximate the distribution function.

With this understanding, we need to build such closed 3-D towers in a 3-D coordinate system. If we carry over the approach mentioned above, subtracting two sigmoids with different biases, into 3-D space, we get the following equivalent surface.

We still do not get a closed tower.

But we can see that if we take another open tower, horizontally perpendicular to the current resultant surface, we can obtain a closed tower by superimposing these two horizontally perpendicular open towers.

We can then pass this output through another, combining sigmoid activation to get a proper closed tower suitable for the approximation.
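A minimal sketch of this construction follows. The interval edges, steepness, and grid resolution are illustrative choices, not values from the article’s figures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def open_tower(coord, lo, hi, steepness=50.0):
    # An "open" tower: ~1 for lo < coord < hi, but extending without limit along the other axis.
    return sigmoid(steepness * (coord - lo)) - sigmoid(steepness * (coord - hi))

# Grid over the two inputs (salinity x1 and pressure x2 in the case study).
x1, x2 = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))

ridge_1 = open_tower(x1, 0.4, 0.6)      # open tower along the x2 direction
ridge_2 = open_tower(x2, 0.4, 0.6)      # the perpendicular open tower along x1

# Their sum is ~2 only where both ridges overlap; a final combining sigmoid
# with a threshold between 1 and 2 turns that overlap into a closed 3-D tower.
closed_tower = sigmoid(50.0 * (ridge_1 + ridge_2 - 1.5))
```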

We can now approximate any function by summing up many such towers.

The complex distribution function in the case study above can be reconstructed with the help of many such towers. Here, let’s see a neural network that represents the procedure just described.

The illustrative proof we just saw tells us that we can have a neural network with two hidden layers that approximates the above function by a sum of towers. This means we can have a neural network that exactly separates distributions like the one mentioned in the case study above. There is no theoretical limit on the accuracy of neural networks.

We are interested in separating the blue points from the red points. With a single sigmoidal neuron, errors will clearly remain. But with two hidden layers we can approximate the function above by a sum of towers, and we can have a neural network that exactly separates the blue points from the red points!

Acknowledgements

The visualization approach to approximation is quite unique and interesting, which is why I felt the need for this discussion. I have merely restructured explanations that already exist at neuralnetworksanddeeplearning.com. The descriptive illustrations are borrowed from the Deep Learning course CS7015. Thanks to Professor Mitesh M. Khapra and his TAs for this wonderful course!!
