
Neural Networks are universal approximators that map data to information. What does this mean? Can Neural Networks solve any problem? Neural Networks are a proven solution for frame-by-frame video analysis, stock price prediction, retail forecasting, and many other tasks. Many of us use them at the enterprise level, but how many of us truly understand them?
To answer the question, ‘Can Neural Networks solve any problem?’, let’s start from the basics. A NeuralNet is made up of vertically stacked components called layers: input, hidden, and output. Each layer consists of a certain number of neurons. The input layer holds the attributes (features) of the dataset. There can be multiple hidden layers with multiple neurons each, and the output layer can have multiple neurons depending on the problem statement.
Understanding Perceptron and Activation functions
The Perceptron (or neuron) is the fundamental building block of neural networks. It works on the principle of thresholding. Let f be a summation function with a threshold of 40.

In both cases, the defined function returns the sum of the two inputs, x₁ and x₂. In case 1, the function returns 30, which is less than the threshold value, so the neuron stays silent. In case 2, the function returns 50, which is greater than the threshold, and the neuron fires. In practice, the function gets more complicated than this: a neuron of a typical neural network receives the sum of the input values multiplied by their weights, plus a bias, and the function, also known as the Activation function or Step function, helps in making the decision.
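The thresholding neuron described above can be sketched in a few lines. The threshold of 40 comes from the text; the concrete input pairs (10 + 20 and 20 + 30) are assumed here purely to reproduce the sums 30 and 50 from the two cases.

```python
def f(x1, x2, threshold=40):
    """Summation neuron: fire (return 1) if the sum of inputs crosses the threshold."""
    return 1 if x1 + x2 >= threshold else 0

# Case 1: 10 + 20 = 30 < 40, the neuron does not fire.
print(f(10, 20))  # 0

# Case 2: 20 + 30 = 50 > 40, the neuron fires.
print(f(20, 30))  # 1
```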

An Activation Function converts the output of a node into a binary output: 1 if the weighted input crosses the threshold, 0 otherwise (depending on the activation function). Three of the most commonly used activation functions are:
Sigmoid
Sigmoid is a widely used activation function that helps in capturing non-linear relationships.
Φ(z) = 1 / (1 + e⁻ᶻ)

For any value of z, the function Φ(z) returns a value strictly between 0 and 1 (not a hard 0/1). Because its output can be interpreted as a probability, it is widely used in probability-based problems.
Tanh (Tangent Hyperbolic)
It is more or less like the Sigmoid function, but tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ) ranges from -1 to 1, which makes it well suited for classification problems. It is also non-linear.
ReLu (Rectified Linear Unit)
It is the most used activation function in deep learning because it is computationally simpler than the other activation functions: f(x) = max(0, x), returning either 0 or x.


This makes computation easy, as the derivative of the ReLu function is either 0 or 1.
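The three activation functions above can be written directly from their definitions, which makes their different output ranges easy to verify:

```python
import math

def sigmoid(z):
    # Maps any real z into the open interval (0, 1).
    return 1 / (1 + math.exp(-z))

def tanh(z):
    # Like sigmoid, but the output ranges from -1 to 1.
    return math.tanh(z)

def relu(z):
    # Returns either 0 or z; its derivative is either 0 or 1.
    return max(0.0, z)

print(sigmoid(0))  # 0.5
print(tanh(0))     # 0.0
print(relu(-3.2))  # 0.0
print(relu(3.2))   # 3.2
```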
NeuralNet
To open the black box of a Neural Network, let us consider a basic structure with 3 layers: an Input layer, a Dense/Hidden layer (whose neurons are connected to every neuron on both sides), and an Output layer.

Weights and biases are randomly initialized. Getting accurate output from a neural network is all about finding the optimal values for the weights and biases by continuously updating them. Consider the equation y = wx, where ‘w’ is the weight parameter and ‘x’ is the input feature. In simple terms, the weight defines how much importance is given to a particular input attribute (feature). Now, the line y = wx always passes through the origin, so an intercept, known as the bias, is added to give the fit the freedom to move away from the origin, and the equation becomes ŷ = wx + b, which we are all familiar with. The bias thus allows the activation function’s curve to shift up or down the axis.
Let us now see how complicated a neural network can get. In our network, there are two neurons in the input layer, four in the dense layer, and one in the output layer. Every input value is associated with its weights and biases. The combination of input features with weights and biases passes through the dense layer, where the network learns the features with the help of the activation function and its own weights and biases, and finally makes the prediction (output). This is Forward Propagation. So, how many parameters does our network have in total?

Even for such a simple network, there are 17 parameters in total that need to be optimized to reach an optimal solution. As the number of hidden layers and the neurons in them increases, the network gains more power (up to a certain point), but the number of parameters to optimize grows rapidly, which can end up consuming a huge amount of computing resources. So, there is a trade-off.
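The count of 17 can be checked mechanically: each dense layer contributes (inputs × outputs) weights plus one bias per output neuron. A small sketch, with the helper name `count_parameters` assumed here:

```python
def count_parameters(layer_sizes):
    """Total trainable parameters of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights + biases per layer
    return total

# The 2-4-1 network from the text: (2*4 + 4) + (4*1 + 1) = 12 + 5
print(count_parameters([2, 4, 1]))  # 17
```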
UPDATING THE NETWORK
After a single iteration of forward propagation, the error is calculated as the (squared) difference between the actual output and the predicted output. In a network, the inputs and activation functions are fixed; hence, we can only change the weights and biases to minimize the error. The error can be minimized by tracking two things: how much the error changes when the weights change by a small amount, and the direction of that change.
Cost Function
A simple neural network predicts a value based on a linear relationship, ŷ = wx + b, where ŷ (predicted) is the approximation of y (actual). There can be several lines fitting ŷ; to choose the best-fit line, we define the Cost Function.
Let ŷ = θ₀ + xθ₁. We need to find values of θ₀ and θ₁ such that ŷ is as close to y as possible. To do that, we need values of θ₀ and θ₁ that minimize the error defined below.

Error, E = squared difference between the actual and the predicted value = (ŷ − y)²
Hence, Cost = (1/2n) Σ (θ₀ + xθ₁ − y)², where n is the total number of points over which the mean squared difference is taken, and the division by 2 simplifies the derivative later. We need to minimize this cost function.
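To make the cost concrete, here is the formula evaluated on a toy dataset. The data points are assumed for illustration (they are generated by y = 2x, so θ₀ = 0, θ₁ = 2 is the perfect fit with zero cost):

```python
def cost(theta0, theta1, xs, ys):
    """Cost = (1/2n) * sum of (θ0 + x*θ1 - y)^2 over all points."""
    n = len(xs)
    return sum((theta0 + x * theta1 - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # toy data on the line y = 2x

print(cost(0.0, 2.0, xs, ys))  # 0.0  (perfect fit)
print(cost(0.0, 1.0, xs, ys))  # (1 + 4 + 9) / 6 ≈ 2.333 (worse fit, higher cost)
```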
Gradient Descent
It is the algorithm that finds the best values for θ₀ and θ₁ by minimizing the cost function. We know that C = (1/2n) Σ (θ₀ + xθ₁ − y)². To minimize it, we take the partial derivatives of C with respect to the variables (the θ’s), known as Gradients.
∂C/∂θ₀ = (1/n) Σ (θ₀ + xθ₁ − y)
∂C/∂θ₁ = (1/n) Σ (θ₀ + xθ₁ − y)·x
These gradients represent the slope. Now, the original cost function is quadratic. So, the graph will look like this:

The equation to update θ is:

θⱼ := θⱼ − η·(∂C/∂θⱼ)
If we are at point P1, the slope is negative, which makes the gradient negative and the update term positive; hence the point moves in the positive direction until it reaches the minimum. Similarly, if we are at point P2, the slope is positive, which makes the gradient positive and the update term negative, moving P2 in the negative direction until it reaches the minimum. Here, η is the rate at which a point moves towards the minimum, known as the learning rate. All the θ’s are updated simultaneously (for a certain number of epochs) and the error is recalculated.
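The update rule, applied repeatedly with the gradients derived above, is gradient descent in full. A minimal sketch on assumed toy data from the line y = 2x (the learning rate and epoch count are also assumptions chosen so the loop converges):

```python
def gradient_descent(xs, ys, eta=0.05, epochs=2000):
    """Minimize C = (1/2n) * sum((θ0 + x*θ1 - y)^2) via θj := θj - η * ∂C/∂θj."""
    theta0 = theta1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [theta0 + x * theta1 - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / n                              # ∂C/∂θ0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n   # ∂C/∂θ1
        theta0 -= eta * grad0  # both θ's updated simultaneously
        theta1 -= eta * grad1
    return theta0, theta1

t0, t1 = gradient_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(t0, t1)  # θ0 ≈ 0.0, θ1 ≈ 2.0 — the line that generated the data
```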
On a side note
Following this approach, we can run into two potential problems: 1. While updating the values of θ, we could get stuck at a local minimum; a possible solution is to use Stochastic Gradient Descent (SGD) with momentum, which helps in crossing local minima. 2. If η is too small, convergence takes too long; alternatively, if η is too big (or even moderately high), the updates keep oscillating around the minimum and never converge. Hence, we cannot use one fixed learning rate throughout. To handle this, we can schedule a routine that adjusts the value of η as the gradient moves towards the minimum (e.g. cosine decay).
Backward Propagation
Backward Propagation is the series of operations that optimizes and updates the weights and biases in a NeuralNet using the Gradient Descent algorithm. Let us consider a simple neural network (Fig 2.) with an input layer, a single hidden layer, and an output layer.
Let x be the input, h the hidden layer, σ the sigmoid activation, w the weights, b the bias, wᵢ the input weights, wₒ the output weights, bᵢ the input bias, bₒ the output bias, O the output, E the error, and μ the linear transformation ((∑wᵢxᵢ) + b).
Now, we are creating the computation graph of Fig. 2 by stacking up the series of operations it takes to reach from input to output.

Here, E depends on O, O depends on μ₂, μ₂ depends on bₒ, wₒ & h, h depends on μ₁, and μ₁ depends on x, wᵢ & bᵢ. We need to calculate these intermediate changes (dependencies) with respect to the weights and biases. Since there is only one hidden layer, there are only input and output weights and biases, so we can divide the work into two separate cases.
Case 1: w.r.t. output weights and biases
∂E/∂wₒ = (∂E/∂O)·(∂O/∂μ₂)·(∂μ₂/∂wₒ)
∂E/∂bₒ = (∂E/∂O)·(∂O/∂μ₂)·(∂μ₂/∂bₒ)
So, by putting the values of derivatives in the above two equations of change in error, we get gradients as follows
∂E/∂wₒ = (O − y)·O(1 − O)·h
∂E/∂bₒ = (O − y)·O(1 − O)
and we can update weights and biases by the following equation:
wₒ := wₒ − η·(∂E/∂wₒ),  bₒ := bₒ − η·(∂E/∂bₒ)
This computation covers the hidden layer and the output. The computation between the input and hidden layer is similar.
Case 2: w.r.t. input weights and biases
∂E/∂wᵢ = (∂E/∂O)·(∂O/∂μ₂)·(∂μ₂/∂h)·(∂h/∂μ₁)·(∂μ₁/∂wᵢ)
∂E/∂bᵢ = (∂E/∂O)·(∂O/∂μ₂)·(∂μ₂/∂h)·(∂h/∂μ₁)·(∂μ₁/∂bᵢ)
and we can update these gradients using:
wᵢ := wᵢ − η·(∂E/∂wᵢ),  bᵢ := bᵢ − η·(∂E/∂bᵢ)
Both cases are computed in the same pass, and the error is recalculated for a set number of repetitions called epochs. Neural Networks trained this way are supervised. After running for a certain number of epochs, we have a set of optimized weights and biases for the selected features of a dataset. When new inputs are introduced to this optimized network, they are combined with the optimized weights and biases to produce the most accurate prediction possible.
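Putting forward propagation and both backward-propagation cases together gives a complete training loop. This is a sketch, not the article's exact implementation: the AND-gate dataset, learning rate, epoch count, and random seed are all assumptions chosen so the tiny network converges.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy supervised dataset (assumed): the AND gate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [0], [0], [1]], dtype=float)

wi = rng.normal(size=(2, 4))  # input weights  w_i
bi = np.zeros((1, 4))         # input bias     b_i
wo = rng.normal(size=(4, 1))  # output weights w_o
bo = np.zeros((1, 1))         # output bias    b_o

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

eta = 0.5
for epoch in range(5000):
    # Forward propagation: μ1 = X·wi + bi, h = σ(μ1), μ2 = h·wo + bo, O = σ(μ2)
    h = sigmoid(X @ wi + bi)
    O = sigmoid(h @ wo + bo)

    # Case 1: gradients w.r.t. output weights and biases (chain rule through μ2)
    d_mu2 = (O - y) * O * (1 - O)
    grad_wo = h.T @ d_mu2
    grad_bo = d_mu2.sum(axis=0, keepdims=True)

    # Case 2: gradients w.r.t. input weights and biases (chain rule through h and μ1)
    d_mu1 = (d_mu2 @ wo.T) * h * (1 - h)
    grad_wi = X.T @ d_mu1
    grad_bi = d_mu1.sum(axis=0, keepdims=True)

    # Simultaneous updates: θ := θ − η·gradient
    wo -= eta * grad_wo
    bo -= eta * grad_bo
    wi -= eta * grad_wi
    bi -= eta * grad_bi

# Run new inputs through the optimized weights and biases.
preds = (sigmoid(sigmoid(X @ wi + bi) @ wo + bo) > 0.5).astype(int)
print(preds.ravel())  # expected to recover the AND truth table
```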
Can Neural Networks solve any problem?
As discussed, Neural Networks are universal approximators. In theory, they are capable of representing any function, and hence they can solve any problem. As the network grows (more hidden layers), it gains more power, but the number of parameters to optimize also grows quickly, which takes a considerable amount of resources.
Implementation can be found here.