The Math behind Artificial Neural Networks

A neuron as brain and math as heart

Sampath Kumar Gajawada
Towards Data Science


Just like the brain consists of billions of highly connected neurons, a basic operating unit in a neural network is a neuron-like node. It takes input from other nodes and sends output to others. — Fei-Fei Li

Artificial Neural Networks have generated a lot of excitement in Machine Learning research and industry, thanks to many breakthrough results in speech recognition, computer vision and text processing.

In this article, I will focus on the basic structure of the neuron, how a neuron works and the math behind neural networks.

Perceptron

A simple artificial neuron consisting of an input layer and an output layer is called a perceptron.

What does this neuron contain?

  1. Summation function
  2. Activation function

The inputs given to a perceptron are first processed by the summation function and then passed through the activation function to produce the desired output.

[Figure: Perceptron]
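To make this concrete, here is a minimal sketch of a perceptron in Python; the feature values, weights, and bias are made-up numbers, and a simple step function is assumed for the activation.

    import numpy as np

    def perceptron(x, w, b):
        """One perceptron: summation of weighted inputs followed by a step activation."""
        summation = np.dot(w, x) + b       # summation function: w1*x1 + w2*x2 + ... + b
        return 1 if summation > 0 else 0   # activation function: simple threshold/step

    # Illustrative inputs, weights, and bias (arbitrary values)
    x = np.array([0.5, 0.8])
    w = np.array([0.4, -0.2])
    b = 0.1
    print(perceptron(x, w, b))             # 0.5*0.4 + 0.8*(-0.2) + 0.1 = 0.14 > 0, so output 1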

This is a simple perceptron, but when we have many inputs and a large amount of data, a single perceptron is not enough. We must keep adding neurons, which gives us the basic neural network with an input layer, a hidden layer, and an output layer.

[Figure: Neural network]

We should always remember that a neural network has a single input layer and a single output layer, but it can have multiple hidden layers. In the figure above, we can see a sample neural network with one input layer, two hidden layers, and one output layer.

As a prerequisite for neural networks, let us look at what an activation function is and the types of activation functions.

Activation Function

The main purpose of the activation function is to convert the weighted sum of a neuron's input signals into its output signal. This output signal then serves as input to the next layer.

Any activation function should be differentiable since we use a backpropagation mechanism to reduce the error and update the weights accordingly.

Types of Activation Function


Sigmoid

  1. Ranges between 0 and 1.
  2. A small change in x around zero results in a large change in y.
  3. Usually used in the output layer for binary classification.

Tanh

  1. Ranges between -1 and 1.
  2. Output values are centered around zero.
  3. Usually used in hidden layers.

ReLU (Rectified Linear Unit)

  1. Outputs max(0, x), so the range is 0 to infinity.
  2. Computationally inexpensive compared to the sigmoid and tanh functions.
  3. The default function for hidden layers.
  4. It can lead to neuron death, which can be mitigated by using the Leaky ReLU function.
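For reference, here is a minimal sketch of these activation functions in Python with NumPy; the Leaky ReLU slope of 0.01 is just a common illustrative choice.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))        # squashes values into (0, 1)

    def tanh(x):
        return np.tanh(x)                      # squashes values into (-1, 1), zero-centered

    def relu(x):
        return np.maximum(0.0, x)              # max(0, x): zero for negatives, identity otherwise

    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)   # small slope for x < 0 helps avoid dead neurons

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    for f in (sigmoid, tanh, relu, leaky_relu):
        print(f.__name__, f(x))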

So far, we have covered the prerequisites: the perceptron and the activation function. Now let us dive into the working of a neural network, which is its core.

Working of Neural Network

A neural network works based on two principles:

  1. Forward Propagation
  2. Backward Propagation

Let’s understand these building blocks with the help of an example. Here I consider a network with a single input layer, one hidden layer, and one output layer to keep the explanation clear.

Forward Propagation

  1. Suppose we have data and would like to apply binary classification to get the desired output.
  2. Take a sample with features X1 and X2; these features will go through a series of operations to predict the outcome.
  3. Each feature is associated with a weight: X1 and X2 are the features, and W1 and W2 are their weights. These serve as input to a neuron.
  4. A neuron performs two functions: a) summation and b) activation.
  5. In the summation step, all features are multiplied by their weights and the bias is added: Y = W1X1 + W2X2 + b.
  6. This sum is passed through an activation function. The output of this neuron is multiplied by the weight W3 and supplied as input to the output layer.
  7. The same process happens in each neuron; the activation function may vary across the hidden-layer neurons but is fixed in the output layer.

We just initialized the weights randomly and continued the process; there are many other techniques for initializing the weights. But you may be wondering: how do these weights get updated? That will be answered using backpropagation. Before moving on, the forward pass described above is sketched below.
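This is a minimal sketch in Python, assuming a sigmoid activation in both the hidden and output neurons and made-up values for the features, weights, and biases.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Illustrative sample and randomly chosen parameters (arbitrary values)
    x1, x2 = 0.7, 0.3      # features
    w1, w2 = 0.2, -0.5     # weights into the hidden neuron
    w3 = 0.9               # weight from the hidden neuron to the output neuron
    b1, b2 = 0.1, -0.3     # biases of the hidden and output neurons

    # Hidden neuron: summation followed by activation
    y_hidden = w1 * x1 + w2 * x2 + b1
    out_hidden = sigmoid(y_hidden)

    # Output neuron: the hidden output is weighted, summed with its bias, and activated again
    y_out = w3 * out_hidden + b2
    prediction = sigmoid(y_out)
    print(prediction)      # a value between 0 and 1, usable for binary classification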

Backward Propagation

Let us get back to our calculus basics: we will use the chain rule, learned in our school days, to update the weights.

Chain Rule

The chain rule provides us a technique for finding the derivative of composite functions, with the number of functions that make up the composition determining how many differentiation steps are necessary. For example, if a composite function f(x) is defined as f(x) = g(h(x)), then its derivative is f'(x) = g'(h(x)) * h'(x).
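For instance, take f(x) = (3x + 1)^2, which is the composition g(h(x)) with g(u) = u^2 and h(x) = 3x + 1. Then f'(x) = g'(h(x)) * h'(x) = 2(3x + 1) * 3 = 6(3x + 1).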

Let us apply the chain rule to a single neuron:

[Figure: Chain rule applied to a single neuron]

In neural networks, our main goal is to reduce the error. To make that possible we have to update all the weights by doing backpropagation: we need to find the change in each weight such that the error becomes minimal. To do so we calculate dE/dW1 and dE/dW2.

[Figure: Change in weights (backward propagation)]
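To make these derivatives concrete, here is a minimal sketch assuming a squared error E = (1/2)(t - Out)^2, a sigmoid activation Out = sigmoid(Y), and the summation Y = W1X1 + W2X2 + b (these particular choices are assumptions for illustration). Applying the chain rule:

    dE/dW1 = (dE/dOut) * (dOut/dY) * (dY/dW1) = -(t - Out) * Out(1 - Out) * X1
    dE/dW2 = (dE/dOut) * (dOut/dY) * (dY/dW2) = -(t - Out) * Out(1 - Out) * X2

Here -(t - Out) comes from differentiating the error, Out(1 - Out) is the derivative of the sigmoid, and X1 or X2 comes from differentiating the summation with respect to the corresponding weight.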

Once you have calculated the changes in the weights with respect to the error, the next step is to update the weights using the gradient descent procedure. Please check more details on gradient descent here.

[Figure: New weights]

Forward propagation and backward propagation are repeated over all samples until the error reaches a minimum value.
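Putting the pieces together, here is a minimal sketch of this loop for a single sigmoid neuron, assuming a squared-error loss, a learning rate of 0.1, and a tiny made-up dataset; real networks use more layers and more principled initialization.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Tiny made-up binary-classification dataset (illustrative values only)
    X = np.array([[0.1, 0.9], [0.8, 0.2], [0.9, 0.7], [0.2, 0.1]])
    t = np.array([1.0, 0.0, 1.0, 0.0])       # target labels

    rng = np.random.default_rng(0)
    w = rng.normal(size=2)                   # random weight initialization
    b = 0.0
    lr = 0.1                                 # learning rate

    for epoch in range(1000):
        # Forward propagation: summation then activation
        y = X @ w + b
        out = sigmoid(y)
        error = 0.5 * np.sum((t - out) ** 2)

        # Backward propagation (chain rule): dE/dW = dE/dOut * dOut/dY * dY/dW
        delta = -(t - out) * out * (1.0 - out)
        dE_dw = X.T @ delta
        dE_db = np.sum(delta)

        # Gradient descent update: new weight = old weight - learning rate * gradient
        w -= lr * dE_dw
        b -= lr * dE_db

    print(error, out.round(3))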

Wrapping Up

Here we learned the basics of what a neuron is and how a neural network works. But this is not enough: there are many problems in updating the weights, such as neuron death, weights going out of range, etc. I will be coming up with a new article on the challenges with activation functions and on how dropping neurons can reduce the training error.

References

Chain Rule, CliffsNotes, https://www.cliffsnotes.com/study-guides/calculus/calculus/the-derivative/chain-rule

Hope you enjoyed it! Stay tuned, and please comment with any queries or suggestions.
