
We know that deep learning models are a combination of different components such as activation functions, batch normalization, momentum, gradient descent, etc. So for this blog I picked one part of DL, the activation function, and will give a detailed explanation of it by answering the following questions:
- What is an activation function?
- Why do we need an activation function in a neural network and what will happen if we do not use any?
- What are the desired properties of an activation function?
- What are the different types of activation function and their uses?
For this blog, I assume that you all have a basic understanding of neural networks. So without further ado, let’s delve deeper into Activation Functions.
Activation Function
An activation function is a function that transforms the output signal of a node before it is passed on to the next layer of a neural network. This transformation helps the network learn complex patterns in the data. That’s it! I also didn’t believe at first that it’s this simple, but it’s true.

Need for an activation function
Well yeah, you already know the basic reason by now: it is necessary to enable neural networks to learn complex patterns in the data. But how does it achieve that? Activation functions are, in general, non-linear functions that add non-linearity to the neural network, thus allowing it to learn more complex patterns. Still not clear? Let’s look at an example of a neural network to understand this.

For this example, please assume that we do not add a bias term, and consider a simple two-layer network with weight matrices W1 and W2. We can write the output Y as:

Y = Act(W2 · Act(W1 · X))

where Act represents the output of our activation function (which is a non-linear transformation). Now assume that we do not have any activation function in the network; then our Y would look something like this:

Y = W2 · (W1 · X)

If you carefully look at the above equation, then:

Y = (W2 · W1) · X = W · X,  where W = W2 · W1
This means that even if we have two layers in our network, the input-output relation can actually be defined by a single weight matrix (the product of the two weight matrices). Hence we see that, in the absence of an activation function, it doesn’t matter how many layers we add to our neural network: they can all be simplified and represented by just one weight matrix. But when we add activation functions, they introduce a non-linear transformation, which prevents us from collapsing multiple layers of the neural network in this way.
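Here is a minimal NumPy sketch of this collapse (the shapes and random values are made up purely for illustration): two linear layers without an activation behave exactly like one layer, while adding a ReLU in between breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal(4)          # input vector
W1 = rng.standard_normal((3, 4))    # first layer weights
W2 = rng.standard_normal((2, 3))    # second layer weights

# Without an activation, two layers collapse into a single weight matrix
two_layers = W2 @ (W1 @ x)
single_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, single_layer))          # True

# With a non-linearity in between, the collapse no longer holds
relu = lambda z: np.maximum(0.0, z)
print(np.allclose(W2 @ relu(W1 @ x), single_layer))   # False (in general)
```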
Another important role of an activation function is that it helps limit the output of a neuron to a range we require. This is important because the input to the activation function is **W·x + b**, where **W** is the weight matrix of the node, **x** is the input, and **b** is the bias added to that. If not restricted, this value can reach a very large magnitude, especially in very deep neural networks that work with millions of parameters. This in turn leads to computational issues as well as value overflow.
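As a quick illustration (the values below are arbitrary), a squashing activation such as tanh or sigmoid keeps a neuron’s output in a fixed range no matter how large the pre-activation W·x + b grows:

```python
import numpy as np

# Pretend these are pre-activation values W·x + b from a very deep network
z = np.array([-50.0, 0.0, 50.0, 500.0])

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

print(z)             # unbounded raw values
print(np.tanh(z))    # squashed into (-1, 1)
print(sigmoid(z))    # squashed into (0, 1)
```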
I know that without concrete examples of activation functions it’s difficult to digest all the facts I am throwing at you, but bear with me. We will soon go over all of these concepts with specific activation functions, but they need to be covered here for your better understanding.
Desired properties of an activation function
- As it’s clear from above, activation functions should be non-linear.
- Activation functions are applied after each layer of a neural network; hence, it is always desirable that they be computationally efficient.
- The main requirement for all the components of a neural network is that they should be differentiable; hence, an activation function should also **be differentiable**.
- One important aspect while designing an activation function is to prevent the vanishing gradient problem. Explaining this problem in depth is beyond the scope of this blog, but here is the gist: during backpropagation, the local derivative of the activation function gets multiplied at every layer, so if the derivative of the activation function w.r.t. its input is always much smaller than 1 in magnitude, the gradient shrinks towards zero as it passes through many layers (see the short sketch below).
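A rough numeric sketch of that gist (the layer count and inputs are chosen arbitrarily): backpropagation multiplies the local derivative of the activation at every layer, so a derivative that is always well below 1, like the sigmoid’s, shrinks the gradient exponentially, while ReLU’s derivative of 1 for positive inputs passes it through unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                       # at most 0.25, reached at z = 0

def relu_grad(z):
    return (np.asarray(z) > 0).astype(float)   # 1 for positive inputs, else 0

# Backprop through n layers multiplies roughly n such local derivatives
n_layers = 20
print(sigmoid_grad(0.0) ** n_layers)   # ~9.1e-13 -> gradient has vanished
print(relu_grad(1.0) ** n_layers)      # 1.0 -> gradient survives
```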
Different types of activation functions
ReLU
ReLU stands for Rectified Linear Unit and is defined as f(x) = max(0, x).

This is a widely used activation function, especially with CNNs (Convolutional Neural Networks). It is easy to compute, does not saturate for positive inputs, and does not cause the vanishing gradient problem. But it does have an issue: for a negative input, its output becomes zero. Since the output is zero for all negative inputs, some nodes can die completely and stop learning anything. To handle this problem, Leaky ReLU or Parametric ReLU is used, defined as f(x) = max(αx, x), where α is a small slope for negative inputs (fixed in Leaky ReLU, learned in Parametric ReLU).
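A simple NumPy sketch of both functions (the α = 0.01 default below is just a common illustrative choice, not a fixed rule):

```python
import numpy as np

def relu(x):
    """ReLU: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: f(x) = max(alpha * x, x), keeps a small slope for x < 0."""
    return np.maximum(alpha * x, x)

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(relu(x))        # [0. 0. 0. 2.]
print(leaky_relu(x))  # [-0.03 -0.01  0.    2.  ]
```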

Sigmoid
The sigmoid function, defined as f(x) = 1 / (1 + e^(-x)), is computationally expensive, causes the vanishing gradient problem, and is not zero-centered. It is generally used for binary classification problems, and only at the end of a neural network, to convert the output into the range [0, 1]. This function is in general not used inside a neural network.
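For reference, a minimal sigmoid in NumPy (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: f(x) = 1 / (1 + e^(-x)), maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-4.0, 0.0, 4.0])
print(sigmoid(x))   # ≈ [0.018 0.5   0.982]
```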

Softmax
It is used in multi-class classification problems. Like sigmoid, it produces values in the range 0–1, but its outputs also sum to 1, so they can be read as class probabilities; therefore, it is used as the final layer in classification models.
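A small sketch of softmax (the three logits are made up) showing that its outputs form a probability distribution over the classes:

```python
import numpy as np

def softmax(x):
    """Softmax: exponentiate and normalize so the outputs sum to 1.
    Subtracting max(x) first is a standard trick to avoid overflow."""
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
probs = softmax(logits)
print(probs)         # ≈ [0.659 0.242 0.099]
print(probs.sum())   # 1.0 (up to floating-point rounding)
```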
Conclusions
I hope that after this explanation you now have a better understanding of why neural networks need an activation function, and of the properties and types of activation functions. Let us know, by commenting on this blog, if you found it helpful.
Follow us on Medium for more such content.