Analyzing different types of activation functions in neural networks — which one to prefer?

vikashraj luhaniwal
Towards Data Science
May 7, 2019


While building a neural network, one of the mandatory choices we need to make is which activation function to use. In fact, it is an unavoidable choice because activation functions are the foundation for a neural network to learn and approximate any kind of complex, continuous relationship between variables. Simply put, they add non-linearity to the network.

Any neuron (except input layer neurons) calculates the weighted sum of its inputs, adds some bias, and then applies an activation function to it. Before talking more about activation functions, let's first see why one is required.

Additional NOTE

This article assumes that the reader has basic knowledge of the concepts of neural networks, forward and backward propagation, weight initialization, optimization algorithms, etc. If you are not familiar with them, I would recommend following my other articles on these topics.

Forward propagation in neural networks — Simplified math and code version

Why better weight initialization is important in neural networks?

Why Gradient descent isn’t enough: A comprehensive introduction to optimization algorithms in neural networks

Need for an activation function

Refer to the neural network below with two hidden layers, an input layer, and an output layer. This network is suitable for a three-class (ternary) classification problem.

Consider a case where no activation function is used in this network. The weighted sum of inputs calculated in hidden layer 1 is passed directly to hidden layer 2, which calculates its own weighted sum and passes it on to the output layer, which in turn calculates a weighted sum of its inputs to produce the output. The output can be presented as

output = W3 (W2 (W1 x + b1) + b2) + b3 = (W3 W2 W1) x + (W3 W2 b1 + W3 b2 + b3)

So the output is simply a linear transformation of the weights and inputs, and no non-linearity is added to the network. This network is therefore equivalent to a linear regression model, which can only capture linear relationships between variables, i.e. a model with limited power that is not suitable for complex problems such as image classification, object detection, language translation, etc. The short sketch after the list below makes this concrete.

So a neural network without an activation function

  1. can only represent a linear relationship between variables.
  2. does not satisfy the universal approximation theorem.
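
To see this in practice, here is a minimal NumPy sketch (the layer sizes and the random weights are made up for illustration) showing that stacking purely linear layers collapses into a single linear transformation.

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up shapes for a 4 -> 3 -> 3 -> 3 network with no activation functions.
    x = rng.normal(size=(4,))
    W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3,))
    W2, b2 = rng.normal(size=(3, 3)), rng.normal(size=(3,))
    W3, b3 = rng.normal(size=(3, 3)), rng.normal(size=(3,))

    # Layer-by-layer forward pass without any activation function.
    out = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

    # The same mapping collapsed into one weight matrix and one bias vector.
    W = W3 @ W2 @ W1
    b = W3 @ W2 @ b1 + W3 @ b2 + b3
    print(np.allclose(out, W @ x + b))  # True: the whole network is one linear map

However deep such a stack is, it can never represent anything beyond a single linear layer.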

Popular activation functions

There are many activation functions available. Since an activation function is just a mathematical function, you can even come up with your own. Commonly used activation functions are

  1. Logistic
  2. tanh ( hyperbolic tangent)
  3. ReLu (Rectified linear units)
  4. Leaky ReLu

1. Logistic activation function

It is an “S”-shaped curve with the equation

f(x) = 1 / (1 + e^(-x))

It ranges from 0 to 1 and is also referred to as the sigmoid activation function. The weighted sum of inputs is applied to it as input.

For a large positive input it produces an output close to 1, so the neuron tends to fire, and for a large negative input it produces an output close to 0, so the neuron tends not to fire.

The derivative of an activation function is needed during backpropagation, where the derivative of the loss w.r.t. each parameter is calculated. The derivative of the logistic function is

f'(x) = f(x)(1 - f(x))
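
As a quick sanity check, here is a small NumPy sketch of the logistic function and its derivative (the function names are my own, not from any particular library):

    import numpy as np

    def logistic(x):
        # f(x) = 1 / (1 + e^(-x)); output always lies in (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def logistic_grad(x):
        # f'(x) = f(x) * (1 - f(x))
        f = logistic(x)
        return f * (1.0 - f)

    x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
    print(logistic(x))       # ~[0.00005, 0.27, 0.5, 0.73, 0.99995]
    print(logistic_grad(x))  # peaks at 0.25 at x = 0, ~0 for large |x| (saturation)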

Issues with the logistic activation function:

1. Saturation problem

A neuron is said to be saturated if its output reaches its peak value, either the maximum or the minimum.

Saturation: when f(x) = 0 or f(x) = 1, the derivative f'(x) = f(x)(1 - f(x)) = 0.

Why do we care about Saturation?

Refer to the neural network below, and assume the weight w211 needs to be updated during backpropagation using the gradient descent update rule.

Now if h21 turns out to be 1, its derivative will be 0, so there is no update to the weight w211. This problem is known as the vanishing gradient problem: the gradient of the weight vanishes, i.e. goes to zero.

So with the logistic activation function, a saturated neuron may cause the gradient to vanish, and the network then refuses to learn or keeps learning at a very slow rate.
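
A minimal sketch of that update, with made-up numbers, shows how a saturated neuron kills the gradient of the weight feeding it:

    import numpy as np

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Made-up values: h11 is the input to the neuron, w211 its weight, and
    # upstream_grad is dLoss/dh21 arriving from the layers above.
    h11, w211, upstream_grad, lr = 0.9, 8.0, 0.5, 0.1

    a21 = w211 * h11                 # large pre-activation -> neuron saturates
    h21 = logistic(a21)              # ~0.999, close to its peak value of 1
    local_grad = h21 * (1.0 - h21)   # f'(a21) ~ 0.0007

    grad_w211 = upstream_grad * local_grad * h11   # chain rule
    print(grad_w211)                 # tiny gradient
    print(w211 - lr * grad_w211)     # w211 barely changes in the update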

2. Not a zero-centered function

A function having equal mass on both sides of the zero line (x-axis) is known as a zero-centered function. In other words, the output of a zero-centered function can be either negative or positive.

In the case of the logistic activation function, the output is always positive and accumulates on one side (the positive side) only, so it is not a zero-centered function.

Why do we care about zero-centered functions?

Let us assume the weights w311 and w312 need to be updated during backpropagation using the gradient descent update rule.

h21 and h22 will always be positive because of the logistic activation function. The gradients of w311 and w312 can be positive or negative depending on the value of the common part.

So the gradients of all the weights connected to the same neuron are either all positive or all negative. Hence, during the update, these weights are only allowed to move in certain directions, not in all possible directions, which makes optimization harder.

It is similar to a situation where you are only allowed to move left and forward, never right and backward; it then becomes very hard to reach the desired destination.
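
Here is a tiny sketch (with invented numbers) of why all-positive activations force the gradients of w311 and w312 to share a sign: both gradients equal the same common part multiplied by a positive activation.

    # Gradients of two weights feeding the same output neuron:
    #   dL/dw311 = common_part * h21
    #   dL/dw312 = common_part * h22
    # With a logistic activation, h21 and h22 are always positive,
    # so both gradients take the sign of common_part.

    h21, h22 = 0.8, 0.3                  # always positive after a logistic activation
    for common_part in (+2.0, -2.0):     # made-up upstream gradient values
        grad_w311 = common_part * h21
        grad_w312 = common_part * h22
        print(common_part, grad_w311, grad_w312)
    # Both gradients are positive together or negative together, so the update
    # can never increase one of these weights while decreasing the other.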

Having understood this, let us list the issues associated with the logistic activation function:

  1. A saturated logistic neuron causes the gradient to vanish.
  2. It is not a zero-centered function.
  3. Because of the exponential e^x, it is computationally expensive, which makes convergence slower.

2. Tanh (Hyperbolic tangent)

It is similar to the logistic activation function, with the mathematical equation

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

The output ranges from -1 to 1 and has equal mass on both sides of the zero axis, so it is a zero-centered function. Thus tanh overcomes the non-zero-centered issue of the logistic activation function, optimization becomes comparatively easier, and it is generally preferred over the logistic function.

Still, a tanh-activated neuron may saturate and cause the vanishing gradient problem.

The derivative of the tanh activation function is

f'(x) = 1 - f(x)^2
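
A small NumPy sketch of tanh and its derivative (again, the function names are mine):

    import numpy as np

    def tanh(x):
        # f(x) = (e^x - e^(-x)) / (e^x + e^(-x)); np.tanh computes the same thing
        return np.tanh(x)

    def tanh_grad(x):
        # f'(x) = 1 - f(x)^2
        return 1.0 - np.tanh(x) ** 2

    x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
    print(tanh(x))       # outputs lie in (-1, 1) and are centered around 0
    print(tanh_grad(x))  # peaks at 1 at x = 0, ~0 for large |x| (still saturates)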

Issues with tanh activation function:

  1. A saturated tanh neuron causes the gradient to vanish.
  2. Because of e^x, it is computationally expensive.

3. ReLu (Rectified linear units)

It is the most commonly used function because of its simplicity. It is defined as

f(x) = max(0, x)

If the input is a positive number the function returns the number itself, and if the input is negative the function returns 0.

The derivative of the ReLu activation function is given as

f'(x) = 1 for x > 0, and 0 otherwise
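
A minimal NumPy version of ReLu and its derivative (taking the derivative at x = 0 to be 0, a common convention):

    import numpy as np

    def relu(x):
        # f(x) = max(0, x)
        return np.maximum(0.0, x)

    def relu_grad(x):
        # f'(x) = 1 for x > 0, 0 otherwise
        return (x > 0).astype(float)

    x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    print(relu(x))       # [0.  0.  0.  0.5 3. ]
    print(relu_grad(x))  # [0. 0. 0. 1. 1.]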

Advantages of ReLu activation function

  1. Easy to compute.
  2. Does not saturate for positive values of the weighted sum of inputs.

Because of its simplicity, ReLu is used as the standard activation function in CNNs.

But still, ReLu is not a zero-centered function.

Issues with ReLu activation function

ReLu is defined as max(0, w1x1 + w2x2 + …+b)

Now consider a case where the bias b takes on (or is initialized to) a large negative value. The pre-activation w1x1 + w2x2 + … + b is then negative for every input, the output of the ReLu is 0, and the gradient flowing through the neuron is also 0, so its weights are never updated. The neuron is said to have died. In this way, up to 50% of ReLu-activated neurons may die during the training phase.
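
A short sketch with made-up weights and a large negative bias illustrates the dead neuron: the pre-activation stays negative for every input, so both the output and the gradient are zero and the weights never move.

    import numpy as np

    rng = np.random.default_rng(1)

    w = np.array([0.5, -0.2])          # made-up weights of a single ReLu neuron
    b = -10.0                          # large negative bias
    X = rng.normal(size=(5, 2))        # a few random inputs

    pre = X @ w + b                    # pre-activation: negative for every input
    out = np.maximum(0.0, pre)         # ReLu output: all zeros
    grad = (pre > 0).astype(float)     # ReLu derivative: all zeros

    print(out)   # [0. 0. 0. 0. 0.] -> the neuron never fires
    print(grad)  # [0. 0. 0. 0. 0.] -> no gradient flows back, so the weights never update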

To overcome this problem, two solutions can be proposed

  1. Initialize the bias (b) to a small positive value so the neuron starts off in the active region.
  2. Use another variant of ReLu known as Leaky ReLu.

4. Leaky ReLu

It was proposed to fix the dying-neuron problem of ReLu. It introduces a small slope in the negative region to keep the updates alive for neurons whose weighted sum of inputs is negative. It is defined as

f(x) = max(0.01x, x)

If the input is a positive number the function returns the number itself, and if the input is negative it returns the input scaled by 0.01 (or some other small value).

LeakyReLu curve

The derivative of LeakyReLu is given as

f'(x) = 1 for x > 0, and 0.01 otherwise
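
A minimal NumPy version of LeakyReLu and its derivative (assuming the 0.01 slope used above):

    import numpy as np

    def leaky_relu(x, alpha=0.01):
        # f(x) = x for x > 0, alpha * x otherwise
        return np.where(x > 0, x, alpha * x)

    def leaky_relu_grad(x, alpha=0.01):
        # f'(x) = 1 for x > 0, alpha otherwise
        return np.where(x > 0, 1.0, alpha)

    x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    print(leaky_relu(x))       # [-0.03  -0.005  0.     0.5    3.   ]
    print(leaky_relu_grad(x))  # [0.01 0.01 0.01 1.   1.  ]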

Advantages of LeakyReLu

  1. No saturation problem in either the positive or the negative region.
  2. The neurons do not die, because the 0.01x term ensures that at least a small gradient flows through. Although the weight update will be small, after a few iterations the neuron can move out of its dead state.
  3. Easy to compute.
  4. Close to a zero-centered function.

Softmax activation function

For a classification problem, the output needs to be a probability distribution with a probability value for each class. For a binary classification problem the logistic activation function works well, but not for a multiclass classification problem; Softmax is therefore used for multiclass classification.

The softmax activation function is again a type of sigmoid function. As the name suggests, it is a “soft” flavor of the max function: instead of selecting only the single maximum value, it assigns the largest portion of the distribution to the maximal element, while the smaller elements still receive some share of it.

The softmax equation is

softmax(z_i) = e^(z_i) / Σ_j e^(z_j)

Softmax is generally used in the output layer, where we are trying to get probabilities for the different classes. It is applied to the weighted sum of inputs obtained in the output layer.

Need for the Softmax activation function

Consider a ternary classification problem where the preactivation outputs in the output layer are [2.4, -0.6, 1.2]. A preactivation output can be any real value.

Now suppose we directly calculate the probability of each class by taking one value and dividing it by the sum of all the values. We then get [0.8, -0.2, 0.4]. Since a probability can never be negative, the predicted “probability” for class 2 is not acceptable. Therefore we use the softmax activation function in the output layer for multiclass classification problems.

Even if an input value is negative, softmax still returns a positive value because e^x is always positive.
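
The following sketch reproduces this example: naive normalization of [2.4, -0.6, 1.2] produces a negative “probability”, while softmax gives a valid distribution.

    import numpy as np

    z = np.array([2.4, -0.6, 1.2])   # preactivation outputs from the example above

    # Naive normalization: divide each value by the sum of all values.
    print(z / z.sum())               # [ 0.8 -0.2  0.4] -> a negative "probability"

    # Softmax: exponentiate first (subtracting the max for numerical stability),
    # then normalize; every output is positive and they sum to 1.
    exp_z = np.exp(z - z.max())
    print(exp_z / exp_z.sum())       # ~[0.74, 0.04, 0.22]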

End Notes: Now which one to prefer?

Having understood all these activation functions, the interesting question now is: which one should we use?

Well, there is no hard and fast rule for choosing an activation function. The decision depends entirely on the properties of the problem, i.e. the relationship you are trying to approximate. Based on those properties, you can try different activation functions and select the one that gives faster convergence and faster learning, or whichever performs best according to the evaluation metrics of your choice.

  • As a rule of thumb, you can start with ReLu as a general approximator and switch to other functions if ReLu does not give good results.
  • For CNNs, ReLu is treated as the standard activation function, but if it suffers from dead neurons then switch to LeakyReLu.
  • Always remember that ReLu should only be used in hidden layers.
  • For classification, sigmoid-type functions (logistic, tanh, softmax) and their combinations work well, but they may suffer from the vanishing gradient problem.
  • For RNNs, tanh is preferred as the standard activation function.


AI practitioner and technical consultant with five years of work experience in the field of data science, machine learning, big data, and programming.