Fantastic activation functions and when to use them

Top 10 Activation functions, their pros, cons, when to use them, and a cheat sheet

Adiamaan Keerthi
Towards Data Science


Image by Author

Activation functions are an important component of an ML model, along with weights and biases. They have been instrumental in making the training of deep neural networks a reality, and they are a constantly evolving field of study. In this blog post, we pit them against each other and look at their pros, cons, when to use them, and how they evolved.

Paperswithcode lists 48 activation functions, with new and improved ones proposed every year. In this article, I cover 10 of the most frequently referenced activation functions, which are representative of the rest.

Activation functions fall into two types based on how they are used in an ML model.

  1. Activation functions that are used in output layers of ML models (think classification problems). The primary purpose of these activation functions is to squash the value between a bounded range like 0 to 1.
  2. Activation functions that are used in hidden layers of neural networks. The primary purpose of these activation functions is to provide non-linearity without which neural networks cannot model non-linear relationships.

Activation functions that are used in the hidden layer should ideally satisfy the following conditions,

  1. Non-linear, to let the neural network learn non-linear relationships. The universal approximation theorem states that a neural network with a single hidden layer and a non-linear activation function can approximate any continuous function.
  2. Unbounded, to enable faster learning and avoid saturating early. When the range is infinite, gradient-based learning is more efficient.
  3. Continuously differentiable. This property is desirable though not decisive. ReLU is a glaring example: the function is not differentiable at x = 0, yet it performs very well in practice.

Cheatsheet

A cheat sheet version of the article for the impatient,

Image by Author

Output layer activation functions

Sigmoid

One of the oldest activation functions, the sigmoid squashes its input to a value between 0 and 1, making it a natural fit for modeling probabilities. The function is differentiable but, because it is bounded, it saturates quickly, leading to vanishing gradients when used in a deep neural network. The exponential is also costly to compute, and that cost compounds when training a model with hundreds of layers and neurons.

The function is bounded between 0 and 1, and its derivative peaks at 0.25 near zero and is close to zero for inputs outside roughly (-3, 3). The output is not symmetric around zero, so the gradients of a layer's weights all take the same sign during an update, making it unsuitable for hidden layers.

The wise old sigmoid. Image by Author

Use: It is generally used in logistic regression and binary classification models in the output layer.
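
A minimal NumPy sketch (my own illustration, not from the article) of the sigmoid and its derivative σ'(x) = σ(x)(1 − σ(x)), showing how the gradient collapses toward zero for large |x|:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^-x): squashes any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)): peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-6.0, -3.0, 0.0, 3.0, 6.0])
print(sigmoid(x))       # outputs approach 0 or 1 at the extremes
print(sigmoid_grad(x))  # gradients are near zero outside roughly (-3, 3)
```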

Softmax

An extension of the sigmoid that takes advantage of its 0-to-1 output range. It is mainly used in the output layer of multiclass (multinomial) classification problems, with the useful property that the output probabilities sum to 1. Put another way, each output neuron's score is exponentiated and then normalized so that the outputs sum to 1.

The utilitarian softmax. Image by Author

Use: Multinomial and multiclass classification.
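
A small sketch of softmax in NumPy. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick (an addition of mine, not something the article prescribes):

```python
import numpy as np

def softmax(logits):
    # shift by the max for numerical stability; the result is unchanged
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # approx. [0.659 0.242 0.099]
print(probs.sum())  # 1.0 -- the outputs form a probability distribution
```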

Tanh

Widely used through the 1990s and 2000s, tanh overcomes a disadvantage of the sigmoid by extending the output range to (-1, 1). The zero-centered output keeps the mean of a hidden layer's activations near zero, which makes learning easier and faster. The function is differentiable and smooth, but the exponentials come at a computational cost. It still saturates quickly, and vanishing gradients creep in when it is used in the hidden layers of a deep neural network. Its derivative is steeper than the sigmoid's.

The trusty tanh. Image by Author

Use: Can be used in the hidden layers of RNNs, but there are better alternatives like ReLU.
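
A quick sketch comparing tanh and sigmoid outputs, illustrating the zero-centered range of tanh (my own example values):

```python
import numpy as np

x = np.linspace(-3, 3, 7)
tanh_out = np.tanh(x)                   # zero-centred, in (-1, 1)
sigmoid_out = 1.0 / (1.0 + np.exp(-x))  # not zero-centred, in (0, 1)

print(tanh_out.mean())     # 0 for symmetric inputs
print(sigmoid_out.mean())  # 0.5 for symmetric inputs
# tanh is a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 / (1.0 + np.exp(-2 * x)) - 1.0))
```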

Though sigmoid and tanh can be used in hidden layers, their bounded outputs make training saturate quickly, and the resulting vanishing gradients make them impractical in a deep neural network setting.

Hidden layer activation functions

Enter ReLU

The Rectified Linear Unit is the rockstar of activation functions. It is the most widely used and a go-to activation function for most types of problems. It has been around since 2010 and has been studied extensively. It outputs 0 for negative inputs but is unbounded in the positive direction. This mix of boundedness and unboundedness acts as a built-in regularizer, which is handy for deep neural networks: the resulting sparse representations make both training and inference computationally efficient.

Positive unboundedness accelerates the convergence of gradient descent while remaining computationally cheap. The major disadvantage of ReLU is dead neurons: because negative inputs are clamped to 0, neurons pushed into the negative regime early in training can stay switched off and never recover. The function also switches abruptly at x = 0 from the linear regime (x > 0) to the constant 0 regime (x ≤ 0), so it is not continuously differentiable. In practice, though, a low learning rate and careful bias initialization keep this from noticeably hurting performance.

The rockstar ReLU. Image by Author

Use: CNNs, RNNs, and other deep neural networks.
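
A minimal sketch of ReLU and its (sub)gradient, using 0 as the conventional gradient at x = 0 (a common convention, not one the article specifies):

```python
import numpy as np

def relu(x):
    # max(0, x): negative inputs are clamped to zero, positive inputs pass through
    return np.maximum(0.0, x)

def relu_grad(x):
    # subgradient: 1 for x > 0, 0 otherwise (the kink at 0 handled by convention)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0.  0.  0.  1.  1. ]
# note how many activations are exactly zero -- the source of ReLU's sparsity
```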

ReLU is the clear winner among hidden layer activation functions. Since 2010, its pros and cons have been studied closely, and new activation functions have been proposed that aim to keep its strengths while addressing its weaknesses.

Improvements to ReLU

Leaky ReLU

Introduced in 2013, Leaky ReLU directly addresses ReLU's main concern by letting negative values through with a small slope. This keeps gradients flowing for negative inputs and so overcomes ReLU's dead neuron problem. Because negative values are scaled down heavily, learning on that side of the function is slow, and even with this improvement Leaky ReLU is not universally better than ReLU.

The leaky ReLU. Image by Author

Use: Tasks involving sparse gradients, like GANs.
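
A sketch of Leaky ReLU with a negative slope of 0.01 (a conventional default I am assuming here, not a value prescribed by the article):

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # positive inputs pass through; negative inputs are scaled down instead of zeroed
    return np.where(x > 0, x, negative_slope * x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(leaky_relu(x))  # [-0.1  -0.01  0.  1.  10.] -- negative values (and their gradients) survive
```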

Parametric ReLU

Parametric ReLU improves on Leaky ReLU: the negative slope a is not chosen arbitrarily but learned from the data. This is a double-edged sword, because the model becomes sensitive to the scaling parameter a and performs differently for different learned values of a.

The trainable parametric ReLU. Image by Author.

Use: Can be used to fix the dead neuron problem when Leaky ReLU doesn't work.
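
A toy NumPy sketch of PReLU, where the negative slope a is a trainable parameter. The initialization of 0.25 and the single hand-rolled SGD step are illustrative assumptions, not the article's recipe:

```python
import numpy as np

def prelu(x, a):
    # like Leaky ReLU, but the negative slope `a` is a learnable parameter
    return np.where(x > 0, x, a * x)

def prelu_grad_a(x, upstream_grad):
    # gradient of the loss w.r.t. `a`: only negative inputs contribute (dy/da = x there)
    return np.sum(upstream_grad * np.where(x > 0, 0.0, x))

a = 0.25                               # assumed initial slope
x = np.array([-2.0, -0.5, 1.0, 3.0])
upstream = np.ones_like(x)             # pretend dL/dy = 1 for every element
print(prelu(x, a))                     # [-0.5   -0.125  1.     3.   ]
a -= 0.01 * prelu_grad_a(x, upstream)  # one toy SGD step on the slope itself
print(a)                               # the slope moves with the data
```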

ELU

The Exponential Linear Unit, introduced in 2015, is unbounded on the positive side and uses an exponential curve for negative values. This is a slightly different way of dealing with the dead neuron problem than Leaky and Parametric ReLU: unlike ReLU, negative inputs are smoothly saturated toward a bounded value, which avoids dead neurons. Modeling the negative side with an exponential makes it costly to compute, and the exponential can occasionally lead to exploding gradients under a sub-optimal initialization strategy.

The expensive ELU. Image by Author.
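
A minimal ELU sketch with α = 1.0 (the usual default, assumed here), showing how negative inputs saturate smoothly at −α instead of being clamped to zero:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs; alpha * (e^x - 1) for negative inputs, saturating at -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))  # approx. [-0.993 -0.632  0.  1.  5.] -- the negative side flattens toward -1.0
```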

GeLU

The Gaussian Error Linear Unit, introduced in 2016, is the new kid on the block and the clear winner for NLP-related tasks. It is used in SOTA models like GPT-3, BERT, and other transformer-based architectures. GeLU combines ideas from dropout (randomly zeroing neurons for a sparse network), zoneout (keeping the previous value), and ReLU. It weights inputs by their percentile (via the Gaussian CDF) rather than gating them by sign, giving a smoother version of ReLU.

The trendy GeLU. Image by Author.

Use: NLP, computer vision, and speech recognition
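
A sketch of GeLU in both its exact form, x·Φ(x) with Φ the standard Gaussian CDF, and the widely used tanh approximation (the approximation constants come from the GELU paper; the SciPy dependency is my choice for computing erf):

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # x * Phi(x): the input weighted by the standard Gaussian CDF
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # common tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu_exact(x))
print(gelu_tanh(x))  # very close to the exact values
```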

Swish

Introduced by Google researchers in 2017, Swish suppresses large negative values (their outputs and gradients go to roughly 0) while preserving small negative values, which are still useful for capturing underlying patterns. It can be used as a drop-in replacement for ReLU, and its derivative has an interesting non-monotonic shape.

The Google Swish. Image by Author.

Use: It matches or outperforms ReLU in image classification and machine translation.
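
A sketch of Swish, x·sigmoid(βx), with β = 1 (the form also known as SiLU; the β = 1 choice is an assumption for illustration):

```python
import numpy as np

def swish(x, beta=1.0):
    # x * sigmoid(beta * x); with beta = 1 this is also called SiLU
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(swish(x))  # large negatives go to ~0, small negatives stay slightly negative
```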

For more cheat sheets,
https://github.com/adiamaan92/mlnotes_handwritten

About Author:

I am a Data Scientist working in Oil 🛢️ & Gas ⛽. If you like my content, follow me 👍🏽 on LinkedIn, Medium, and GitHub. Subscribe to get an alert whenever I post on Medium.
