What makes neural networks so special is their ability to model highly complex relationships between inputs and outputs. Linear models cannot capture such relationships, so neural networks need a way to represent non-linearity. This is why we use activation functions: without them, a neural network is nothing more than a stack of linear models.
Activation functions add learning power to neural networks. The data that neural networks typically work with, such as images, text, and sound waves, involve highly non-linear relationships. Thus, we need non-linearity to solve the most common tasks in deep learning, such as image and voice recognition, natural language processing, and so on.
A neuron without an activation function is just a linear combination of inputs and a bias.

This is just one neuron; a typical hidden layer in a neural network contains many of them. Without activation functions, each neuron is still only a linear combination of the inputs. Adding more hidden layers gives us linear combinations of linear combinations of the inputs, which is still linear and adds no real complexity, as the small sketch below shows. When we add activation functions to the neurons, we leave this curse of linearity behind and can model complex, non-linear relationships.
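Here is a quick sketch of that point with toy numpy arrays (the sizes and random weights are arbitrary, just for illustration): two stacked linear layers collapse into a single linear layer.
import numpy as np
rng = np.random.default_rng(0)
x = rng.normal(size=3)                                   # a toy input vector
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)     # first linear "layer"
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)     # second linear "layer"
two_layers = W2 @ (W1 @ x + b1) + b2                     # linear of a linear...
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)               # ...is still just one linear map
print(np.allclose(two_layers, one_layer))                # True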

There are many different activation functions available, each with its own pros and cons and specific characteristics. We are interested not only in the activation function itself but also in its derivative. The reason lies in the way neural networks actually "learn".
Inputs are multiplied by weights and a bias is added. Then the activation function is applied to produce the output. This process is known as forward-propagation. The output of the neural network is compared to the actual target value and the difference (or loss) is calculated. The information about the loss is fed back through the network and the weights are updated so that the loss is reduced. This process is called back-propagation. The weights are updated using the gradient descent algorithm, which is based on derivatives. Thus, the derivative of an activation function should also carry information about the input values.
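As a rough illustration of one such update (the toy numbers, squared-error loss, and learning rate below are my own choices, not from any particular network):
w = 0.5                          # current weight
x, target = 2.0, 3.0             # one training example
pred = w * x                     # forward-propagation (no activation, for simplicity)
loss = (pred - target) ** 2      # squared-error loss
grad = 2 * (pred - target) * x   # derivative of the loss with respect to w
lr = 0.1                         # learning rate
w = w - lr * grad                # gradient descent step: move against the gradient
print(w, loss)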
In this post, we will cover the most commonly used non-linear activation functions in neural networks.
Sigmoid
We are familiar with the sigmoid function from logistic regression. It is the famous S-shaped function that squashes input values into the range between 0 and 1.

Let’s plot it. We first create a numpy array and apply the sigmoid function to it.
import numpy as np
x = np.linspace(-10,10,50)      # 50 evenly spaced points between -10 and 10
y = 1 / (1 + np.exp(-x))        # sigmoid: 1 / (1 + e^(-x))
Then plot it using matplotlib.
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(8,5))
plt.title("Sigmoid Function", fontsize=15)
plt.grid()
plt.plot(x, y)

The output of each neuron is normalized between 0 and 1. In the middle region of the graph above, small changes in x (the input) create relatively big changes in y (the output). The sigmoid function is good at detecting differences in this region, which makes it a good classifier. Thus, it is commonly used in binary classification tasks.
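As a tiny example of reading the output as a probability (the scores below and the usual 0.5 cut-off are illustrative choices, not from the article):
scores = np.array([-2.0, 0.3, 4.0])      # raw outputs of a single output neuron
probs = 1 / (1 + np.exp(-scores))        # squashed into (0, 1)
preds = (probs >= 0.5).astype(int)       # binary decision at the usual 0.5 threshold
print(probs, preds)                      # roughly [0.12 0.57 0.98] -> [0 1 1]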
Nothing is perfect, unfortunately, and the sigmoid function has a downside. As we move away from the center, changes in the value of x create little or no change in the value of y. Let’s take a look at the derivative of the sigmoid function. We can use numpy's gradient function to compute it numerically:
x = np.linspace(-10,10,50)
dx = x[1] - x[0]                # spacing between consecutive points
y = 1 / (1 + np.exp(-x))        # sigmoid
dydx = np.gradient(y, dx)       # numerical derivative of y with respect to x
Then we plot both y and dydx on the same graph:
plt.figure(figsize=(8,5))
plt.title("Sigmoid Function and Its Derivative", fontsize=15)
plt.grid()
plt.plot(x, y)
plt.plot(x, dydx)

As we can see, the derivative tends towards zero as we move away from the center. Before commenting on this graph, let us remember how neural networks learn. Learning means updating the weights in order to minimize the loss (the difference between actual and predicted values). The weights are updated based on the gradient, which is essentially the derivative of a function. If the gradient is very close to zero, the weights are updated in very small increments. The result is a neural network that learns so slowly it takes forever to converge. This is known as the vanishing gradient problem.
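The closed form of the sigmoid derivative, sigmoid(x) * (1 - sigmoid(x)), makes this concrete; a quick check, reusing numpy from above:
def sig(t):
    return 1 / (1 + np.exp(-t))          # sigmoid
for t in [0, 2, 5, 10]:
    print(t, sig(t) * (1 - sig(t)))      # derivative: sig(t) * (1 - sig(t))
# roughly 0.25, 0.105, 0.0066, 0.000045 -- the gradient all but vanishes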
Tanh (Hyperbolic Tangent)
Tanh is very similar to the sigmoid function except that it is symmetric around the origin. The output values are bounded to the range (-1, +1).
x = np.linspace(-5,5,80)
y_tanh = 2*(1 / (1 + np.exp(-2*x))) - 1   # identity: tanh(x) = 2*sigmoid(2x) - 1, same as np.tanh(x)
plt.figure(figsize=(8,5))
plt.title("Tanh Function", fontsize=15)
plt.grid()
plt.plot(x, y_tanh)

Tanh is zero-centered, so gradients are not restricted to moving in one particular direction. Thus, it tends to converge faster than the sigmoid function.
The derivative of tanh is similar to the derivative of sigmoid, but it is steeper.
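We can plot it using the identity tanh'(x) = 1 - tanh(x)**2 (a small sketch, reusing the x array from above):
dy_tanh = 1 - np.tanh(x) ** 2             # derivative of tanh
plt.figure(figsize=(8,5))
plt.title("Tanh Function and Its Derivative", fontsize=15)
plt.grid()
plt.plot(x, y_tanh)
plt.plot(x, dy_tanh)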

As we can see from the derivative line, tanh also suffers from the vanishing gradient problem.
ReLU (Rectified Linear Unit)
The output of the ReLU function is equal to the input for inputs greater than 0. For all other inputs, the output is 0.
x = np.linspace(-10,10,50)
y_relu = np.where(x < 0, 0, x)
plt.figure(figsize=(8,5))
plt.title("ReLU Function", fontsize=15)
plt.grid()
plt.plot(x, y_relu)

ReLU activates only some of the neurons in the network (those with positive inputs), which makes it more computationally efficient than tanh and sigmoid. With tanh and sigmoid, all of the neurons produce non-zero activations, which results in denser computation. Thus, networks with ReLU tend to converge faster than those with tanh or sigmoid.
The derivative of ReLU is 1 for inputs greater than 0. For all other inputs, the derivative is 0, which means some of the weights are never updated during back-propagation. A neuron that only ever receives negative inputs therefore stops learning. This issue is known as the dying ReLU problem. One solution is to use the leaky ReLU function.
Leaky ReLU
It is the same as ReLU for positive inputs. For negative inputs, leaky ReLU outputs a small negative value (the input scaled by a small slope), whereas ReLU just gives 0.
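A small sketch of leaky ReLU (the 0.01 slope below is the commonly used default, not a value from the article):
x = np.linspace(-10,10,50)
y_leaky = np.where(x < 0, 0.01 * x, x)   # small negative slope instead of a hard 0
plt.figure(figsize=(8,5))
plt.title("Leaky ReLU Function", fontsize=15)
plt.grid()
plt.plot(x, y_leaky)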

Softmax
Softmax takes a vector of real-valued inputs and normalizes them into a probability distribution, where each probability is proportional to the exponential of its input value. Consider an output layer with 10 neurons: the softmax function takes these 10 outputs and turns them into a probability distribution whose 10 values add up to 1.
Softmax activation is used in classification tasks with multiple classes.
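A minimal numpy sketch of the idea (the scores are made up; subtracting the maximum before exponentiating is a standard trick for numerical stability):
scores = np.array([2.0, 1.0, 0.1])       # raw outputs of the last layer
exps = np.exp(scores - scores.max())     # exponentiate (shifted for stability)
probs = exps / exps.sum()                # normalize into a probability distribution
print(probs, probs.sum())                # probabilities add up to 1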
Swish
Swish is a self-gated activation function and relatively new compared to the ones we have discussed so far. It was proposed by researchers at Google. In terms of computational efficiency, it is similar to ReLU, but it performs better on deeper models. As stated by the researchers, "swish tends to work better than relu on deeper models across a number of challenging datasets".

Let’s plot the graph of swish.
x = np.linspace(-5,5,50)
y_swish = x * (1 / (1 + np.exp(-x)))   # swish(x) = x * sigmoid(x)
plt.figure(figsize=(8,5))
plt.title("Swish Function", fontsize=15)
plt.grid()
plt.plot(x, y_swish)

Unlike ReLU, swish does not have a sharp corner at 0, which can make training converge more easily.
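A quick zoomed-in comparison makes the difference visible (the narrow range below is my own choice for illustration):
x_zoom = np.linspace(-2,2,200)
plt.figure(figsize=(8,5))
plt.title("ReLU vs. Swish Near Zero", fontsize=15)
plt.grid()
plt.plot(x_zoom, np.where(x_zoom < 0, 0, x_zoom))        # ReLU: sharp corner at 0
plt.plot(x_zoom, x_zoom * (1 / (1 + np.exp(-x_zoom))))   # swish: smooth through 0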
We have discussed six different activation functions, each with its own pros and cons. Activation functions affect both the computational cost and the convergence of models, so it is worth understanding how they behave in order to choose the best one for the task at hand. In general, the desired properties of an activation function are:
- Computationally inexpensive
- Zero centered
- Differentiable. The derivative of an activation function needs to carry information about the input values because weights are updated based on the gradients.
- Not causing the vanishing gradient problem
Thank you for reading. Please let me know if you have any feedback.