
5 Must-Know Activation Functions Used in Neural Networks

The essence of non-linearity.

Photo by Drew Patrick Miller on Unsplash

The universal approximation theorem implies that a neural network can approximate any continuous function that maps inputs (X) to outputs (y). This ability to represent virtually any function is what makes neural networks so powerful and widely used.

To be able to approximate any function, we need non-linearity. That’s where activation functions come into play: they add non-linearity to neural networks. Without activation functions, a neural network is just a composition of linear models, which itself is only another linear model.
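To see why, here is a minimal numpy sketch showing that two stacked linear layers collapse into a single linear layer; the weight matrices and the input are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear "layers" without activation functions: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping collapses into a single linear layer: y = W @ x + b
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```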

Neural networks are built from layers that contain many nodes, so the building process starts with a single node. The following represents a node without an activation function.

A neuron without an activation function (image by author)

The output y is a linear combination of inputs and a bias. We need to somehow add an element of non-linearity. Consider the following node structure.

A neuron with an activation function (image by author)

Non-linearity is achieved by applying an activation function to the sum of the linear combination of inputs and bias. The added non-linearity depends on the activation function.
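A node like the one above can be sketched in a few lines of numpy; the inputs, weights, bias, and the choice of sigmoid as the activation are only illustrative:

```python
import numpy as np

def neuron(x, w, b, activation=None):
    # The node output: a linear combination of the inputs plus the bias,
    # optionally passed through an activation function.
    z = np.dot(w, x) + b
    return z if activation is None else activation(z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # weights
b = 0.2                          # bias

print(neuron(x, w, b))           # linear output
print(neuron(x, w, b, sigmoid))  # non-linear output squashed into (0, 1)
```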

In this post, we will talk about 5 commonly used activation functions in neural networks.


1. Sigmoid

The sigmoid function squashes its input into the range between 0 and 1. It is also the function used in logistic regression models.

(image by author)

Whatever the input values to a sigmoid function are, the output values will be between 0 and 1. Thus, the output of each neuron is normalized into the range 0–1.

(image by author)

The output (y) is most sensitive to changes in the input (x) for x values close to 0. As the input moves away from zero, the output becomes less sensitive. Beyond some point, even a large change in the input results in little to no change in the output. That is how the sigmoid function achieves non-linearity.
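This saturation behaviour is easy to check numerically; here is a small sketch (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)) squashes any real input into (0, 1)
    return 1 / (1 + np.exp(-x))

for x in [-10, -2, 0, 2, 10]:
    print(f"x = {x:>3}, sigmoid(x) = {sigmoid(x):.4f}")
# Near x = 0 the output changes quickly; far from 0 it barely moves at all.
```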

There is a downside associated with this non-linearity. Let’s first see the derivative of the sigmoid function.

(image by author)

The derivative tends towards zero as we move away from zero. The "learning" process of a neural network depends on the derivative because the weights are updated based on the gradient, which is essentially a derivative. If the gradient is very close to zero, the weights are updated in very small increments. This results in a neural network that learns very slowly and takes forever to converge. This is known as the vanishing gradient problem.
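The derivative can be written in terms of the sigmoid itself, sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), and evaluating it at a few (arbitrary) points shows how quickly it vanishes:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1 - s)

for x in [0, 2, 5, 10]:
    print(f"x = {x:>2}, derivative = {sigmoid_derivative(x):.6f}")
# The gradient shrinks towards zero as |x| grows, so weight updates become tiny.
```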


2. Tanh (Hyperbolic Tangent)

Tanh is very similar to the sigmoid, except that its output values are in the range of -1 to +1. Thus, tanh is said to be zero-centered.

(image by author)

The difference between the sigmoid and tanh is that sigmoid outputs are always positive, so the gradients of the weights feeding a neuron in the next layer all share the same sign and can only move in one direction together during an update; tanh's zero-centered outputs remove this restriction. Thus, tanh is likely to converge faster than the sigmoid function.

The vanishing gradient problem also exists for the tanh activation function.
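The zero-centering is easy to see numerically; here is a quick comparison of tanh and sigmoid on the same (arbitrary) sample inputs:

```python
import numpy as np

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])

print(np.tanh(x))            # zero-centered, in the range (-1, 1)
print(1 / (1 + np.exp(-x)))  # sigmoid: always positive, in the range (0, 1)
# tanh(0) = 0 while sigmoid(0) = 0.5, so tanh outputs are centered around zero.
```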


3. ReLU (Rectified Linear Unit)

The ReLU function is only interested in the positive values: it keeps input values greater than 0 as they are, and all input values less than zero become 0.

(image by author)

The linear output of a neuron (the weighted sum plus bias) can turn out to be negative. If we apply the ReLU function to that output, the neuron returns 0. Thus, ReLU effectively switches off some of the neurons.

With ReLU, only some of the neurons are active at any time, whereas with tanh and sigmoid all of the neurons are always active, which makes the computations heavier. Thus, ReLU tends to converge faster than tanh and sigmoid.

The derivative of ReLU is 0 for input values less than 0. For those values, the weights are never updated during back-propagation, so the neuron stops learning. This issue is known as the dying ReLU problem.
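A minimal sketch of ReLU and its derivative makes the issue visible (the sample inputs are arbitrary):

```python
import numpy as np

def relu(x):
    # Keep positive values as they are and map negative values to 0.
    return np.maximum(0, x)

def relu_derivative(x):
    # The derivative is 1 for positive inputs and 0 for negative inputs.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
# A neuron whose inputs stay negative receives zero gradient and stops learning.
```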


4. Leaky ReLU

Leaky ReLU can be considered a solution to the dying ReLU problem. Instead of returning 0, it outputs a small value for negative inputs.

(image by author)

Although leaky ReLU seems to solve the dying ReLU problem, some argue that it does not make a significant difference in accuracy in most cases. I guess it comes down to trying both and seeing if there is any difference for a particular task.
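As a reference point, here is a minimal sketch of leaky ReLU; the slope of 0.01 for negative inputs is a common default but is only an illustrative choice:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Positive inputs pass through unchanged; negative inputs are scaled by alpha.
    return np.where(x > 0, x, alpha * x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(leaky_relu(x))  # [-0.05 -0.01  0.    1.    5.  ]
# The derivative for negative inputs is alpha instead of 0, so the neuron keeps learning.
```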


5. Softmax

Softmax is usually used in multi-class classification tasks and applied to the output layer. It normalizes the output values into a probability distribution in which the probabilities add up to 1.

The softmax function divides the exponential of each output by the sum of the exponentials of all the outputs. The resulting values form a probability distribution with probabilities that add up to 1.
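A direct numpy implementation of this definition might look like the following sketch:

```python
import numpy as np

def softmax(x, axis=0):
    # Subtracting the maximum does not change the result but prevents overflow.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

scores = np.array([1.0, 2.0, 3.0, 4.0])
print(softmax(scores))        # a probability distribution
print(softmax(scores).sum())  # 1.0
```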

Let’s do an example. Consider a case in which the target variable has 4 classes. The following is the output of the neural network for 5 different data points (i.e. observations).

Each column represents the output for an observation (image by author)

We can apply the softmax function to these outputs as follows:

(image by author)

In the first line, we applied the softmax function to the values in matrix A. The second line reduced the floating point precision to 2 decimals.
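The code in the figure is roughly equivalent to the following sketch, assuming scipy.special.softmax and numpy are used (the matrix below is a random placeholder, not the actual matrix A from the figure):

```python
import numpy as np
from scipy.special import softmax

# Placeholder for the 4 x 5 output matrix A; each column is one observation.
A = np.random.randn(4, 5)

B = softmax(A, axis=0)  # first line: apply softmax column-wise to A
B = np.round(B, 2)      # second line: reduce the precision to 2 decimals

print(B.sum(axis=0))    # each column adds up to (approximately) 1
```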

Here is the output of the softmax function.

(image by author)

As you can see, the probability values in each column add up to 1.


Conclusion

We have discussed 5 different activation functions used in neural networks. Using activation functions is a must in neural networks in order to add non-linearity.

There is no free lunch! Activation functions also put a burden on neural networks in terms of computational complexity, and they have an impact on how quickly models converge.

It is important to know the properties of the activation functions and how they behave so that we can choose the one that best fits a particular task.

In general, the desired properties of an activation function are:

  • Computationally inexpensive
  • Zero-centered
  • Differentiable. The derivative of an activation function needs to carry information about the input values because the weights are updated based on the gradients.
  • Does not cause the vanishing gradient problem

Thank you for reading. Please let me know if you have any feedback.

