Comparison of Activation Functions for Deep Neural Networks

Step, Linear, Sigmoid, Hyperbolic Tangent, Softmax, ReLU, Leaky ReLU, and Swish functions are explained with hands-on examples!

Ayyüce Kızrak, Ph.D.
Towards Data Science

--

🔥 Activation functions play a key role in neural networks, so it is essential to understand their advantages and disadvantages in order to achieve better performance.

Let’s start by introducing the non-linear activation functions that are alternatives to the best-known sigmoid function. It is important to remember that many different conditions matter when evaluating the final performance of an activation function, and the mathematics, in particular the derivative, deserves special attention here. So, if you’re ready, let’s roll up our sleeves and get our hands dirty! 🙌🏻


What is an artificial neural network? Let us remember this first: it is a machine learning technique based on modeling/imitating the way living things learn through their nervous systems. The nervous system learns through a hierarchical process of sensing electrical signals. The electrical impulses received from our receptors allow us to learn, remember and memorize everything we have seen, heard, felt and thought about since we were born. Neuroscience is a very deep and intriguing field of study.

Why Do We Need Activation Functions?

CS231n: Convolutional Neural Networks for Visual Recognition

We need the activation function to introduce the non-linear properties of the real world into artificial neural networks. Basically, in a simple neural network, x denotes the inputs and w the weights; we apply the activation f to the weighted input, and the resulting value is passed to the output of the network. This then becomes the final output or the input of another layer.
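Here is a minimal sketch of that idea in NumPy; the numbers and variable names are made up for illustration, and sigmoid stands in for a generic f:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single neuron: weighted sum of the inputs plus a bias, then activate.
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # weights
b = 0.2                          # bias

output = sigmoid(np.dot(w, x) + b)  # becomes the final output or the input of the next layer
print(output)
```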

Why can’t we simply pass this signal to the output without activating it?

If no activation function is applied, the output signal becomes a simple linear function. Linear functions are only first-degree polynomials. A neural network without activations acts as a linear regression with limited learning power. But we want our neural network to learn non-linear relationships, because we will feed it complex real-world information such as images, video, text, and sound. Multilayered deep neural networks can learn meaningful features from such data.

So Why Are Non-Linear Functions Needed?

Functions with a degree higher than one are called non-linear functions. Artificial neural networks are designed as universal function approximators and are intended to fulfill that role. This means they must have the ability to compute and learn any function. Non-linear activation functions are what make this stronger learning possible. In fact, this whole article revolves around that issue 😇

In order to calculate the error gradients with respect to the weights, the back-propagation algorithm is applied to the artificial neural network. We also need an optimization strategy that minimizes the error rate; selecting the appropriate optimization algorithm is a separate topic of its own.

Input. Times Weight. Add a bias. Activate!

ACTIVATION FUNCTIONS

Step Function

Step Function and Derivative

It is a function that outputs a binary value and is used as a binary classifier. Therefore, it is generally preferred in output layers. It is not recommended in hidden layers because its derivative is zero almost everywhere, so no gradient can flow back through it and no learning takes place. Then let’s think of a differentiable function; the linear function comes to mind immediately.
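A quick NumPy sketch of the step function and its derivative, assuming the threshold sits at 0:

```python
import numpy as np

def step(x):
    # 1 for x >= 0, 0 otherwise: a hard binary decision
    return np.where(x >= 0, 1.0, 0.0)

def step_derivative(x):
    # The derivative is 0 everywhere (undefined exactly at 0),
    # so no gradient can flow back through this function.
    return np.zeros_like(x)

x = np.array([-2.0, 0.0, 3.0])
print(step(x))             # [0. 1. 1.]
print(step_derivative(x))  # [0. 0. 0.]
```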

Linear Function

Linear Function and Derivative

It generates a range of activation values, not just the binary values of the step function. It certainly allows you to connect several neurons (nerve cells) together. But this function has a major problem: its derivative is constant. Why do we need the derivative at all, and what is the downside of it being constant? As we said, the learning process for neurons is performed with the backpropagation algorithm, and this algorithm relies on derivatives. When A = cx is differentiated with respect to x, we get c. This means the gradient has no relationship with x whatsoever. If the derivative is always a constant value, can we really say that learning is taking place? Unfortunately, no!

There is another problem! When the linear function is used in all layers, the same linear behavior is carried from the input layer all the way to the output layer: a linear combination of linear functions is just another linear function. This means that all the layers we stacked at the very beginning collapse into a single equivalent layer! 🙄
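A tiny NumPy demonstration of that collapse, with made-up weight matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))        # input vector
W1 = rng.normal(size=(5, 4))     # first "linear layer"
W2 = rng.normal(size=(3, 5))     # second "linear layer"

# Two linear layers without an activation...
two_layers = W2 @ (W1 @ x)
# ...collapse into one equivalent linear layer with weights W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True: depth adds nothing without non-linearity
```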

Sigmoid Function

Imagine that most problems in nature are non-linear, and combinations of the sigmoid function are non-linear. Bingo!

Sigmoid Function and Derivative

Then we can stack the layers 😃 So let’s think of non-binary functions. Unlike the step function, it is also differentiable, which means that learning can happen. If we examine the graph, for x between -2 and +2 the y values change quickly: small changes in x produce large changes in y. This means it can serve as a good classifier. Another advantage of this function is that, unlike the linear function, it squashes (-infinity, +infinity) into the range (0, 1). So the activation values stay bounded and do not blow up, and this is good news! 🎈

The sigmoid function is the most frequently used activation function, but there are many other, more efficient alternatives.

So what’s the problem with sigmoid function?

If we look carefully at the graph, towards the ends of the function the y values react very little to changes in x. Let’s think about what kind of problem that is! 🤔 The derivative values in these regions are very small and converge to 0. This is called the vanishing gradient, and learning becomes minimal; if the gradient is 0, there is no learning at all! When learning slows down like this, the optimization algorithm that minimizes the error can get stuck in local minima, and we cannot get maximum performance out of the artificial neural network model. So let’s continue our search for an alternative activation function! 🔎
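A minimal sketch of the sigmoid and its derivative in NumPy; note how small the gradient gets away from 0:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum 0.25 at x = 0

for x in [0.0, 2.0, 10.0]:
    print(x, sigmoid(x), sigmoid_derivative(x))
# At x = 10 the derivative is about 4.5e-05: the gradient has (almost) vanished.
```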

Hyperbolic Tangent Function

Hyperbolic Tangent and Derivative

It has a structure very similar to the sigmoid function. However, this time the output range is (-1, +1). Its advantage over the sigmoid is that its derivative is steeper, so it can take larger values. With a wider output range and stronger gradients, learning tends to be faster. But the problem of vanishing gradients at the ends of the function remains. Although it is a very common activation function, we will continue our search for the best one!
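The same kind of sketch for the hyperbolic tangent:

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2   # maximum 1.0 at x = 0

for x in [0.0, 2.0, 10.0]:
    print(x, tanh(x), tanh_derivative(x))
# The derivative at 0 is 1.0 (vs. 0.25 for sigmoid), but it still vanishes for large |x|.
```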

ReLU (Rectified Linear Unit) Function

ReLU Function and Derivative

At first glance, it appears to behave like the linear function on the positive axis. But above all, ReLU is not linear, and it is a good approximator: any function can be approximated with combinations of ReLU. Great! That means we can still stack the layers in our artificial neural network (again) 😄

ReLU takes values in [0, +∞), but what does that buy us? Let’s imagine a large neural network with many neurons. Sigmoid and hyperbolic tangent cause almost all neurons to be activated in the same way, so the activation is very dense. What we want instead is for only some of the neurons in the network to fire, i.e. sparse activation and an efficient computational load, and we get exactly that with ReLU. Having a value of 0 on the negative axis also means the network runs faster. Because its computational load is lower than that of the sigmoid and hyperbolic tangent functions, it has become the preferred choice for multi-layer networks. Super! 😎 But even ReLU isn’t perfect. Why? Because of that very zero-valued region that gives us the speed: its gradient there is 0, so no learning happens in that region (the so-called dying ReLU problem). 😕 Then we need to find a new activation function with a trick.
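A minimal NumPy sketch of ReLU and its derivative:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 on the positive axis, 0 on the negative axis:
    # cheap to compute, but no learning signal for negative inputs.
    return np.where(x > 0, 1.0, 0.0)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))             # [0.  0.  0.  0.5 3. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```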

Leaky-ReLU Function

💧Can you see the leak on the negative plane?😲

Leaky ReLU Function and Derivative

The leak is usually given a small slope such as 0.01; if a different random value near zero is used instead, the function is called Randomized Leaky ReLU. (No, not yet another new function?! 😱) The domain of Leaky ReLU still extends to minus infinity on the negative axis, but the outputs there are small, non-zero values. This keeps alive the gradients that died in ReLU’s negative region, so learning can also happen for negative inputs. How smart is that? 🤓
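A minimal NumPy sketch of Leaky ReLU with the usual 0.01 leak:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha is the small "leak" slope on the negative axis
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # Never exactly 0, so gradients keep flowing for negative inputs too.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(x))             # [-0.03  -0.005  0.5    3.   ]
print(leaky_relu_derivative(x))  # [0.01 0.01 1.   1.  ]
```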

Softmax Function

It has a structure very similar to the sigmoid function. Like the sigmoid, it performs quite well when used as a classifier. The most important difference is that it is preferred in the output layer of deep learning models, especially when more than two classes must be distinguished. It determines the probability that the input belongs to a particular class by producing values in the range (0, 1) that sum to 1, so it provides a probabilistic interpretation.
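A minimal NumPy sketch of the softmax over three hypothetical class scores:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw scores for three classes
probs = softmax(logits)
print(probs)          # approx. [0.659 0.242 0.099]
print(probs.sum())    # 1.0 -> a proper probability distribution over the classes
```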

Swish (A Self-Gated) Function

Swish Function and Derivative

The most important difference from ReLU is in the negative region. Leaky ReLU keeps a small non-zero value there, so what is different here? All the other activation functions are monotonic, whereas the output of the swish function may decrease even when the input increases. This is an interesting, swish-specific feature.

f(x) = x * sigmoid(beta * x)

Beta can be treated as a learnable parameter. If beta = 0, we get a simple version of Swish: the sigmoid part is always 1/2, so f(x) = x/2 is linear. On the other hand, if beta is a very large value, the sigmoid becomes nearly a binary (0-1) step function (0 for x<0, 1 for x>0), and f(x) converges to the ReLU function. The standard Swish function uses beta = 1. In this way, Swish provides a smooth interpolation between the linear function and ReLU. Excellent! A remedy for the vanishing gradient problem has been found.
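A minimal NumPy sketch of Swish with the beta parameter exposed, so both limits above are easy to check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x, beta=0.0))    # equals x / 2: the linear limit
print(swish(x, beta=1.0))    # standard Swish; note the small dip for negative x
print(swish(x, beta=50.0))   # approaches ReLU as beta grows
```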

Mathematical Expressions of Activation Functions

Check out here for gradient and partial derivative visualization!

WHICH ACTIVATION FUNCTION SHOULD BE PREFERRED?

Of course, I won’t tell you to use this one or that one, because I have listed the unique advantages and disadvantages of each activation function. The sigmoid or hyperbolic tangent functions can still be used if you accept that the model may learn a little more slowly because of their saturating ranges. But if your network is very deep and the computational load is a major concern, ReLU is the better choice. You can then decide to use Leaky ReLU as a solution to the dying/vanishing gradient problem of ReLU, at the cost of a bit more computation than ReLU.

So the activation function is a critical choice that you need to make based on all of this information and the requirements of your deep learning model.

  • Easy and fast convergence of the network can be the first criterion.
  • ReLU will be advantageous in terms of speed, but you have to accept that some gradients will die/vanish. It is usually used in hidden layers rather than the output.
  • Leaky ReLU can be the first remedy for the dying/vanishing gradients of ReLU.
  • For deep learning models, it is advisable to start experiments with ReLU.
  • Softmax is usually used in output layers.

You can find countless articles comparing them. My best advice is to get your hands dirty! So, test them yourself, if you’re ready…

DEFINITION AND PLOTTING OF ACTIVATION FUNCTIONS

First of all, let’s take a look at defining and plotting the activation functions:

🌈You can find the codes in Google Colab.

Demonstration of Activation Functions
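If you don’t want to open the notebook, a short matplotlib sketch along the same lines might look like this (the exact plotting style in the Colab notebook may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)
activations = {
    "sigmoid": 1 / (1 + np.exp(-x)),
    "tanh": np.tanh(x),
    "ReLU": np.maximum(0, x),
    "Leaky ReLU": np.where(x > 0, x, 0.01 * x),
    "Swish": x / (1 + np.exp(-x)),
}

# Plot all of the activation functions on a single set of axes.
for name, y in activations.items():
    plt.plot(x, y, label=name)
plt.legend()
plt.title("Activation functions")
plt.grid(True)
plt.show()
```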

EVALUATION OF PERFORMANCE OF ACTIVATION FUNCTIONS

Let’s take a look at a comparison of the activation functions for a convolutional neural network model on the classic MNIST benchmark dataset.

Summary of Deep Learning Model Used

When the model with 2 convolution layers is trained with the Sigmoid, Hyperbolic Tangent, ReLU, Leaky ReLU, and Swish functions, you can observe how some of them pull ahead of the others and how close some of them are. You can also test on different datasets and examine the effect of other parameters such as the number of epochs, the batch size, and dropout. Maybe one of these will be the subject of my next article!
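As a rough sketch of how such a comparison could be set up in Keras (the layer sizes, optimizer, and other hyperparameters below are illustrative assumptions, not necessarily those of the original notebook):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(activation):
    # A small CNN with 2 convolution layers, roughly matching the setup described above.
    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3), activation=activation),
        layers.Conv2D(64, (3, 3), activation=activation),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation=activation),
        layers.Dense(10, activation="softmax"),  # softmax in the output layer
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Load and normalize MNIST.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0

# Train the same architecture with different activations and compare test results.
for act in ["sigmoid", "tanh", "relu", tf.nn.leaky_relu, tf.nn.swish]:
    model = build_cnn(act)
    model.fit(x_train, y_train, epochs=20, batch_size=128,
              validation_split=0.1, verbose=0)
    loss, acc = model.evaluate(x_test, y_test, verbose=0)
    print(act, "test loss:", loss, "test accuracy:", acc)
```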

The validation and training accuracy and loss values are given for 20 epochs. The test results are also shown in the table below.

Comparison of Validation and Training for Different Activation Functions (TRAINING)
Comparison of Validation and Training for Different Activation Functions (LOSS)
Comparison of Activation Functions for Deep Neural Networks by Merve Ayyüce Kızrak is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

🌈You can find the codes in Google Colab 👇🏻


🌈How do activation functions help create non-linear decision boundaries? You can find an additional application here!

--

AI Specialist @Digital Transformation Office, Presidency of the Republic of Türkiye | Academics @Bahçeşehir University | http://www.ayyucekizrak.com/