Activation Functions in Neural Networks

The motive, use cases, advantages and limitations

Hamza Mahmood
Towards Data Science
9 min read · Dec 31, 2018


tl;dr: This post discusses the linear and non-linear activation functions used in deep learning and neural networks. We look at how each function performs in different situations, along with its advantages and disadvantages, and conclude with one last activation function that outperforms the others discussed here in a natural language processing application.

Introduction

Continuing with our series on neural networks, today we discuss what activation functions are and try to understand the underlying concept. We take a deep dive into both the linear and non-linear versions of these functions and explore where they work best and where they fall short.

Before we move on to what activation functions are, let's first refresh our memory on how neural networks operate. A neural network is a complex mesh of artificial neurons that imitates how the brain works. It takes in input parameters with their associated weights and biases, and we compute the weighted sum of the 'activated' neurons. Our activation function decides which neurons push their values forward into the next layer. How that works is what we will see next.

The Underlying Concept

In the forward propagation step, our artificial neurons receive inputs from different parameters or features. Each input has its own value, along with a weight and bias, and the inputs can display an interdependency that changes the final prediction value. This phenomenon is referred to as the interaction effect.

A good example is fitting a regression model on a dataset of diabetic patients. The goal is to predict whether a person runs the risk of diabetes based on their body weight and height. Some body weights indicate a greater risk of diabetes for a person of shorter height than for a taller person, who relatively has a better health index. There are of course other parameters that we are not considering at the moment. In this case, we say there is an interaction effect between height and body weight.

The activation function accounts for the interaction effects between the different parameters and applies a transformation, after which it decides which neurons pass their values forward into the next layer.
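To make this concrete, here is a minimal sketch of a single artificial neuron in NumPy; the feature values, weights, and bias below are made up for illustration. We compute the weighted sum of the inputs plus the bias and then apply an activation, here a simple binary step, to decide what gets passed forward.

```python
import numpy as np

def step(z):
    """Binary step activation: fires (1) only if the weighted sum is positive."""
    return np.where(z > 0, 1.0, 0.0)

# Hypothetical inputs (e.g. body weight, height) with made-up weights and bias.
x = np.array([70.0, 1.65])   # feature values
w = np.array([0.04, -1.5])   # one weight per feature
b = 0.2                      # bias term

z = np.dot(w, x) + b         # weighted sum of inputs plus bias
a = step(z)                  # activation decides what moves to the next layer
print(z, a)
```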

Linear function

We start off with the simplest function: the linear (or identity) function. The value of f(z) increases proportionally with the value of z, where the input z is the weighted sum of the inputs plus the bias of a neuron in a layer. The linear function addresses a limitation of the binary step function, which reports only the values 0 and 1.

Fig 1: Performance of Linear or Identity Activation Function

The output of the function is not confined to any range; that is, the value of f(z) can go from negative to positive infinity. This is not necessarily a problem, as we can proceed into the next or final layer by taking the max value of the neurons that have fired after the computation. Apart from that, the linear activation function has its own set of disadvantages:

  • We observe that the function’s derivative is a constant. That means the gradient during gradient descent is constant as well, since it has no relation to the value of z.
  • Our model is not really learning as it does not improve upon the error term, which is the whole point of the neural network.
  • Since the activation is linear, nesting 2 or N hidden layers with the same function has no real effect. The N layers could basically be squashed into one layer, as the sketch after this list shows.
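Here is that sketch in NumPy, with arbitrary random weights: composing two linear layers gives exactly the same output as one collapsed linear layer whose weights and bias are computed directly from the two.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "hidden layers" with linear (identity) activation, arbitrary weights.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through the two linear layers.
h = W1 @ x + b1
y = W2 @ h + b2

# The same result from a single collapsed linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
y_collapsed = W @ x + b

print(np.allclose(y, y_collapsed))  # True: the two layers squash into one
```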

We see that this function is not fit to handle complex patterns. So, in order to fix this issue, we use non-linear functions that enable our model to learn iteratively.

Sigmoid Function (σ)

The Sigmoid function takes a value as input and outputs another value between 0 and 1. It is non-linear and easy to work with when constructing a neural network model. The good part about this function is that it is continuously differentiable over different values of z and has a fixed output range.

Fig 2: Performance of Sigmoid Activation Function

On observation, we see that the value of f(z) increases, but at a very slow rate. The mathematical reason is that as z (on the x-axis) increases, the value of e^(-z) becomes infinitesimally small and the value of f(z) approaches 1 at some point. In other words, the function is susceptible to the vanishing gradient problem that we discussed in the previous lecture. The high-level problem is that models using the sigmoid activation are slow learners and, in the experimentation phase, generate prediction values with lower accuracy.
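A minimal sketch of the sigmoid and its derivative in NumPy (the z values are hand-picked for illustration) shows how quickly the gradient shrinks as |z| grows:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(z))       # outputs stay between 0 and 1
print(sigmoid_grad(z))  # peaks at 0.25 for z = 0 and vanishes for large |z|
```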

Another issue with this function arises when we have multiple hidden layers in our neural network. All the values coming out of this activation function are positive, and sigmoid churns out values of different magnitudes within the 0–1 range, which makes optimization harder.

Disadvantages aside, we do see the sigmoid function used in binary classification models, as part of the output layer, to capture the probability ranging from 0 to 1. In most cases, however, we encounter problems that involve multiple classes, and for such models we use the Softmax function (the details of this function are explained in the previous lecture).
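As a quick aside, here is one common way to write the softmax in NumPy; this is just a sketch (not taken from the lecture), with the usual max-subtraction for numerical stability:

```python
import numpy as np

def softmax(z):
    """Softmax: turns a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)   # subtract the max for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

scores = np.array([2.0, 1.0, 0.1])  # hypothetical output-layer scores for 3 classes
print(softmax(scores))              # approximately [0.659, 0.242, 0.099]
```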

Tanh Function

The Tanh function is a modified, scaled version of the sigmoid function. What we saw with sigmoid was that the value of f(z) is bounded between 0 and 1; in the case of Tanh, the values are bounded between -1 and 1.

This is neat in the sense that we are able to get values of different signs, which helps us establish which scores to consider in the next layer and which to ignore. However, this function still has the vanishing gradient problem seen in the sigmoid function. The model slows down exponentially beyond the +2 to -2 range: outside this narrow range, the change in gradient is very small.

Fig 3: Performance of Tanh Activation Function

One last bit about Tanh and Sigmoid: the gradient, or derivative, of the Tanh function is steeper than that of the sigmoid function, which we can observe in figure 4 below. Our choice of using Sigmoid or Tanh really depends on the requirement of the gradient for the problem statement.

Fig 4: Comparison of Sigmoid and Tanh Activation Functions
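The same comparison can be made numerically; in this small NumPy sketch (values chosen arbitrarily), tanh's gradient peaks at 1 while sigmoid's peaks at 0.25:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid_grad(z))  # peaks at 0.25, so updates are comparatively small
print(tanh_grad(z))     # peaks at 1.0, a steeper gradient around z = 0
```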

Rectified Linear Unit Function (ReLU)

The Rectified Linear Unit, or ReLU for short, is probably the most commonly used activation function in deep learning models. The function simply outputs 0 if it receives any negative input, but for any positive value z it returns that value back, like a linear function. So it can be written as f(z) = max(0, z).
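A minimal NumPy sketch of ReLU and its gradient, matching the definition above (the example values are arbitrary):

```python
import numpy as np

def relu(z):
    """ReLU activation: 0 for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Gradient of ReLU: 0 for z < 0, 1 for z > 0 (undefined at exactly 0; set to 0 here)."""
    return np.where(z > 0, 1.0, 0.0)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
```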

Fig 5: Performance of ReLU Activation Function

However, it should be noted that the ReLU function is still non-linear so we are able to backpropagate the errors and have multiple layers of neurons. This function was quickly adopted, as ReLU took care of several problems faced by the Sigmoid and the Tanh:

  • The ReLU function has a derivative of 0 over half of its range which spans across all the negative numbers. For positive inputs, the derivative is 1. So we have rectified the ‘vanishing’ gradient problem.
  • At a given time, only a few neurons are activated, making the network sparse and therefore efficient (we will see that sparsity is not always a good thing).
  • It is computationally economical compared to Sigmoid and Tanh.

ReLU has its own set of limitations and disadvantages despite being a better activation function than its other non-linear alternatives:

  • The function suffers from the dying ReLU problem: for activations corresponding to values of z < 0, the gradient will be 0, because of which the weights will not get adjusted during gradient descent in backpropagation. Such neurons stop responding to variations in error/input, so part of the network becomes passive due to the added sparsity.
  • It is best used between the input and output layers, more specifically within the hidden layers.

As a final note on ReLU, there is a way to counter the dying ReLU problem by using a modified version of the function called the 'Leaky' ReLU. To put it briefly, we take the z < 0 values that form the y = 0 line and convert that segment into a non-horizontal straight line by adding a small, non-zero, constant gradient α (normally, α = 0.01). So our new formula for z < 0 is f(z) = αz.
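A small NumPy sketch of Leaky ReLU with α = 0.01 as described above (example values are arbitrary); the only change from ReLU is the small slope on the negative side, which keeps the gradient non-zero there:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: identity for z >= 0, a small slope alpha * z for z < 0."""
    return np.where(z >= 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    """Gradient: 1 for z >= 0, alpha for z < 0, so negative inputs still learn."""
    return np.where(z >= 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(z))       # [-0.03  -0.005  0.5  3.  ]
print(leaky_relu_grad(z))  # [0.01 0.01 1.   1.  ]
```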

For more information, I have attached a related research paper on the Parametric ReLU (a variant of Leaky ReLU with a learned α): https://arxiv.org/abs/1502.01852

Bonus: Non-linear Cube Activation Function

There is one interesting activation function, applied in natural language processing, that works better than all the other functions mentioned in this article. The paper by Danqi Chen and Christopher Manning (2014) proposed a novel way of learning a neural network classifier for use in a greedy, transition-based dependency parser, using word, part-of-speech (POS) tag, and arc-label features. They introduce a non-linear cube function, denoted z³, which is further elaborated in the equation of the hidden layer below:

Fig 6: Cube Activation function that takes weights of the words, tags, and labels along with its bias.

The equation above signifies the weighted sum of the input units, which in this case are the word, tag, and label embeddings, plus the associated bias. The sum is then raised to the power of 3 to compute the hidden-layer output.
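A minimal sketch of that hidden layer in NumPy, following the description above; the embedding sizes, hidden size, and weights below are made up for illustration. There are separate weight matrices for the word, tag, and label inputs, a shared bias, and an element-wise cube.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dimensions for the word, POS tag, and arc-label embedding inputs.
d_word, d_tag, d_label, d_hidden = 8, 4, 4, 16

x_word = rng.normal(size=d_word)    # concatenated word embeddings
x_tag = rng.normal(size=d_tag)      # concatenated tag embeddings
x_label = rng.normal(size=d_label)  # concatenated label embeddings

W_word = rng.normal(size=(d_hidden, d_word))
W_tag = rng.normal(size=(d_hidden, d_tag))
W_label = rng.normal(size=(d_hidden, d_label))
b = rng.normal(size=d_hidden)

# Weighted sum of the three input groups plus the bias, then the cube activation.
z = W_word @ x_word + W_tag @ x_tag + W_label @ x_label + b
h = z ** 3                          # element-wise cube: the non-linear cube activation
print(h.shape)                      # (16,)
```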

The performance of this particular activation function can be seen in the graph plot below. It is not to scale, but we can see that the error terms are reduced as learning happens at an exponential rate in this case.

Fig 7: Performance of Cube (z³) Activation Function

The paper is thorough in its experiments, where the model is rigorously tested against other state-of-the-art models. To quickly draw a comparison between the activation functions with respect to performance, we observe that:

Non-Linear Cube > ReLU > Tanh > Sigmoid

To understand more about how the model runs, I have attached the link of Manning’s paper below.

A Fast and Accurate Dependency Parser using Neural Networks:
https://cs.stanford.edu/~danqi/papers/emnlp2014.pdf

Conclusion

We learned about the numerous activation functions that are used in ML models. Researchers use these functions to compare what works best for a given problem statement; there is no hard and fast rule for selecting a particular activation function. Rather, it depends upon the model's architecture, the hyperparameters, and the features that we are attempting to capture. Ideally, we use the ReLU function in our base models, but we can always try out others if we are not able to reach an optimal result.

As a last note, feel free to comment with your own versions of non-linear activation functions that you have designed, along with a high-level overview and the scenarios in which they perform best.

