Deep study of a not very deep neural network. Part 2: Activation functions

Rinat Maksutov
Towards Data Science
13 min read · May 1, 2018


This is the second part of the series, and this time we will explore activation functions. I assume that you already know what an activation function is and what role it plays in a neural network.

In our experiment we will be comparing the activation functions included in Keras, specifically:

  • Linear;
  • Sigmoid;
  • Hard Sigmoid;
  • TanH;
  • SoftSign;
  • ReLU;
  • Leaky ReLU;
  • Thresholded ReLU;
  • ELU;
  • SELU;
  • SoftPlus.

As in the previous part, we will stick to the RMSProp optimizer. In later parts of the series we will also evaluate how various activation functions work with different optimizers, but for now let’s get a first view of the activations. As for the data, we will train our networks on dataset-wise normalized data (because it performs equally well compared to the sample-wise variant) and on sample-wise standardized data (you know why, if you have read the previous part).
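To make the set-up concrete, here is a minimal sketch of the kind of experiment loop used in this series. The layer sizes, dropout rate and input dimension below are placeholders of my own choosing, not the exact configuration from the previous part; the real code is linked at the end of the article.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(activation, input_dim=784, n_classes=10):
    # A small fully-connected classifier. Layer sizes, dropout rate and
    # input dimension are placeholders, not the exact configuration used
    # in the experiment (see the linked repository for that).
    model = keras.Sequential([
        layers.Dense(512, activation=activation, input_shape=(input_dim,)),
        layers.Dropout(0.2),
        layers.Dense(512, activation=activation),
        layers.Dropout(0.2),
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer=keras.optimizers.RMSprop(),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# The same model is rebuilt for every activation under comparison;
# the parameterised variants (Leaky ReLU, Thresholded ReLU, ELU with a
# custom alpha) are added as separate layers instead of a string name.
for act in ['linear', 'sigmoid', 'hard_sigmoid', 'tanh', 'softsign',
            'relu', 'elu', 'selu', 'softplus']:
    model = build_model(activation=act)
```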

Below is a brief and not very scientific description of each activation function, just to give you an intuitive understanding of each one.

Linear

The Linear (also called Identity) activation function is one of the simplest possible activation functions: it passes its input straight through to the output. Nowadays it is almost never used when training neural networks, either in hidden or in final layers. Its domain and range are both (-Inf; +Inf).

Fig.1 Linear activation

Sigmoid

The Sigmoid activation function maps inputs from (-Inf; +Inf) to the range (0; 1) and looks like an S-shaped curve. It is generally the first choice when developing simple neural networks for learning purposes, but today it is usually avoided because of its lower quality compared to other activation functions.

It suffers from the vanishing gradient problem, where “in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value”.
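For reference, the sigmoid is sigmoid(x) = 1 / (1 + e^(-x)), and its derivative sigmoid(x)·(1 - sigmoid(x)) never exceeds 0.25, which is the mechanical reason the gradients shrink as they are propagated back through saturated layers. A quick NumPy illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 1001)
grad = sigmoid(x) * (1.0 - sigmoid(x))   # derivative of the sigmoid
print(grad.max())                        # 0.25, reached at x = 0
print(grad[0], grad[-1])                 # ~4.5e-05 at both tails
```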

Fig.2 Sigmoid activation

Hard Sigmoid

This function is a piece-wise linear approximation of the sigmoid. It is equal to 0 on the range (-Inf; -2.5), then increases linearly from 0 to 1 on the range [-2.5; +2.5], and stays equal to 1 on the range (+2.5; +Inf) (Source). Computing the Hard Sigmoid is considered faster than computing the regular Sigmoid, because there is no exponent to calculate, and it provides reasonable results on classification tasks. But precisely because it is an approximation, it shouldn’t be used for regression tasks, as the error will be much higher than with the regular sigmoid.
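Written out, the segments above correspond to hard_sigmoid(x) = max(0, min(1, 0.2·x + 0.5)), i.e. a line with slope 0.2 clipped to [0; 1]. A plain NumPy version, just for illustration:

```python
import numpy as np

def hard_sigmoid(x):
    # Linear segment with slope 0.2 between -2.5 and +2.5,
    # clipped to 0 below and to 1 above -- no exponent to evaluate.
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

print(hard_sigmoid(np.array([-3.0, -2.5, 0.0, 2.5, 3.0])))
# [0.  0.  0.5 1.  1. ]
```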

Fig.3 Hard Sigmoid activation

Hyperbolic Tangent (TanH)

TanH looks much like Sigmoid’s S-shaped curve (in fact, it is just a scaled sigmoid), but its range is (-1; +1). It was quite popular before the advent of more sophisticated activation functions. Briefly, the benefits of using TanH instead of Sigmoid are (Source):

  1. Stronger gradients: if the data is centered around 0, the derivatives are higher.
  2. Avoiding bias in the gradients, thanks to the inclusion of the negative range (-1; 0).

However, similar to Sigmoid, TanH is also susceptible to the Vanishing gradient problem.
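The “scaled sigmoid” remark above can be made precise: tanh(x) = 2·sigmoid(2x) − 1, i.e. the sigmoid stretched to the range (-1; +1) and steepened around zero. A quick numerical check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)
# tanh(x) == 2 * sigmoid(2x) - 1 up to floating-point error
assert np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)
```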

Fig. 4 Hyperbolic Tangent (TanH) activation

SoftSign

It works as a continuous approximation of the sign function, and its graph looks very similar to TanH’s. However, TanH approaches its asymptotes exponentially, whereas SoftSign does so only polynomially, so it saturates much more gently. The range of SoftSign is also (-1; +1).
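The function itself is softsign(x) = x / (1 + |x|); because the denominator grows only linearly, it creeps towards its ±1 asymptotes far more slowly than TanH. For illustration:

```python
import numpy as np

def softsign(x):
    return x / (1.0 + np.abs(x))

x = np.array([1.0, 5.0, 50.0])
print(softsign(x))   # [0.5    0.8333 0.9804] -- slow approach to +1
print(np.tanh(x))    # [0.7616 0.9999 1.    ] -- saturates almost immediately
```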

Fig.5 SoftSign activation

Rectified Linear Unit (ReLU)

A very simple yet powerful activation function, which outputs the input if it is positive, and 0 otherwise. It is claimed to be the most popular activation function for training neural networks today, and to yield better results than Sigmoid and TanH. This type of activation function is not susceptible to the vanishing gradient problem, but it may suffer from the “dying ReLU” problem. As stated in Wikipedia: “ReLU neurons can sometimes be pushed into states in which they become inactive for essentially all inputs. In this state, no gradients flow backward through the neuron, and so the neuron becomes stuck in a perpetually inactive state and “dies.” In some cases, large numbers of neurons in a network can become stuck in dead states, effectively decreasing the model capacity. This problem typically arises when the learning rate is set too high.” Here’s a very good description of this issue: https://www.quora.com/What-is-the-dying-ReLU-problem-in-neural-networks.

This activation function has a parameter alpha, which controls the slope of the line for x < 0 and is set to 0.0 by default. Setting this parameter to a value between 0 and 1 transforms this activation into Leaky ReLU, and setting it to 1.0 makes the function work as the Linear activation. What happens when alpha is > 1.0 will be interesting to investigate.
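The three regimes can be seen with a plain NumPy version of this parameterised ReLU (written out here only for illustration; in Keras the same alpha is an argument of the built-in relu activation):

```python
import numpy as np

def parameterised_relu(x, alpha=0.0):
    # alpha is the slope of the negative part: 0.0 gives plain ReLU,
    # 0 < alpha < 1 gives Leaky ReLU, alpha = 1.0 gives the identity (Linear).
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(parameterised_relu(x, alpha=0.0))  # negative inputs are zeroed out
print(parameterised_relu(x, alpha=0.3))  # [-0.6 -0.3  0.   1.   2. ]
print(parameterised_relu(x, alpha=1.0))  # [-2. -1.  0.  1.  2.] == x
```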

Fig.6 Rectified Linear Unit (ReLU) activation

Leaky ReLU

A variation of the ReLU function, which allows a small ‘leakage’ of the gradient (scaled by alpha) for inputs < 0, and this helps to overcome the dying ReLU problem. By default in Keras, alpha is set to 0.3.
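Because of its parameter, Keras exposes Leaky ReLU as a separate layer rather than an activation string, so it is typically placed right after a Dense layer with no activation of its own. A sketch, assuming the Keras 2 API and placeholder layer sizes:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(512, input_shape=(784,)),  # Dense layer with no activation of its own
    layers.LeakyReLU(alpha=0.3),            # 0.3 is the Keras default leakage
    layers.Dense(10, activation='softmax'),
])
```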

Fig.7 Leaky ReLU activation with various alpha parameters

Thresholded ReLU

Another variant of the ReLU activation, where the output is 0 for x < theta and equals x for x >= theta. In Keras the default value of theta is 1.0, whereas the original paper reports that a value of 0.7 provided the best results for their particular experiment.
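As with Leaky ReLU, the thresholded variant is used as a separate layer carrying its parameter (again a sketch under the Keras 2 API, with placeholder layer sizes):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(512, input_shape=(784,)),
    layers.ThresholdedReLU(theta=1.0),      # Keras default; the paper reports 0.7 working best
    layers.Dense(10, activation='softmax'),
])
```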

Fig.8 Thresholded ReLU activation

Exponential Linear Unit (ELU)

A less widely used modification of ReLU, which is said to lead to higher classification results than the traditional ReLU. It follows the same rule as ReLU for x >= 0, and for x < 0 it rises exponentially from -alpha towards 0. ELU tries to push the mean activations closer to zero, which speeds up training.

It has just one parameter, alpha, which controls the scale of the negative part and is set to 1.0 by default.
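The definition is elu(x) = x for x >= 0 and alpha·(e^x − 1) for x < 0, so the negative part saturates smoothly at -alpha instead of being cut to zero. In NumPy:

```python
import numpy as np

def elu(x, alpha=1.0):
    # Identity for x >= 0; saturates towards -alpha for large negative x.
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-10.0, -1.0, 0.0, 2.0])))
# [-0.99995 -0.63212  0.       2.     ] -- the negative tail flattens at -1
```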

Fig.9 Exponential Linear Unit (ELU) activation

Scaled Exponential Linear Unit (SELU)

The last rectifier we will evaluate in our experiment. It extends ELU with a parameter lambda, responsible for scaling both the positive and the negative parts. Alpha and lambda are hardcoded in this function and are roughly equal to 1.67 and 1.05 respectively, as proposed in the paper by Klambauer et al. The authors also say that this activation should be used together with the “lecun_normal” initializer and AlphaDropout, but for the sake of comparability with the other activations, in this part we will use the default initializer and regular Dropout. We will check the proposed initializer and dropout for SELU later in the series.
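For reference, the combination recommended in the paper (and deliberately not used in this part, to keep the comparison uniform) would look roughly like this; the layer sizes and dropout rate are placeholders:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Paper-recommended combination: selu + lecun_normal init + AlphaDropout.
model = keras.Sequential([
    layers.Dense(512, activation='selu',
                 kernel_initializer='lecun_normal', input_shape=(784,)),
    layers.AlphaDropout(0.1),               # dropout rate chosen here purely for illustration
    layers.Dense(10, activation='softmax'),
])
```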

Fig.10 Scaled Exponential Linear Unit (SELU) activation

SoftPlus

The SoftPlus function’s graph looks like a smoothed ReLU. I couldn’t find convincing reasons why SoftPlus should be preferred over any other activation function, and this is supported by the statement from the Deep Sparse Rectifier Neural Networks paper by Glorot, Bordes and Bengio: “Despite the hard threshold at 0, networks trained with the rectifier activation function can find local minima of greater or equal quality than those obtained with its smooth counterpart, the softplus.”
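Concretely, softplus(x) = ln(1 + e^x); its derivative is exactly the sigmoid, and away from zero it quickly converges to ReLU:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))   # ln(1 + e^x), a smooth version of max(0, x)

x = np.array([-10.0, 0.0, 10.0])
print(softplus(x))               # [~0.0000454  0.6931  ~10.0000454]
print(np.maximum(x, 0.0))        # [ 0.      0.     10.    ] -- ReLU for comparison
```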

Fig.11 SoftPlus activation

Just for your convenience, here are all the activation functions combined in two graphs:

Fig.12 Comparison of various activation functions

Experiment Results

When comparing the activation functions we will consider the same indicators as in the previous experiment. After training our network with the RMSProp optimizer and each of the activation functions 5 times, here is what we get for the normalized dataset (sorted by Overall Max Achieved Validation Accuracy):

There are clear underperformers here: Linear activation, Thresholded ReLU with the default theta value, and Leaky ReLU with a very large alpha. Linear activation wasn’t able to pass the 93% accuracy level in any of the four measures. Thresholded ReLU has the lowest accuracy value at the final epoch and one of the lowest maximum achieved accuracies, which means there is overfitting. The widely used ReLU demonstrated average results. The clear winners here are the ELU and SELU activations.

And this is the same table for the sample-wise standardized data:

Here the rankings are basically the same, with some minor movements in the middle of the table. However, in general the performance of the networks trained with standardized data is slightly worse. The only exception is Thresholded ReLU, where the results have improved significantly.

Now let’s compare the two data transformation ways more closely:

(Sorry for a very wide table. Scroll right to see the rest of it.)

On average, with normalized data you will be able to achieve slightly better results. There are a few activations which (if for whatever reason you decide to use them) perform better on standardized data. However, there is one interesting thing to note here: for a significant number of the activations, the maximum accuracy value is reached earlier with standardized data than with normalized data. So if in your particular experiment you can sacrifice some accuracy in order to reach the best results faster, standardized data is the way to go.
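As a reminder of what the two transforms mean here, this is a sketch of the conventional definitions (the exact preprocessing code used in the experiments is in the repository linked at the end): dataset-wise normalization rescales all samples by the global min/max of the training set, while sample-wise standardization shifts and scales each sample by its own mean and standard deviation.

```python
import numpy as np

def normalize_datasetwise(x_train, x_test):
    # Rescale both splits by the global min/max of the training data.
    lo, hi = x_train.min(), x_train.max()
    return (x_train - lo) / (hi - lo), (x_test - lo) / (hi - lo)

def standardize_samplewise(x):
    # Shift and scale each individual sample by its own mean and std.
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True) + 1e-7   # guard against division by zero
    return (x - mean) / std
```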

Now let’s take a closer look at the details of training the network with each activation function, so you can clearly see the difference in their training behaviour. (Beware, lots of graphs!)

Linear

Linear activation has shown the worst results. As we can see from the image below, training has been very unstable, especially on the standardized data. The vertical lines, which show the moment when the maximum validation accuracy was achieved in each experiment, are spread across the entire x-axis. This means that after a certain point the optimizer cannot find a better local minimum and keeps jumping back and forth. This can be mitigated by reducing the learning rate (which we will explore later in the series), but it is also a consequence of the linearity of the model, which cannot capture overly complex dependencies.

Fig.13 Validation accuracy for models trained with Linear activation

Sigmoid

Sigmoid activation results in a much more stable model than the Linear one, and the maximum validation accuracy values are achieved closer to the end of training, but the validation accuracy itself is average.

Fig.14 Validation accuracy for models trained with Sigmoid activation

Hard Sigmoid

Very similar to the plain Sigmoid: it has a lower final average value and a lower average maximum, but the maximum achieved validation accuracy is exactly the same as for Sigmoid. So for this particular setting we can say that Hard Sigmoid performs worse than Sigmoid. The maxima on the normalized data are closer to the end of training, which suggests that it may achieve better results if we adjust the learning rate.

Fig.15 Validation accuracy for models trained with Hard Sigmoid activation

Hyperbolic Tangent (TanH)

Despite having roughly the same maximum achieved validation accuracy as Sigmoid, TanH is a bit less stable. The majority of the local minima were reached closer to the middle of training, and the optimizer wasn’t able to improve the results further. A model with this activation function may also benefit from a lower learning rate. It is also interesting to note that, although TanH is perceived as more advanced than Sigmoid and is used much more often nowadays, the latter may still be more applicable in certain network set-ups and tasks.

Fig.16 Validation accuracy for models trained with TanH activation

SoftSign

On normalized data all lines follow the average very closely, but the results of the training are also somewhat average. With standardized data SoftSign is much less stable, despite demonstrating slightly higher final accuracy.

Fig.17 Validation accuracy for models trained with SoftSign activation

Rectified Linear Unit (ReLU)

This is the first time we see overfitting in our experiment. As we can see, the models reach their peak performance between the 10th and the 40th epoch and then start to slowly decline. The maximum achieved validation accuracy is identical to that of Sigmoid on normalized data, and lower on standardized data. So without further fine-tuning, Sigmoid beats ReLU here.

Fig.18 Validation accuracy for models trained with ReLU activation

Leaky ReLU

Alpha = 0.3 (default)

Leaky ReLU has shown worse performance than its traditional variant, ReLU. Both the maximum validation accuracy and the accuracy at the last epoch are lower than those of ReLU, which means that even with overfitting, ReLU is preferable in our case.

Fig.19 Validation accuracy for models trained with Leaky ReLU activation with alpha = 0.3

Alpha = 0.01

Reducing the ‘leakage’ parameter alpha helped the model to significantly improve the results both on normalized and standardized data.

Fig.20 Validation accuracy for models trained with Leaky ReLU activation with alpha = 0.01

Alpha = 1.5

Setting alpha to a relatively large value resulted in one of the worst performances in our experiment. The training was highly unstable, and the accuracy was very low. So don’t do that.

Fig.21 Validation accuracy for models trained with Leaky ReLU activation with alpha = 1.5

Thresholded ReLU

Theta = 0.7

This is a very interesting case. The models trained with Thresholded ReLU on normalized data quickly reached their maximum values and then started to decline, so it is a clear case of overfitting, combined with very poor overall performance. On standardized data, by contrast, there was no overfitting at all, although the models still underperformed compared to other activations. We see that the theta value proposed in the original paper does not work well for normalized data, and this is probably the best demonstration of how differently the models may perform under the two data transformation approaches.

Fig.22 Validation accuracy for models trained with Thresholded ReLU activation with theta = 0.7

Theta = 1.0 (default)

This theta value has resulted in even worse performance, which means that you should use the default values built into deep learning frameworks with caution, and always check whether changing them leads to better results.

Fig.23 Validation accuracy for models trained with Thresholded ReLU activation with theta = 1.0

Theta = data mean

A significant improvement both in terms of accuracy and overfitting, but still an overall below-average performance. This example demonstrates that tuning the model parameters according to the data you feed in is incredibly important.

Fig.24 Validation accuracy for models trained with Thresholded ReLU activation with theta = data_mean

Exponential Linear Unit (ELU)

Alpha = 1.0 (default)

Quite stable and one of the best in terms of the maximum achieved validation accuracy. The maxima were reached in the middle of training, so it gets to its best results faster, and these results could potentially be improved further with some fine-tuning of the learning rate. This activation performed well on both standardized and normalized data.

Fig.25 Validation accuracy for models trained with ELU activation with alpha = 1.0

Alpha = 0.5

Also very good performance, at least on the normalized data.

Fig.26 Validation accuracy for models trained with ELU activation with alpha = 0.5

Alpha = 1.5

ELU with alpha = 1.5 was among the leaders across all activation functions on the standardized data, performing almost as well as SELU. The maxima on the normalized data are very close to the end of training, so with further training it could probably have achieved even better results.

Fig.27 Validation accuracy for models trained with ELU activation with alpha = 1.5

Scaled Exponential Linear Unit (SELU)

The second-best activation function: quite stable and demonstrating high performance, though slightly unstable on standardized data. Later we will check whether it is possible to improve the results by reducing the learning rate and tuning the dropout in order to stabilize the training and achieve higher accuracy.

Fig.28 Validation accuracy for models trained with SELU activation

SoftPlus

This function came third in terms of maximum achieved validation accuracy (after SELU and ELU) on the normalized data, but with a large gap separating it from the leaders. On standardized data there was overfitting and below-average performance.

If we compare SoftPlus with the results of ReLU, we can see that the statement about the “greater or equal quality” of ReLU compared to SoftPlus has not been confirmed in our particular setting. This supports the widely accepted idea that benchmarking neural network components is difficult and leads to contradictory results across different network set-ups.

Fig.29 Validation accuracy for models trained with SoftPlus activation

Summary

The example of SoftPlus beating ReLU, contrary to what the fathers of Deep Learning said in their paper, means that the rankings of the activation functions obtained in this experiment apply only to the specific configuration of the neural network we are considering, and in general do not tell you that one activation function is better than another. But at least for a 3-layer fully-connected network trained with the RMSProp optimizer on this image classification task, we can say that by using the Exponential Linear Unit activation or its Scaled variant you will be able to achieve better results than with the other activations.

To sum up, the main learning points from this experiment are that for a similar task and neural network set-up you should:

  • Normalize the data in order to achieve higher validation accuracy, and standardize if you need the results faster;
  • Use ELU or SELU activations;
  • Try tuning the activation function parameters to see if they can produce better results;
  • And yes, never use Linear activation.

You can find the code for the experiments and for visualizing the results on my GitHub. In the next part we will expand our experiment and test other optimizers in the same way, to see how they perform in combination with the same set of activation functions.

I’m always happy to meet new people and share ideas, so if you liked the article, consider adding me on LinkedIn.

Deep study of a not very deep neural network series:
