Exploring Activation Functions for Neural Networks

Dima Shulga
Towards Data Science
8 min read · Jun 25, 2017


In this post, I want to give more attention to the activation functions we use in Neural Networks. To do that, I'll solve the MNIST problem using a simple fully connected Neural Network with different activation functions.

The MNIST data is a set of ~70,000 images of handwritten digits; each image is 28x28 pixels in grayscale. Once each image is flattened into a 784-dimensional vector (28 x 28 = 784) and the labels are one-hot encoded over the 10 digits, our input data has shape (70000, 784) and our output has shape (70000, 10).
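For reference, loading and reshaping the data with Keras can look roughly like this (just a sketch; the exact preprocessing lives in the notebook mentioned below):

from keras.datasets import mnist
from keras.utils import to_categorical

# Load MNIST: 60,000 training and 10,000 test images of shape 28x28
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Flatten each 28x28 image into a 784-dimensional vector and scale to [0, 1]
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255

# One-hot encode the labels (10 classes, one per digit)
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)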

I will use a basic fully connected Neural Network with a single hidden layer. It looks something like this:

There are 784 neurons in the input layer (one for each pixel in the image), 512 neurons in the hidden layer, and 10 neurons in the output layer (one for each digit).

In Keras, we can use a different activation function for each layer. That means that in our case we have to decide which activation function to use in the hidden layer and which to use in the output layer. In this post, I will experiment only with the hidden layer's activation, but the same reasoning is relevant to the output layer as well.

There are many activation functions; I'll go over only the basic ones: sigmoid, tanh and ReLU.

First, let's try not to use any activation function at all. What do you think will happen? Here's the code (I'm skipping the data loading part; you can find the whole code in this notebook):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(512, input_shape=(784,)))   # hidden layer, no activation (linear)
model.add(Dense(10, activation='softmax'))  # output layer, one neuron per digit

As I said: 784 inputs, 512 neurons in the hidden layer and 10 neurons in the output layer. Before training, we can look at the network architecture and parameters using model.summary() and model.layers:
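The "Layers (input ==> output)" listing below is not a built-in Keras printout; a small helper along these lines produces it (a sketch):

# Print each layer's input and output shape, then the built-in summary
print('Layers (input ==> output)')
print('--------------------------')
for layer in model.layers:
    print(layer.name, layer.input_shape, '==>', layer.output_shape)

print('Summary')
print(model.summary())  # summary() prints its table and returns None,
                        # which is where the trailing "None" below comes from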

Layers (input ==> output)
--------------------------
dense_1 (None, 784) ==> (None, 512)
dense_2 (None, 512) ==> (None, 10)
Summary
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 512) 401920
_________________________________________________________________
output (Dense) (None, 10) 5130
=================================================================
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________
None

OK, now that we're sure about our network's architecture, let's train it for 5 epochs:
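Training is the usual compile-and-fit pair, roughly like this (the optimizer and batch size here are guesses; the notebook has the exact settings):

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',   # assumption: any standard optimizer will do
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=128,  # assumption
                    epochs=5,
                    validation_data=(x_test, y_test))

score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])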

Train on 60000 samples, validate on 10000 samples
Epoch 1/5
60000/60000 [==============================] - 3s - loss: 0.3813 - acc: 0.8901 - val_loss: 0.2985 - val_acc: 0.9178
Epoch 2/5
60000/60000 [==============================] - 3s - loss: 0.3100 - acc: 0.9132 - val_loss: 0.2977 - val_acc: 0.9196
Epoch 3/5
60000/60000 [==============================] - 3s - loss: 0.2965 - acc: 0.9172 - val_loss: 0.2955 - val_acc: 0.9186
Epoch 4/5
60000/60000 [==============================] - 3s - loss: 0.2873 - acc: 0.9209 - val_loss: 0.2857 - val_acc: 0.9245
Epoch 5/5
60000/60000 [==============================] - 3s - loss: 0.2829 - acc: 0.9214 - val_loss: 0.2982 - val_acc: 0.9185
Test loss: 0.299
Test accuracy: 0.918

We're getting not-so-great results; 91.8% accuracy on the MNIST dataset is pretty bad. Of course, you could say that we need much more than 5 epochs, but let's plot the losses first:
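The loss curves can be plotted from the History object that fit returns; a minimal sketch:

import matplotlib.pyplot as plt

# history is the object returned by model.fit(...)
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()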

You can see that the validation loss is not improving, and I can assure you that even after 100 epochs it won't. We could try different techniques to prevent overfitting, or make our network bigger and smarter so it learns better, but let's just try using the sigmoid activation function instead.

The sigmoid function, σ(x) = 1 / (1 + e^(-x)), looks like this:

It squashes values into the (0, 1) interval and it's non-linear. Let's use it in our network:

model = Sequential()
model.add(Dense(512, activation='sigmoid', input_shape=(784,)))
model.add(Dense(10, activation='softmax'))

You can see that the architecture is exactly the same; we changed only the activation function of the hidden Dense layer. Let's train again for 5 epochs:

Train on 60000 samples, validate on 10000 samples
Epoch 1/5
60000/60000 [==============================] - 3s - loss: 0.4224 - acc: 0.8864 - val_loss: 0.2617 - val_acc: 0.9237
Epoch 2/5
60000/60000 [==============================] - 3s - loss: 0.2359 - acc: 0.9310 - val_loss: 0.1989 - val_acc: 0.9409
Epoch 3/5
60000/60000 [==============================] - 3s - loss: 0.1785 - acc: 0.9477 - val_loss: 0.1501 - val_acc: 0.9550
Epoch 4/5
60000/60000 [==============================] - 3s - loss: 0.1379 - acc: 0.9598 - val_loss: 0.1272 - val_acc: 0.9629
Epoch 5/5
60000/60000 [==============================] - 3s - loss: 0.1116 - acc: 0.9673 - val_loss: 0.1131 - val_acc: 0.9668

Test loss: 0.113
Test accuracy: 0.967

That's much better. In order to understand why, let's recall what a single neuron without an activation computes: output = w * x + b

Here x is the input, w are the weights and b is the bias. You can see that this is just a linear combination of the input with the weights and the bias. Even after stacking many of those, we can still represent the whole thing as a single linear equation. That means such a network is equivalent to one without any hidden layers at all, and this is true for any number of hidden layers (!!). Let's add some layers to our first network and see what happens. It looks like this:

model = Sequential()
model.add(Dense(512, input_shape=(784,)))
for i in range(5):
    model.add(Dense(512))                   # more hidden layers, still no activation
model.add(Dense(10, activation='softmax'))

Here's how the network looks:

dense_1 (None, 784) ==> (None, 512)
dense_2 (None, 512) ==> (None, 512)
dense_3 (None, 512) ==> (None, 512)
dense_4 (None, 512) ==> (None, 512)
dense_5 (None, 512) ==> (None, 512)
dense_6 (None, 512) ==> (None, 10)

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 512) 401920
_________________________________________________________________
dense_2 (Dense) (None, 512) 262656
_________________________________________________________________
dense_3 (Dense) (None, 512) 262656
_________________________________________________________________
dense_4 (Dense) (None, 512) 262656
_________________________________________________________________
dense_5 (Dense) (None, 512) 262656
_________________________________________________________________
dense_16 (Dense) (None, 10) 5130
=================================================================
Total params: 1,720,330
Trainable params: 1,720,330
Non-trainable params: 0
_________________________________________________________________
None

These are the results after training for 5 epochs:

Train on 60000 samples, validate on 10000 samples
Epoch 1/5
60000/60000 [==============================] - 17s - loss: 1.3217 - acc: 0.7310 - val_loss: 0.7553 - val_acc: 0.7928
Epoch 2/5
60000/60000 [==============================] - 16s - loss: 0.5304 - acc: 0.8425 - val_loss: 0.4121 - val_acc: 0.8787
Epoch 3/5
60000/60000 [==============================] - 15s - loss: 0.4325 - acc: 0.8724 - val_loss: 0.3683 - val_acc: 0.9005
Epoch 4/5
60000/60000 [==============================] - 16s - loss: 0.3936 - acc: 0.8852 - val_loss: 0.3638 - val_acc: 0.8953
Epoch 5/5
60000/60000 [==============================] - 16s - loss: 0.3712 - acc: 0.8945 - val_loss: 0.4163 - val_acc: 0.8767

Test loss: 0.416
Test accuracy: 0.877

This is very bad. We can see that the network is unable to learn what we want. That's because without non-linearity our network is just a linear classifier, and it cannot capture non-linear relationships.
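We can check the "stacked linear layers collapse into a single one" claim directly with a bit of NumPy (this is just an illustration, separate from the Keras model):

import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(784)

# Two linear layers: h = W1 x + b1, y = W2 h + b2
W1, b1 = rng.randn(512, 784), rng.randn(512)
W2, b2 = rng.randn(10, 512), rng.randn(10)

y_stacked = W2.dot(W1.dot(x) + b1) + b2

# The same mapping as a single linear layer: y = (W2 W1) x + (W2 b1 + b2)
W, b = W2.dot(W1), W2.dot(b1) + b2
y_single = W.dot(x) + b

print(np.allclose(y_stacked, y_single))  # True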

On the other hand, sigmoid is a non-linear function, and we can't represent it as a linear combination of its input. That's what brings non-linearity to our network and enables it to learn non-linear relationships. Let's train the deeper network again, this time with sigmoid activations:

Train on 60000 samples, validate on 10000 samples
Epoch 1/5
60000/60000 [==============================] - 16s - loss: 0.8012 - acc: 0.7228 - val_loss: 0.3798 - val_acc: 0.8949
Epoch 2/5
60000/60000 [==============================] - 15s - loss: 0.3078 - acc: 0.9131 - val_loss: 0.2642 - val_acc: 0.9264
Epoch 3/5
60000/60000 [==============================] - 15s - loss: 0.2031 - acc: 0.9419 - val_loss: 0.2095 - val_acc: 0.9408
Epoch 4/5
60000/60000 [==============================] - 15s - loss: 0.1545 - acc: 0.9544 - val_loss: 0.2434 - val_acc: 0.9282
Epoch 5/5
60000/60000 [==============================] - 15s - loss: 0.1236 - acc: 0.9633 - val_loss: 0.1504 - val_acc: 0.9548

Test loss: 0.15
Test accuracy: 0.955

Again, that's much better. We're probably overfitting, but we got a significant boost in performance just by adding an activation function.

Sigmoid is great: it has many positive properties like non-linearity and differentiability, and its (0, 1) range can be read as a probability, which is nice. But it has its drawbacks. When we use back-propagation, we must propagate gradients from the output back to our first weights; in other words, we want to pass the classification/regression error at the final output back through the whole network. That means we have to differentiate each layer and update the weights accordingly. The problem with sigmoid is its derivative, σ'(x) = σ(x) * (1 - σ(x)), which looks like this:

You can see that the max value of the derivative is pretty small (0.25, reached at x = 0), which means we pass only a small fraction of our error back to the previous layers. That may cause our network to learn slowly (by slowly I mean we'll need more data or more epochs, not more computation time per step).
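Here's a quick NumPy check of that number (again, just an illustration):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 1000)
d_sigmoid = sigmoid(x) * (1 - sigmoid(x))  # derivative of the sigmoid
print(d_sigmoid.max())                     # ~0.25, reached around x = 0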

To solve that, we can use the tanh function, tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)), which looks like this:

The tanh function is also non-linear and differentiable. Its output is in the (-1, 1) range, which is not as nice as the (0, 1) range for reading off probabilities, but it's still fine for the hidden layers. And finally, its maximum derivative is 1 (the derivative is 1 - tanh^2(x), which peaks at x = 0), which is great because now we can pass our error through the layers much better.
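And the same kind of quick check for tanh:

import numpy as np

x = np.linspace(-10, 10, 1000)
d_tanh = 1 - np.tanh(x) ** 2  # derivative of tanh
print(d_tanh.max())           # ~1.0, reached around x = 0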

To use the tanh activation function, we just need to change the activation attribute of the Dense layer:

model = Sequential()
model.add(Dense(512, activation='tanh', input_shape=(784,)))
model.add(Dense(10, activation='softmax'))

Again, the network architecture is the same, only the activation is different. Let’s train for 5 epochs:

Train on 60000 samples, validate on 10000 samples
Epoch 1/5
60000/60000 [==============================] - 5s - loss: 0.3333 - acc: 0.9006 - val_loss: 0.2106 - val_acc: 0.9383
Epoch 2/5
60000/60000 [==============================] - 3s - loss: 0.1754 - acc: 0.9489 - val_loss: 0.1485 - val_acc: 0.9567
Epoch 3/5
60000/60000 [==============================] - 3s - loss: 0.1165 - acc: 0.9657 - val_loss: 0.1082 - val_acc: 0.9670
Epoch 4/5
60000/60000 [==============================] - 3s - loss: 0.0843 - acc: 0.9750 - val_loss: 0.0920 - val_acc: 0.9717
Epoch 5/5
60000/60000 [==============================] - 3s - loss: 0.0653 - acc: 0.9806 - val_loss: 0.0730 - val_acc: 0.9782

Test loss: 0.073
Test accuracy: 0.978

Great! We improved our test accuracy by more than 1% just by using a different activation function.

Can we do better? It appears that in most cases we can, by using the ReLU activation function. ReLU is simply relu(x) = max(0, x), and it looks like this:

The range of this activation function is [0, inf), and it's not differentiable at zero (there are practical workarounds for this). The nicest thing about ReLU is that its gradient is exactly 1 for every positive input (and 0 for negative inputs), so for the neurons that are active we pass the error through the network undiminished during back-propagation.
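The model itself is presumably the same two-layer network as before with relu as the hidden activation; something like:

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dense(10, activation='softmax'))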

Let’s train and see the results:

Train on 60000 samples, validate on 10000 samples
Epoch 1/5
60000/60000 [==============================] - 5s - loss: 0.2553 - acc: 0.9263 - val_loss: 0.1505 - val_acc: 0.9516
Epoch 2/5
60000/60000 [==============================] - 3s - loss: 0.1041 - acc: 0.9693 - val_loss: 0.0920 - val_acc: 0.9719
Epoch 3/5
60000/60000 [==============================] - 3s - loss: 0.0690 - acc: 0.9790 - val_loss: 0.0833 - val_acc: 0.9744
Epoch 4/5
60000/60000 [==============================] - 4s - loss: 0.0493 - acc: 0.9844 - val_loss: 0.0715 - val_acc: 0.9781
Epoch 5/5
60000/60000 [==============================] - 3s - loss: 0.0376 - acc: 0.9885 - val_loss: 0.0645 - val_acc: 0.9823

Test loss: 0.064
Test accuracy: 0.982

We got the best result so far: 98.2%. That's not a bad result, and we did it using only a single hidden layer.

It is important to note that there's no single best activation function. One may be better than another in many cases, but worse in others.

Another important note is that, at least on this task, the choice of activation function didn't change what our network could eventually learn, only how fast it learned it (that is, how much data and how many epochs it needed). Here's a plot of all of the activation functions we tried, this time trained for a much longer period. You can see that sigmoid, tanh and ReLU all reach roughly 98% accuracy eventually.
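A sketch of how such a longer comparison can be run (the epoch count and the history key are assumptions; older Keras logs 'val_acc', newer versions log 'val_accuracy'):

from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt

activations = ['sigmoid', 'tanh', 'relu']
histories = {}

for act in activations:
    model = Sequential()
    model.add(Dense(512, activation=act, input_shape=(784,)))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy'])
    histories[act] = model.fit(x_train, y_train, epochs=50, batch_size=128,  # epochs: assumption
                               validation_data=(x_test, y_test), verbose=0)

for act, h in histories.items():
    plt.plot(h.history['val_acc'], label=act)  # 'val_accuracy' in newer Keras
plt.xlabel('epoch')
plt.ylabel('validation accuracy')
plt.legend()
plt.show()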

Hope you enjoyed the post. You can find the code here.
