I recently wrote a short article about how artificial neurons work, and some people have asked me if I could do something similar about activation functions. So here it goes, my two cents on this topic!
What are activation functions?

If you have read my previous article, you may remember how a biological neuron charges up with electricity before it passes its signal to the next neuron. Each of our biological cells has a specific threshold that determines whether the neuron "fires" and transfers its signal to the next neuron. Within an artificial neuron, we replicate this behaviour by implementing an activation function that holds a specific mathematical threshold. If the sum of our weighted inputs exceeds the activation function's threshold, the signal is passed to the next neuron. In this case, the artificial neuron is considered "activated".
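To make this a bit more concrete, here is a minimal sketch of such a threshold neuron in Python (the inputs, weights and the threshold value of 0.5 are made up purely for illustration):

```python
import numpy as np

def artificial_neuron(inputs, weights, threshold=0.5):
    """Fires (returns 1) if the weighted sum of the inputs exceeds the threshold."""
    weighted_sum = np.dot(inputs, weights)
    return 1 if weighted_sum > threshold else 0

# Two example inputs with the same (made-up) weights:
print(artificial_neuron([0.9, 0.3], [0.6, 0.4]))  # 0.66 > 0.5 -> 1 ("activated")
print(artificial_neuron([0.2, 0.1], [0.6, 0.4]))  # 0.16 < 0.5 -> 0 (not activated)
```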
Why are activation functions so important?
I said we would use as little math as possible. So instead of giving you some fancy mathematical equations, let us conduct a short thought experiment:
You come home from a long day at work and your significant other has prepared your favorite dish. The aroma is fantastic! It looks great and omg – it tastes like heaven! You dig right in, eat more than you should, and you are just happy that you have such a wonderful life.
So what is happening in your brain at this moment? Right, the neurons in charge of identifying this dish are firing like crazy. If we tried to replicate this behaviour within an artificial neural network, we could set the neurons' activation functions so that they would fire every time you smell, see or taste this dish. This behaviour would be step-wise, which means the artificial neuron would either fire or not fire.
But is that right? Do you really always feel the same about your favorite dish? Do you want to dig into it at any given moment in your life? Let us go a little further in our thought experiment:
You just ate a load of food because one of your co-workers became a dad and sponsored pizza and drinks for everyone. You come home and your significant other presents you with your favorite dish. You appreciate the gesture and you love this dish, but you simply can't eat any more.
Does that mean you don't like your favorite dish? Of course not, it still smells great and you know you love it, but your stomach tells you that you can't eat any more. So your feelings in this moment are a little more complex than just loving or not loving it, right?
So we would have to adjust the activation functions in a way that they can replicate this behaviour. We could make the activation functions linear. In this case, our artificial neural network would be able to replicate that you love your dish a little bit, but not that much, because you are currently full.
Now, linear functions bear a mathematical issue. If you apply the outcome of one linear function to another linear function, the output of that operation is still linear. In other words, the outcome will always grow in the same proportion. In terms of our little example, this means that you would always feel exactly the same "little bit" of excitement about your dish, even when you are full. Does that sound correct? Let us take one last leap in our thought experiment:
You ate a lot of pizza at your coworker’s celebration party but you have not eaten your favorite dish for over a month. You are full but you are super excited about the fact that you get your favorite dish from the person who matters most to you!
Do you feel only a little bit excited about your dish because you are full, or do you feel more excited about it? Right, you love it more in this scenario than in the one before. We can conclude that our feelings towards our favorite dish are not proportional. As a result, a linear activation function is no longer a suitable choice – we need to switch to non-linear activation functions.
So why are activation functions so important? Without them, an artificial neuron's output would simply equal the sum of its weighted inputs. The signal would always be passed to the next neuron and all neurons would be activated all the time. By using activation functions, we can introduce thresholds into our artificial neurons that prevent a neuron from being activated if its information can be considered irrelevant. Activation functions also enable an artificial neuron to solve more complex problems by introducing non-linearity.
What do common activation functions look like?
This section will give you a more detailed explanation of different activation functions, their use cases and limitations.
Step-wise
You probably guessed it already – the simplest types of activation functions are the so-called step functions.

Step functions add a threshold to the activation (which is why they are also called "threshold functions"). If the weighted sum of inputs is above a certain value, the neuron is declared activated. If the sum is below that value, it is not. These functions are thus limited to binary classification problems (yes or no) and cannot replicate more complex scenarios.¹
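A binary step function can be sketched in a couple of lines (the threshold of 0 and the sample inputs are arbitrary choices for illustration):

```python
import numpy as np

def step(x, threshold=0.0):
    """Binary step activation: 1 if the input exceeds the threshold, else 0."""
    return np.where(x > threshold, 1, 0)

print(step(np.array([-2.0, -0.1, 0.3, 5.0])))  # [0 0 1 1] -- it only "fires" or "doesn't fire"
```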
Linear
So what can we do to solve this issue? Correct, we can use a linear function. Linear functions can provide intermediate activation values such as "50% activated" or "20% activated". This way, a network of artificial neurons can distinguish between different outputs and decide upon the highest activation value.¹

However, as mentioned before, if you chain several linear functions together, the result is still a linear function of the original input. That is not necessarily a problem on its own, but artificial neural networks often consist of multiple layers of artificial neurons. Simply put, an artificial neural network learns by calculating the change in error while information passes through the network (also called gradient descent). When you use a linear activation function in such a construct, the change in error will always behave the same way after the information has passed the first layer. Why? Because the outcome always grows in the same proportion, hence the error always changes in the same proportion. As a result, using a linear activation function makes it impossible for a multi-layer network to learn anything new after the first layer.¹
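You can check this collapse numerically. The little sketch below (the weights are just random numbers for illustration, and biases are left out) shows that two stacked linear layers are no more expressive than a single one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                                # a small batch of inputs
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 2))  # two layers of weights

# Two stacked layers with a linear (identity) activation ...
two_layers = (x @ W1) @ W2
# ... are equivalent to a single linear layer with the combined weights W1 @ W2.
one_layer = x @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))  # True -- the extra layer adds nothing
```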
Logistic sigmoid
You are probably guessing it already: to overcome these issues, a non-linear function can be used. One example of a non-linear function is the so-called "logistic sigmoid" function. Due to its non-linear nature, the logistic sigmoid function enables the stacking of an unlimited number of layers within an artificial neural network, which makes it suitable for solving more complex problems. It also keeps the benefit of a linear function by providing intermediate activations. As shown in the figure, the logistic sigmoid function has an "S" shape, which means it gets steeper towards x = 0. Consequently, small changes of x in this region bring about large changes in the value of y, which benefits training algorithms based on gradient descent.¹

The logistic sigmoid function is probably one of the most commonly used activation functions due to these benefits. However, the sigmoid function has the disadvantage of becoming very flat for large positive or negative values of x, where its output saturates towards 1 or 0. This means that the gradient becomes very small once the function falls into one of these regions. As a result, the gradient approaches zero and the network is not able to learn.¹
Another problem of the sigmoid function is that its y values can only be positive because the function is not symmetric around the origin. Consequently, the following neurons can only receive input in the form of positive values, which is (depending on the problem statement) not desirable. If a network were trained on, for example, price data that moves both up and down, it could interpret the negative input prices but would not be able to represent negative values accordingly in its output.¹
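In code, the logistic sigmoid and its gradient look roughly like this (the sample inputs are arbitrary and only meant to show the steep middle, the flat tails and the purely positive output):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: largest around x = 0, tiny for large |x|."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-6.0, 0.0, 6.0])
print(sigmoid(x))       # [~0.002  0.5   ~0.998] -- always positive, saturates at the ends
print(sigmoid_grad(x))  # [~0.002  0.25  ~0.002] -- the gradient vanishes in the flat regions
```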
Tanh
This issue, however, can be addressed by scaling the sigmoid function. This type of activation function is called the tanh function.

The tanh function works like the sigmoid function, but it is symmetric around the origin. As it ranges from -1 to 1, it solves the issue of purely positive outputs. However, even though it carries all the benefits of the sigmoid function, it still suffers from the so-called "vanishing gradient problem" in the flat regions of the curve.²
In short, this describes the problem that the larger the magnitude of an input, the smaller the gradient. The more layers your network has, the more of these small gradients are multiplied together during training, and hence the less the earlier layers learn. Since a small gradient means that a network is not learning much, these kinds of functions are not suitable for networks with a large number of layers.²
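A short sketch (again with arbitrary sample inputs) shows both properties: the symmetric output range and the vanishing gradient in the flat regions of the curve:

```python
import numpy as np

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

print(np.tanh(x))           # ranges from -1 to 1 and is symmetric around the origin
print(1.0 - np.tanh(x)**2)  # gradient: 1.0 at x = 0, almost zero at |x| = 5
```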
ReLu
An alternative to the tanh function is the rectified linear unit or "ReLu".

ReLu functions are non-linear, which makes training more effective and enables multiple layers of neurons to be activated by the function. However, the main advantage of using the ReLu function over others is that it converts all negative inputs into zero. As a result, the network only activates neurons with positive sums, which makes it computationally more efficient. However, ReLu functions suffer from a form of the vanishing gradient problem as well, since the gradient on the negative side of the graph equals zero. Consequently, the affected neurons will not learn if their inputs fall into this region during activation. This phenomenon is also referred to as "dying ReLu".³
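A minimal ReLu sketch (sample inputs chosen for illustration) makes the zeroed-out negative side and its zero gradient visible:

```python
import numpy as np

def relu(x):
    """ReLu: passes positive inputs through unchanged and clamps negative inputs to zero."""
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))                    # [0.  0.  0.  0.5 3. ]
print(np.where(x > 0, 1.0, 0.0))  # gradient: exactly zero on the negative side ("dying ReLu")
```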
Leaky ReLu
The Leaky ReLu function solves the dying ReLu problem by mapping negative inputs to a small linear component of the input instead of zero. Consequently, negative inputs are preserved and the zero-gradient issue is removed, as the gradient on the left side of the graph no longer equals zero.³
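A corresponding sketch (the slope alpha = 0.01 is a common but arbitrary choice) shows how the negative side is kept alive:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLu: keeps a small slope (alpha) for negative inputs instead of clamping them to zero."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))               # [-0.03  -0.005  0.  0.5  3. ]
print(np.where(x > 0, 1.0, 0.01))  # gradient: small but never zero on the negative side
```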

Wrap up time!
Activation functions are used within artificial neurons to replicate the behaviour of a biological neuron by holding a specific mathematical threshold. If the threshold is exceeded, the signal is passed to the next neuron and the artificial neuron is considered "activated".
Activation functions are important as they give an artificial neuron the ability to be activated only if the presented input is meaningful. They also give us the possibility to adjust an artificial neuron's learning ability and add more complexity to the network.
There are many different activation functions, each with its own benefits and limitations. Research suggests that there is no universal activation function that perfectly fits every problem. Today, there are more than 27 different standard activation functions that can be used within artificial neural networks. In general, artificial neurons train better with non-linear activation functions. However, choosing the right activation function highly depends on the mathematical problem you are trying to solve. Sigmoid and tanh are said to work well for classification problems despite their vanishing gradient vulnerability. Neural networks using ReLu or Leaky ReLu functions, on the other hand, have been shown to perform well on many different types of problems. While they can be outperformed by networks using more specialized approaches (e.g. ELUs, SELUs, SoftExponential), they are the most commonly used type of activation function due to their advantages over sigmoid as well as their generally high training efficiency.
If you want to learn more about activation functions, gradient descent, and how artificial neural networks learn, I highly recommend taking a detailed look at this article. But be aware – it involves some math! 😉
Sources
- Ruland. 2004. Einführung in Neuronale Netze, 2–13. Ulm: Universität Ulm.
- Luhaniwal. 2019. "Analyzing different types of activation functions in neural networks – which one to prefer?" Towards Data Science. Last modified May 07, 2019. https://towardsdatascience.com/analyzing-different-types-of-activation-functions-in-neural-networks-which-one-to-prefer-e11649256209
- Gupta. 2020. "Fundamentals of Deep Learning – Activation Functions and When to Use Them?" Analytics Vidhya. Last modified January 30, 2020. https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/