Softmax Activation Function Explained

And implemented from scratch.

Dario Radečić
Towards Data Science
4 min read · Jun 18, 2020


If you’ve done any deep learning, you’ve probably noticed two different types of activation functions — those used on the hidden layers and the one used on the output layer.

Photo by Aaron Burden on Unsplash

The activation function used on the hidden layers is usually the same for all of them. It’s unlikely you’ll see ReLU on the first hidden layer followed by a hyperbolic tangent (tanh) on the second; it’s usually ReLU or tanh all the way through.

But we’re here to talk about the output layer. There we need a function that takes arbitrary real values and transforms them into a probability distribution.

Softmax function to the rescue.

The function is great for classification problems, especially multi-class classification, as it reports back a “confidence score” for each class. Since we’re dealing with probabilities, the scores returned by the softmax function add up to 1.

The predicted class is, therefore, the item in the list where the confidence score is highest.
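As a quick preview of what we’ll build, here’s a minimal NumPy sketch of that idea (the array of logits is made up purely for illustration): the scores sum to 1, and the predicted class is the index of the largest one.

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability;
    # this shifts the inputs but leaves the resulting probabilities unchanged.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw output-layer values (illustrative)
probs = softmax(logits)

print(probs)             # one "confidence score" per class
print(probs.sum())       # the scores add up to 1
print(np.argmax(probs))  # predicted class: index 0, the largest score
```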

We’ll first see how the softmax function is expressed mathematically, and then how easy it is to translate into Python code.

Mathematical representation
