Sigmoid and SoftMax Functions in 5 minutes

The math behind two of the most used activation functions in Machine Learning

Gabriel Furnieles
Towards Data Science


Photo by Tomáš Malík on Unsplash. Edited by author

The Sigmoid and SoftMax functions are activation functions used in Machine Learning, and more specifically in Deep Learning, for classification methods.

Activation function: Function that transforms the weighted sum of a neuron so that the output is non-linear

Note. The Sigmoid function is also called the Logistic function, since it was first introduced with the Logistic Regression algorithm.

Both functions take a value X from the range of the real numbers R and output a number between 0 and 1 that represents the probability of X belonging to a certain class.

Notation: P(Y=k|X=x) is read as “The probability of Y being k given the input X being x”.

Figure 1. Illustration of Sigmoid and SoftMax function. The output is read as “The probability of Y being the class k given the input X”. Image by author

But if both functions map the same transformation (i.e. do the same thing), what is the difference between them?
Sigmoid is used for binary classification methods where we only have 2 classes, while SoftMax applies to multiclass problems. In fact, the SoftMax function is an extension of the Sigmoid function.
Therefore, the input and output of both functions are slightly different. Sigmoid receives just one input and outputs a single number that represents the probability of belonging to class 1 (remember that we only have 2 classes, so the probability of belonging to class 2 is 1 - P(class1)). SoftMax, on the other hand, is vectorized: it takes a vector with as many entries as there are classes and outputs another vector where each component represents the probability of belonging to that class.

Figure 2. Illustration of the outputs of each function. An important property is that, for both Sigmoid and SoftMax, the probabilities of all the classes add up to 1. In the case of Sigmoid we obtain P(Y=class2|X) = 1 - P(Y=class1|X). Image by author
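As a quick illustration, here is a minimal NumPy sketch of both functions (the function names and the example scores are mine, chosen only for illustration):

```python
import numpy as np

def sigmoid(x):
    """Map a single real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Map a vector of real-valued scores to a probability distribution."""
    exps = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return exps / exps.sum()

# Binary case: one score in, one probability out
p_class1 = sigmoid(0.8)
print(p_class1, 1 - p_class1)   # P(class1) and P(class2) = 1 - P(class1)

# Multiclass case: one score per class in, one probability per class out
probs = softmax(np.array([0.8, -1.2, 2.5]))
print(probs, probs.sum())       # the entries add up to 1
```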

We already know what each function does and in which cases to use them. The only thing left is the mathematical formulation (More math notation!)

The mathematical formulation of Sigmoid and SoftMax functions

Sigmoid function

Imagine our model outputs a single value X that can take any value from the real numbers, X ∈ (-∞,+∞), and we want to transform that number into a probability P ∈ [0,1] that represents the probability of belonging to the first class (we only have 2 classes).

However, to solve this problem we must think the other way around: how do we transform a probability P ∈ [0,1] into a value X ∈ (-∞,+∞)?
Although it may seem counterintuitive, the solution lies in horse betting (mathematicians have always liked games).

In horse betting, there is a commonly used term called odds [1]. When we state that the odds of horse number 17 winning the race are 3/8, we are actually saying that out of every 11 races we expect the horse to win 3 of them and lose 8. Mathematically, the odds are the ratio between the probability of the event happening and the probability of it not happening, and are expressed as:

Odds formula
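In symbols, with p denoting the probability of the event, the formula reads:

```latex
\text{odds}(p) = \frac{p}{1 - p}
```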

The odds can take any positive value and therefore have no ceiling restriction [0,+∞). However, if we take the log-odds we find that the range changes to (-∞, +∞). The log of the odds is called the logit function:

Logit function formula. It maps probabilities from (0,1) to the entire real line (-∞, +∞)
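Written out, taking the natural logarithm of the odds:

```latex
\operatorname{logit}(p) = \log\left(\frac{p}{1 - p}\right), \qquad p \in (0, 1)
```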

Finally, the function that we were looking for, i.e. the Logistic function or SIGMOID FUNCTION, is the inverse of the logit (it maps values from the range (-∞, +∞) into (0,1))

Computing the inverse of the logit function
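Sketching the inversion step by step: set x = logit(p) and solve for p.

```latex
x = \log\left(\frac{p}{1 - p}\right)
\;\Longrightarrow\;
e^{x} = \frac{p}{1 - p}
\;\Longrightarrow\;
p = \frac{e^{x}}{1 + e^{x}}
```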

Thus obtaining the formula:

Sigmoid function formula.
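Writing the Sigmoid output as σ(x), this is:

```latex
\sigma(x) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}}
```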

Where X denotes the input (in the case of neural networks, the input is the weighted sum computed by the last neuron, usually represented by z = x1·w1 + x2·w2 + … + xn·wn)

SoftMax function

On the other hand, we’ve seen that SoftMax takes a vector as input. This vector has as many entries as there are classes. We will call it X (although another common notation in neural networks is Z, where each element of the vector is an output of the penultimate layer)

Input vector for k classes
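For k classes, the input vector can be written as:

```latex
X = (x_1, x_2, \dots, x_k), \qquad x_i \in (-\infty, +\infty)
```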

As with the Sigmoid function, the input belongs to the real values (in this case each of the vector entries), xi ∈ (-∞,+∞), and we want to output a vector where each component is a probability P ∈ [0,1]. Moreover, the output vector must be a probability distribution over all the predicted classes, i.e. all the entries of the vector must add up to 1. This restriction can be translated as: each input must belong to one class, and only one.

We can think of X as the vector that contains the logits of P(Y=i|X) for each of the classes, since the logits can be any real number (here i represents the class index). Remember that logit ∈ (-∞, +∞)

The logit of the probability of Y belonging to the i-th class, given x
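Under this interpretation, each entry of the vector is:

```latex
x_i = \operatorname{logit}\big(P(Y = i \mid X)\big) = \log\left(\frac{P(Y = i \mid X)}{1 - P(Y = i \mid X)}\right), \qquad i = 1, \dots, k
```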

However, unlike in the binary classification problem, we cannot apply the Sigmoid function. The reason is that when applying Sigmoid element-wise we obtain isolated probabilities, not a probability distribution over all the predicted classes, and therefore the output vector elements don’t add up to 1 [2].

Figure 3. Why the Sigmoid function can’t be used for multiclass classification. Notice that the output vector elements don’t add up to 1. Image by author
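To make this concrete, a short self-contained check (the three scores below are made up for illustration):

```python
import numpy as np

logits = np.array([0.8, -1.2, 2.5])            # hypothetical scores for 3 classes
sigmoid_probs = 1.0 / (1.0 + np.exp(-logits))  # applying Sigmoid element-wise

print(sigmoid_probs)        # isolated probabilities, one per class
print(sigmoid_probs.sum())  # generally != 1, so not a probability distribution
```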

To convert X into a probability distribution we can apply the exponential function and obtain the odds ∈ [0,+∞)

Remember that X is a vector and therefore log(odds) and odds are also vectors
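Applied entry by entry, the exponential turns the vector of logits into a vector of odds:

```latex
e^{X} = \left(e^{x_1}, e^{x_2}, \dots, e^{x_k}\right) = \left(\text{odds}_1, \text{odds}_2, \dots, \text{odds}_k\right)
```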

After that, we can see that the odds are a monotonically increasing function of the probability: when the probability increases, the odds increase as well, growing very quickly as the probability approaches 1 [2].

Figure 4. Plot of the odds as a function of the probability. Screenshot from GeoGebra

Therefore, we can use the odds (or equivalently exp(logit)) as a score to predict the probability, since the higher the odds, the higher the probability.

Finally, we can just normalize the result by dividing by the sum of all the odds, so that the range changes from [0,+∞) to [0,1] and we make sure that the sum of all the elements equals 1, thus building a probability distribution over all the predicted classes.

SoftMax function formula
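Putting the two steps together (exponentiate, then normalize), the i-th component of the output is:

```latex
\operatorname{SoftMax}(X)_i = \frac{e^{x_i}}{\sum_{j=1}^{k} e^{x_j}}, \qquad i = 1, \dots, k
```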

Now, if we take the same example as before, we see that the output vector is indeed a probability distribution and that all its entries add up to 1.

Figure 5. Using SoftMax we obtain a probability distribution over all the predicted classes. Note: the results have been rounded to 3 decimal places to facilitate reading. Image by author

References and resources

[1] Hartmann, K., Krois, J., Waske, B. (2018). The logit function. E-Learning Project SOGA: Statistics and Geospatial Data Analysis. Department of Earth Sciences, Freie Universitaet Berlin.

[2] Heo, Minsuk (2019). Logit and softmax in deep learning. YouTube.
