
Understanding Sigmoid, Logistic, Softmax Functions, and Cross-Entropy Loss (Log Loss)

Practical Maths for Key Concepts in Logistic Regression and Deep Learning in Classification Problems

Photo by Camylla Battani on Unsplash

  • 1. Introduction
  • 2. Sigmoid Function (Logistic Function)
  • 3. Logistic Function in Logistic Regression: 3.1 Review on Linear Regression · 3.2 Logistic Function and Logistic Regression
  • 4. Multi-class Classification and Softmax Function: 4.1 Methods of Multi-class Classifications · 4.2 Softmax Function
  • 5. Cross-Entropy Loss and Log Loss: 5.1 Log Loss (Binary Cross-Entropy Loss) · 5.2 Derivation of Log Loss · 5.3 Cross-Entropy Loss (Multi-class) · 5.4 Cross-Entropy Loss vs Negative Log-Likelihood
  • 6. Conclusions
  • About Me

1. Introduction

When learning logistic regression and deep learning (neural networks), I kept encountering terms such as:

  • Sigmoid function
  • Logistic function
  • Softmax function
  • Log loss
  • Cross entropy Loss
  • Negative log-likelihood

Every time I saw them, I did not really try to understand them, because there are existing libraries out there that do everything for me. For example, when I build logistic regression models, I directly use sklearn.linear_model.LogisticRegression from Scikit-Learn. When I work on deep learning classification problems using PyTorch, I know that I need to add a sigmoid activation function at the output layer with Binary [Cross-Entropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) for binary classification, or add a (log) softmax function with Negative Log-Likelihood Loss (or just Cross-Entropy Loss instead) for multiclass classification problems.

Recently, when I revisited these concepts, I found it useful to look into the math and understand what was buried underneath. So, in this post, I gathered materials from different sources and I will demonstrate the mathematical formulas with some explanations.

I have also made a cheat sheet for myself, which can be accessed on my GitHub.

2. Sigmoid Function (Logistic Function)

Sigmoid functions are a family of mathematical functions that share one key property: they have S-shaped curves, as the figure below shows.

Members of Sigmoid Functions Family, from Wikipedia
The Curve of a Logistic Function, from Wikipedia

The most common sigmoid function used in machine learning is the logistic function, σ(x) = 1 / (1 + e^(-x)), given by the formula below.

Image by author

The formula is simple, but it is quite useful because it offers us some nice properties:

  1. It maps any real-valued input into the range (0, 1), so the output can be interpreted as a probability
  2. It uses the exponential function
  3. It is differentiable

For property 1, it is not difficult to see that:

  • When x is really large (goes to infinity), the output will be close to 1
  • When x is really small (goes to -infinity), the output will be close to 0
  • When x is 0, the output will be 1/2

For property 2, the exponential creates a nonlinear relationship that pushes most points toward either 0 or 1, instead of leaving them stuck in the ambiguous zone in the middle.

Property 3 is also quite important: we need the function to be differentiable to calculate the gradient when updating the weights from errors, whether using gradient descent in general ML problems or backpropagation in neural networks.
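As a minimal illustration of these properties, the sketch below (plain NumPy, with a few arbitrary input points) evaluates the logistic function and its derivative, σ′(x) = σ(x)(1 − σ(x)):

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the logistic function: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))       # ~[0.00005, 0.27, 0.5, 0.73, 0.99995] -- squeezed into (0, 1)
print(sigmoid_grad(x))  # largest at x = 0 (0.25), vanishing toward the extremes
```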

The properties of the logistic function are great, but how is the logistic function used in logistic regression to solve binary classification problems?

3. Logistic Function in Logistic Regression

3.1 Review on Linear Regression

Before going too far, let’s review the concept of regression models. Regression has long been used in statistical modeling and is part of the supervised machine learning methods. It is the process of modeling the relationship between a dependent variable and one or more independent variables.

Example of Simple Linear Regression, from Wikipedia

The most commonly used regression model is linear regression, which predicts values using linear combinations of features. The plot shown above is the simplest form of linear regression, called simple linear regression. It has two parameters, β_0 and β_1, which represent the intercept and slope of the red best-fit line through the data points. With the two parameters trained on the existing data, we can predict a new y value given an unseen x value.

Simple linear regression, Image by author

With the simplest form defined, we can generalize the linear regression formula to accommodate multiple dimensions of x, which is called multiple linear regression. It uses multiple features (e.g., house size, age, location, etc.) to make predictions (e.g., sale price).

Generalized linear regression model, Image by author
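For a concrete example of that generalized form, here is a small sketch (entirely made-up features and parameters) that predicts a sale price as a linear combination of a feature vector and the parameters β:

```python
import numpy as np

# Hypothetical fitted parameters: [intercept, beta_size, beta_age, beta_location]
beta = np.array([50_000.0, 300.0, -1_000.0, 20_000.0])

# One house: size (sq ft), age (years), location score
x = np.array([1_500.0, 10.0, 4.0])

# Linear combination: prepend 1 so the intercept is handled by the same dot product
y_hat = np.dot(np.concatenate(([1.0], x)), beta)
print(y_hat)  # 570000.0 = 50,000 + 300*1,500 - 1,000*10 + 20,000*4
```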

3.2 Logistic Function and Logistic Regression

Besides predicting continuous values as in regression, linear regression models can also be used for classification problems by predicting the probability that a subject belongs to a specific class. This can be done simply by replacing y with p:

Image by author

The problem is that the probability p here is unbounded – it can take any value. So, in order to constrain it to the range between 0 and 1, we can map it with the logistic function introduced in the previous section:

Image by author

This makes sure that no matter what the predicted value is, the probability p will fall between 0 and 1, with all the advantages introduced earlier. However, the exponential form is not easy to work with, so we can rearrange the formula using the odds function. Odds is a close relative of probability: it is the ratio between the probability of "success" and the probability of "non-success", odds = p / (1 − p). When p = 0, the odds are 0; when p = 0.5, the odds are 1; when p = 1, the odds go to ∞. The relationship is shown below:

Relationship between Odds and Probability, Image by author

With the odds function defined, we get:

Image by author

It is easy to see the similarity between the two equations, so we have:

Image by author

We take the log to remove the exponential, which brings us back to a familiar form. The right side of the equation is still the linear combination of the input x and the parameters β. The left side is now the logarithm of the odds, which we give a new name: the logit of probability p. The whole equation thus becomes the definition of the logit function, or log-odds, which is the inverse of the standard logistic function. By modeling with the logit function, we gain two advantages:

  1. We can still treat it as a linear regression model, using our familiar linear function of the predictors
  2. We can use it to predict the true probability of the subject belonging to a class, by transforming the predicted value with the inverse logit (logistic) function.

That is how logistic regression works under the hood using the logistic function, and why it is perfectly suited to binary classification (2 classes): for classes A and B, if the predicted probability of being class A is above the threshold we set (e.g., 0.5), then the instance is classified as class A; otherwise, it is classified as class B.
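To make the round trip concrete, here is a minimal sketch (with made-up coefficients β_0 and β_1) that maps a linear score to a probability with the logistic function and back with the logit:

```python
import numpy as np

def logistic(z):
    """Inverse logit: log-odds -> probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Logit: probability -> log-odds, log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

# Hypothetical logistic regression with a single feature
beta0, beta1 = -1.0, 2.0
x = 1.5

log_odds = beta0 + beta1 * x             # linear part, unbounded
p = logistic(log_odds)                   # probability of class A
print(log_odds, round(p, 3))             # 2.0 0.881
print(np.isclose(logit(p), log_odds))    # True: logit inverts the logistic function

print("A" if p >= 0.5 else "B")          # threshold at 0.5 -> class A
```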

We have just covered the binary classification using logistic regression. So, what if there are more than 2 classes?

4. Multi-class Classification and Softmax Function

4.1 Methods of Multi-class Classifications

There are several ways of using binary classifiers to handle multi-class classification problems, and two common ones are: one-versus-the-rest and one-versus-one.

The one-versus-the-rest method trains K − 1 binary classifiers to separate each class from the rest. Instances rejected by all of the classifiers are then classified as class K. It works in many cases, but the biggest issue with the one-versus-the-rest method is the ambiguous region, where some instances may be assigned to multiple classes.

On the other hand, we have the one-versus-one method, which trains a binary classifier between each pair of classes. Similar to one-versus-the-rest, the ambiguous region also exists here, but this time there can be instances that are not classified into any of the classes. What’s even worse is the efficiency: we need n choose 2 classifiers for n classes, as shown in the equation below. For example, if we have 10 classes, we need 45 classifiers to use this method!

Number of classifiers needed for the one-versus-one method, Image by author
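A quick check of those counts (a throwaway snippet using Python's math.comb):

```python
from math import comb

n_classes = 10
print(comb(n_classes, 2))  # 45 classifiers for one-versus-one
print(n_classes - 1)       # 9 classifiers for one-versus-the-rest (K - 1)
```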

With these restrictions on one-versus-the-rest and one-versus-one methods, how can we do multi-class classifications then? The answer is to use the softmax function.

4.2 Softmax Function

The softmax function is a generalization of the logistic function introduced in the binary classification part above. Here is the equation:

Softmax Function, Image by author

To interpret it: the probability of classifying the instance as class j is the exponential of the j-th element of the input divided by the sum of the exponentials of all the input elements. To better understand it, let’s look at the example below:

Example of Applying Softmax Function to Model Output, by Sewade Ogun (Public License)

An image classifier gives numerical outputs after a forward pass through the neural network. In this case, we have a 3×3 array where rows are instances and columns are classes. The first row contains the predictions for the first image: the scores are 5, 4, and 2 for the classes cat, dog, and horse respectively. These raw scores are hard to interpret on their own, so we feed them into the softmax function. By plugging the three numbers into the equation, we get probabilities of the image being a cat, dog, or horse of 0.71, 0.26, and 0.04, which sum up to 1.
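The first row of that example is easy to reproduce with a small NumPy sketch of my own (subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def softmax(scores):
    """Softmax over the last axis, with the max subtracted for numerical stability."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

scores = np.array([5.0, 4.0, 2.0])   # raw outputs for cat, dog, horse
probs = softmax(scores)
print(np.round(probs, 2))            # [0.71 0.26 0.04]
print(probs.sum())                   # 1.0
```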

Similar to the logistic function, the softmax function has the following advantages, which is why it is widely used in multi-class classification problems:

  1. It maps a vector of real-valued scores into a probability distribution (non-negative values that sum to 1)
  2. It uses the exponential function
  3. It is differentiable

Another way to interpret the softmax function is through the famous Bayes’ Theorem, where:

Bayes Theorem, Image by author

Applying it to our case in softmax, all of the terms can be interpreted as probabilities:

Image by author

where

Image by author

5. Cross-Entropy Loss and Log Loss

When we train classification models, we usually define a loss function that describes how much our predicted values deviate from the true values. Then we use gradient descent methods to adjust the model parameters in order to lower the loss. This is an optimization problem; in deep learning, the gradients it needs are computed via backpropagation.

Before we start, I strongly recommend the article from Daniel Godoy: Understanding binary cross-entropy / log loss: a visual explanation. It gives a really good explanation of the practical math concepts underneath and presents them visually. In this post, I use slightly different conventions, closer to Wikipedia’s.

Let’s get started!

5.1 Log Loss (Binary Cross-Entropy Loss)

One commonly used loss function in classification problems is the cross-entropy loss, called log loss in the binary case. Let’s first write down the expression:

Log Loss (Binary Cross Entropy), Image by author

The log function goes to -infinity as its argument approaches 0 and equals 0 when its argument is 1, so we can use it to model the loss quite effectively. For an instance with true label 0:

  • If the predicted value is 0, then the formula above will return a loss of 0.
  • If the predicted value is 0.5, then the formula above will return a loss of 0.69
  • If the predicted value is 0.99, then the formula above will return a loss of 4.6

As we can see, the log magnifies the classification mistake, so misclassifications are penalized much more heavily than with any linear loss function. The closer the predicted value is to the opposite of the true value, the higher the loss, eventually approaching infinity. That’s exactly the behavior we want from a loss function. So where does the definition of log loss come from?
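Those three numbers can be verified with a few lines of code (a minimal sketch; the eps clipping only guards against taking log(0)):

```python
import numpy as np

def log_loss_single(y_true, p_pred, eps=1e-15):
    """Log loss of one prediction: -[y*log(p) + (1-y)*log(1-p)]."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# True label 0, predictions 0, 0.5, and 0.99
for p in (0.0, 0.5, 0.99):
    print(p, round(float(log_loss_single(0, p)), 2))  # losses: 0.0, 0.69, 4.61
```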

5.2 Derivation of Log Loss

Cross-entropy is a concept from information theory that measures the difference between two probability distributions. For a true probability distribution p and an estimated distribution q, it is defined as:

Cross-Entropy, Image by author

where H(p) is the entropy of distribution p, and D_KL(p||q) is the Kullback–Leibler divergence of p from q, also called the relative entropy of p with respect to q.

The definitions of entropy and Kullback–Leibler divergence are shown below:

Definitions of entropy and Kullback-Leibler Divergence, Image by author

Plugging them in, it is easy to get the expression of cross-entropy:

Image by author
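The identity H(p, q) = H(p) + D_KL(p||q) is easy to check numerically with two made-up discrete distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # estimated distribution

entropy = -np.sum(p * np.log(p))         # H(p)
kl = np.sum(p * np.log(p / q))           # D_KL(p || q)
cross_entropy = -np.sum(p * np.log(q))   # H(p, q)

print(np.isclose(cross_entropy, entropy + kl))  # True
```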

For binary classification problems, there are only two classes, so we can express them explicitly:

Image by author

Note that p here refers to the predicted probability function rather than the distribution p above. Also, we can express the true distribution p(y) as the empirical distribution that assigns weight 1/N to each of the N samples, so the binary cross-entropy (log loss) can be expressed as:

Log Loss (Binary Cross Entropy), Image by author

Note that a minus sign is placed at the beginning because the log of values between 0 and 1 is negative. We flip the sign so that the loss is positive, since we want to minimize it.

If we want, this formula can be further expanded to make the dependence on the model parameters θ explicit, as shown below, but it is essentially the same as what we have above.

Log Loss with respect to Model Parameters, Image by author
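As a sanity check (a small sketch with made-up labels and predictions), the averaged formula matches Scikit-Learn’s sklearn.metrics.log_loss:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.4])   # predicted probability of class 1

# Binary cross-entropy: mean of -[y*log(p) + (1-y)*log(1-p)]
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(round(manual, 4))                    # 0.3414
print(round(log_loss(y_true, p_pred), 4))  # 0.3414
```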

5.3 Cross-Entropy Loss (Multi-class)

After deriving the binary case above, we can easily extend it to multi-class classification problems. Below is the generalized form of the cross-entropy loss function. For each instance, it only adds the log of the predicted probability of the true class k; just as in the binary case, only part of the expression is active for each instance, and the other terms are 0.

Cross-Entropy Loss (Generalized Form), Image by author

Again, it can also be expressed with respect to the model parameters θ, but it is essentially the same equation:

Cross-Entropy Loss with respect to Model Parameter, Image by author
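Here is a minimal sketch of the generalized form (made-up predictions), using one-hot true labels so that only the log-probability of the true class contributes for each instance:

```python
import numpy as np

# Predicted class probabilities for 3 instances over 3 classes (each row sums to 1)
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
])

# One-hot true labels for classes 0, 1, 2 respectively
y_onehot = np.eye(3)

# Only the true-class term survives in each row of the sum
loss_per_instance = -np.sum(y_onehot * np.log(probs), axis=1)
print(np.round(loss_per_instance, 3))      # [0.357 0.223 0.693]
print(round(loss_per_instance.mean(), 3))  # 0.424
```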

5.4 Cross-Entropy Loss vs Negative Log-Likelihood

The cross-entropy loss is often compared to the negative log-likelihood. In fact, in PyTorch, the Cross-Entropy Loss is equivalent to a (log) softmax function followed by the Negative Log-Likelihood Loss for multiclass classification problems. So how are these two concepts really connected?

Before we dive into it, we have to understand the difference between probability and likelihood. In short:

  • Probability: the chance of observing some outcome, given a fixed distribution (fixed parameters) of the data
  • Likelihood: how plausible a distribution (a set of parameters) is, given the observed sample data

So we are essentially modeling the same problem using different expressions, and they are equivalent:

Expression of Likelihood, Image by author

Above is the definition of the likelihood of the parameters θ given the data (x_1 to x_n), which is equivalent to the probability of observing these data (x_1 to x_n) given the parameters θ, and it can be expressed as the product of the individual probabilities.

Knowing that p is the true probability distribution, we can further rewrite the product using the estimated probability distribution as follows:

Image by author

where q_i (estimated probability distribution) and p_i (true probability distribution) are:

Image by author

where n_i is the number of times i occurs in the training data. Then, by taking the negative log of the likelihood, we get:

Negative Log-Likelihood is Equivalent to Cross-Entropy, Image by author

We can easily get the equation above, since the log of a product is the sum of logs. Magically, the negative log-likelihood becomes the cross-entropy introduced in the sections above.
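That equivalence is also easy to verify in PyTorch with a quick sketch: applying nn.LogSoftmax followed by nn.NLLLoss to random logits gives the same value as nn.CrossEntropyLoss applied directly to the logits.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)              # raw model outputs: 4 instances, 3 classes
targets = torch.tensor([0, 2, 1, 2])    # true class indices

# Option 1: cross-entropy loss directly on the logits
ce = nn.CrossEntropyLoss()(logits, targets)

# Option 2: log-softmax followed by negative log-likelihood loss
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

print(torch.allclose(ce, nll))  # True
```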

6. Conclusions

To summarize the concepts introduced in this article so far:

  • Sigmoid Function: A general mathematical function that has an S-shaped curve, or sigmoid curve, which is bounded, differentiable, and real.
  • Logistic Function: A particular sigmoid function that is widely used in binary classification problems with logistic regression. It maps inputs from -infinity to infinity to the range 0 to 1, which makes it suitable for modeling the probability of binary events.
  • Softmax Function: A generalized form of the logistic function to be used in multi-class classification problems.
  • Log Loss (Binary Cross-Entropy Loss): A loss function that represents how much the predicted probabilities deviate from the true ones. It is used in binary cases.
  • Cross-Entropy Loss: A generalized form of the log loss, which is used for multi-class classification problems.
  • Negative Log-Likelihood: Another interpretation of the cross-entropy loss using the concepts of maximum likelihood estimation. It is equivalent to cross-entropy loss.

Thank you for reading! If you like this article, please follow my channel and/or become my referred member today (really appreciate it 🙏 ). I will keep writing to share my ideas and projects about Data Science. Feel free to contact me if you have any questions.


About Me

I am a data scientist at Sanofi. I embrace technology and learn new skills every day. You are welcome to reach me from Medium Blog, LinkedIn, or GitHub. My opinions are my own and not the views of my employer.
