Practical Maths for Key Concepts in Classification Problems

- 1. Introduction
- 2. Sigmoid Function (Logistic Function)
- 3. Logistic Function in Logistic Regression
  - 3.1 Review on Linear Regression
  - 3.2 Logistic Function and Logistic Regression
- 4. Multi-class Classification and Softmax Function
  - 4.1 Methods of Multi-class Classifications
  - 4.2 Softmax Function
- 5. Cross-Entropy Loss and Log Loss
  - 5.1 Log Loss (Binary Cross-Entropy Loss)
  - 5.2 Derivation of Log Loss
  - 5.3 Cross-Entropy Loss (Multi-class)
  - 5.4 Cross-Entropy Loss vs Negative Log-Likelihood
- 6. Conclusions
- About Me
- References
1. Introduction
When learning logistic regression and deep learning (neural networks), I kept encountering terms such as:
- Sigmoid function
- Logistic function
- Softmax function
- Log loss
- Cross-entropy loss
- Negative log-likelihood
Every time I saw them, I did not really try to understand them, because there are existing libraries out there that do everything for me. For example, when I build logistic regression models, I directly use sklearn.linear_model.LogisticRegression
from Scikit-Learn. When I work on deep learning classification problems using PyTorch, I know that I need to add a sigmoid activation function at the output layer with Binary [Cross-Entropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) for binary classification, or add a (log) softmax function with Negative Log-Likelihood Loss (or just Cross-Entropy Loss instead) for multi-class classification problems.
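As a rough sketch of what that setup can look like in PyTorch (the layer sizes and dummy data below are made up purely for illustration):

```python
import torch
import torch.nn as nn

# Binary classification: sigmoid output + binary cross-entropy loss.
# (In practice, nn.BCEWithLogitsLoss combines the sigmoid and the loss
# in one numerically stable step.)
binary_head = nn.Sequential(nn.Linear(16, 1), nn.Sigmoid())
bce_loss = nn.BCELoss()

# Multi-class classification: nn.CrossEntropyLoss applies log-softmax +
# negative log-likelihood internally, so the model outputs raw scores (logits).
multiclass_head = nn.Linear(16, 3)
ce_loss = nn.CrossEntropyLoss()

x = torch.randn(4, 16)                       # a dummy batch of 4 instances
y_binary = torch.randint(0, 2, (4, 1)).float()
y_multi = torch.randint(0, 3, (4,))

print(bce_loss(binary_head(x), y_binary))    # scalar loss tensor
print(ce_loss(multiclass_head(x), y_multi))  # scalar loss tensor
```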
Recently, when I revisited these concepts, I found it useful to look into the math and understand what was buried underneath. So, in this post, I gathered materials from different sources and I will demonstrate the mathematical formulas with some explanations.
I have also made a cheat sheet for myself, which can be accessed on my GitHub.
2. Sigmoid Function (Logistic Function)
Sigmoid functions are a family of mathematical functions that share the same key property: an S-shaped curve, as the figure below shows.


The most common sigmoid function used in machine learning is the logistic function, defined by the formula below:

σ(x) = 1 / (1 + e^(−x))
The formula is simple, but it is quite useful because it offers us some nice properties:
- It maps any real-valued input into the range (0, 1), so the output can be read as a probability
- It uses the exponential function, making the mapping nonlinear
- It is differentiable
For property 1, it is not difficult to see that:
- When x is really large (goes to infinity), the output will be close to 1
- When x is really small (goes to -infinity), the output will be close to 0
- When x is 0, the output will be 1/2
For property 2, the nonlinear (exponential) relationship pushes most outputs to be either close to 0 or close to 1, instead of being stuck in the ambiguous zone in the middle.
Property 3 is also quite important: we need the function to be differentiable so we can calculate gradients when updating the weights from errors, whether using gradient descent in general ML problems or backpropagation in neural networks.
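As a minimal sketch of these three properties in plain NumPy (the helper names here are just for illustration):

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the logistic function: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0.0000454, 0.5, 0.9999546]
print(sigmoid_grad(0.0))                      # 0.25, the steepest point of the curve
```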
The properties of the logistic function are great, but how is the logistic function used in logistic regression to solve binary classification problems?
3. Logistic Function in Logistic Regression
3.1 Review on Linear Regression
Before going too far, let's review the concept of regression models. Regression has long been used in statistical modeling and is one of the supervised machine learning methods. It is the process of modeling the relationship between a dependent variable and one or more independent variables.

The most commonly used regression model is linear regression, which predicts values using linear combinations of features. The plot shown above is the simplest form of linear regression, called simple linear regression. It has two parameters, β_0 and β_1, which represent the intercept and the slope of the red best-fit line through the data points. With the two parameters trained on the existing data, we will be able to predict a new y value given an unseen x value:

y = β_0 + β_1·x
With the simplest form defined, we can generalize the linear regression formula to accommodate multiple dimensions of x, which is then called multiple linear regression. Basically, it uses multiple features (e.g., house size, age, location, etc.) to make predictions (e.g., sale price):

y = β_0 + β_1·x_1 + β_2·x_2 + … + β_n·x_n
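As a quick sketch of fitting such a model with scikit-learn (the data below is synthetic, made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: y = 2 + 3*x1 - 1*x2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 + 3 * X[:, 0] - 1 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.intercept_)   # ~2  (beta_0)
print(model.coef_)        # ~[3, -1]  (beta_1, beta_2)
```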
3.2 Logistic Function and Logistic Regression
Besides predicting actual values as in regression, linear regression models can also be used for classification problems by predicting the probability that a subject belongs to a specific class. This can be done simply by replacing y with p:

p = β_0 + β_1·x_1 + β_2·x_2 + … + β_n·x_n
The problem is that the probability p here is unbounded – it can take any value. So, in order to constrain it to the range between 0 and 1, we can pass the linear combination through the logistic function introduced in the previous section:

p = 1 / (1 + e^(−(β_0 + β_1·x_1 + β_2·x_2 + … + β_n·x_n)))
This will make sure that no matter what the predicted value is, the probability p will be in the range between 0 and 1, with all the advantages introduced earlier. However, the exponential form is not easy to deal with, so we can rearrange the formula using the odds function. Odds is the brother of probability, and it represents the ratio between "success" and "non-success": when p = 0, the odds is 0; when p = 0.5, the odds is 1; when p = 1, the odds is ∞. The relationship is shown below:

odds(p) = p / (1 − p)
With the odds function defined, we can plug in the logistic expression for p and get:

odds(p) = p / (1 − p) = e^(β_0 + β_1·x_1 + … + β_n·x_n)
It is easy to see the relationship between the two equations, so by taking the logarithm of both sides we have:

log( p / (1 − p) ) = β_0 + β_1·x_1 + … + β_n·x_n
We use the log to remove the exponential relationship, so the right-hand side goes back to the familiar linear term: it is still the linear combination of the inputs x and the parameters β. The left-hand side now becomes the logarithm of the odds, or, to give it a new name, the logit of the probability p. So, the whole equation becomes the definition of the logit function, or log-odds, which is the inverse of the standard logistic function. By modeling with the logit function, we have two advantages:
- We can still treat it as a linear regression model using our familiar linear function of the predictors
- We can use it to predict the true probability of the subject belonging to a class, by transforming the predicted value with the inverse logit function.
That is how logistic regression works under the hood using the logistic function, and it is perfectly suited to binary classification (2 classes): for classes A and B, if the predicted probability of being class A is above the threshold we set (e.g., 0.5), the instance is classified as class A; otherwise, it is classified as class B.
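Here is a small sketch of that workflow with scikit-learn, on synthetic data made up purely for illustration; it also recovers the predicted probability from the log-odds to show the inverse-logit relationship described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 1-D data: class 1 becomes more likely as x grows.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

x_new = np.array([[0.7]])
p = clf.predict_proba(x_new)[0, 1]                      # P(class 1 | x)
log_odds = clf.intercept_ + clf.coef_ @ x_new.ravel()   # beta_0 + beta_1 * x

print(p)                             # predicted probability
print(1 / (1 + np.exp(-log_odds)))   # same value, recovered from the log-odds
print(clf.predict(x_new))            # class label using the default 0.5 threshold
```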
We have just covered the binary classification using logistic regression. So, what if there are more than 2 classes?
4. Multi-class Classification and Softmax Function
4.1 Methods of Multi-class Classifications
There are several ways of using binary classifiers to handle multi-class classification problems, and two common ones are: one-versus-the-rest and one-versus-one.
The one-versus-the-rest method trains K − 1 binary classifiers to separate each class from the rest; the instances rejected by all of the classifiers are then assigned to class K. It works in many cases, but the biggest issue with the one-versus-the-rest method is the ambiguous region, where some instances may be put into multiple classes.
On the other hand, we have the one-versus-one method, which trains a binary classifier between each pair of classes. Similar to one-versus-the-rest, an ambiguous region also exists here, but this time there can be instances that are not classified into any of the classes. What's even worse is the efficiency: we need "n choose 2" classifiers for n classes, as shown in the equation below. For example, if we have 10 classes, we need 45 classifiers to use this method!

C(n, 2) = n! / (2!·(n − 2)!) = n·(n − 1) / 2
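As a quick sanity check of the classifier count (a one-liner in Python):

```python
from math import comb

# Number of one-versus-one classifiers needed for n classes: C(n, 2) = n*(n-1)/2
print(comb(10, 2))   # 45
```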
With these restrictions on one-versus-the-rest and one-versus-one methods, how can we do multi-class classifications then? The answer is to use the softmax function.
4.2 Softmax Function
The softmax function is a generalized form of the logistic function introduced in the binary classification part above. Here is the equation:

P(y = j | x) = e^(x_j) / Σ_{k=1}^{K} e^(x_k),   for j = 1, …, K
To interpret it: the probability of classifying the instance as class j is the exponential of the j-th element of the input divided by the sum of the exponentials of all the input elements. To better understand it, we can look at the example below:

An image classifier gives numerical outputs after a forward pass through the neural network; in this case, we have a 3×3 array where rows are instances and columns are classes. The first row contains the predictions for the first image: the scores are 5, 4, and 2 for the classes cat, dog, and horse respectively. The raw scores are hard to interpret on their own, so we feed them into a softmax function. By plugging the three numbers into the equation, we get the probability of the image being a cat, dog, and horse as 0.71, 0.26, and 0.04, which sum up to 1.
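A minimal NumPy sketch that reproduces these numbers (the max-subtraction trick is just for numerical stability and does not change the result):

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps the exponentials stable."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

scores = np.array([[5.0, 4.0, 2.0]])   # cat, dog, horse scores for the first image
print(softmax(scores).round(2))        # [[0.71 0.26 0.04]]
```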
Similar to the logistic function, the softmax function has the following advantages, which is why it is widely used in multi-class classification problems:
- It maps real-valued scores into probabilities that sum to 1
- It uses the exponential function, making the mapping nonlinear
- It is differentiable
Another way to interpret the softmax function is through the famous Bayes' theorem, where:

P(A | B) = P(B | A)·P(A) / P(B)
Applying it to our case in softmax, all of the terms can be interpreted as probabilities:

P(y = j | x) = P(x | y = j)·P(y = j) / Σ_{k=1}^{K} P(x | y = k)·P(y = k)
where each exponential in the softmax corresponds to a joint probability:

e^(x_j) = P(x | y = j)·P(y = j)
5. Cross-Entropy Loss and Log Loss
When we train classification models, we usually define a loss function that describes how much our predicted values deviate from the true values, and then use gradient descent methods to adjust the model parameters in order to lower the loss. This is an optimization problem; in deep learning, the required gradients are computed with backpropagation.
Before we start on this, I strongly recommend the article from Daniel Godoy: Understanding binary cross-entropy / log loss: a visual explanation. It gives a really good explanation of the practical math concepts underneath and shows them in a visual way. Here in this post, I am using slightly different conventions, closer to Wikipedia's.
Let’s get started!
5.1 Log Loss (Binary Cross-Entropy Loss)
One loss function commonly used in classification problems is the cross-entropy loss, called log loss in the binary case. Let's first write down its expression:

L = −( y·log(p) + (1 − y)·log(1 − p) )

where y is the true label (0 or 1) and p is the predicted probability of the positive class.
Since log(x) goes to −infinity as x approaches 0 and equals 0 when x is 1, we can use it to model the loss quite effectively. For an instance with true label 0:
- If the predicted value is 0, then the formula above will return a loss of 0.
- If the predicted value is 0.5, then the formula above will return a loss of 0.69
- If the predicted value is 0.99, then the formula above will return a loss of 4.6
As we can see here, the log magnifies classification mistakes, so a misclassification is penalized much more heavily than it would be by any linear loss function. The closer the predicted value is to the opposite of the true value, the higher the loss, approaching infinity in the limit. That's exactly what we want from a loss function. So where does the definition of log loss come from?
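Before deriving it, here is a small sketch that reproduces the three loss values above; the tiny epsilon is only there to avoid taking the log of exactly 0:

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy for a single instance."""
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

for p in (0.0, 0.5, 0.99):
    print(p, log_loss(0, p))   # ~0.0, ~0.693, ~4.605
```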
5.2 Derivation of Log Loss
Cross-entropy is a concept from information theory that measures the difference between two probability distributions. For a true probability distribution p and an estimated probability distribution q, it is defined as:

H(p, q) = H(p) + D_KL(p || q)
where H(p) is the entropy of the distribution p, and D_KL(p || q) is the Kullback–Leibler divergence of p from q, also called the relative entropy of p with respect to q.
The definitions of the entropy and the Kullback–Leibler divergence are shown below:

H(p) = −Σ_x p(x)·log p(x)

D_KL(p || q) = Σ_x p(x)·log( p(x) / q(x) )
Plugging them in, it is easy to get the expression of the cross-entropy:

H(p, q) = −Σ_x p(x)·log q(x)
For binary classification problems, there are only two classes, so we can write the sum out explicitly:

H(p, q) = −[ p(y = 1)·log q(y = 1) + p(y = 0)·log q(y = 0) ]
Note that the p here is the probability function rather than the distribution p itself. Also, treating the N training instances as an empirical distribution with probability 1/N each, the binary cross-entropy (log loss) can be expressed as:

Log Loss = −(1/N)·Σ_{i=1}^{N} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ]
Note that a minus sign is placed at the beginning because the log of values between 0 and 1 is negative. We flip the sign so that the loss is positive – we want to minimize the loss.
If we want, this formula can be expanded further to make the relationship with the model parameters θ explicit, as shown below, but it is essentially the same as what we have above:

J(θ) = −(1/N)·Σ_{i=1}^{N} [ y_i·log p(y_i = 1 | x_i; θ) + (1 − y_i)·log(1 − p(y_i = 1 | x_i; θ)) ]
5.3 Cross-Entropy Loss (Multi-class)
After deriving the binary case above, we can easily extend it to multi-class classification problems. Below is a generalized form of the cross-entropy loss function. It only adds the log of the predicted probability for the class the instance actually belongs to, similar to the binary case, where only part of the expression is taken into account and the rest is just 0:

L = −(1/N)·Σ_{i=1}^{N} Σ_{k=1}^{K} y_{i,k}·log(p_{i,k})

where y_{i,k} is 1 if instance i belongs to class k (and 0 otherwise), and p_{i,k} is the predicted probability that instance i belongs to class k.
Again, it can also be expressed with respect to the model parameters θ, but it is essentially the same equation:

J(θ) = −(1/N)·Σ_{i=1}^{N} Σ_{k=1}^{K} 1{y_i = k}·log p(y_i = k | x_i; θ)
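A small NumPy sketch of this multi-class loss, reusing the (rounded) softmax probabilities from the image example earlier; the labels are made up purely for illustration:

```python
import numpy as np

def cross_entropy(y_onehot, p_pred, eps=1e-15):
    """Multi-class cross-entropy loss averaged over N instances."""
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

# Predicted probabilities for 2 instances over 3 classes (rounded, rows ~sum to 1).
p_pred = np.array([[0.71, 0.26, 0.04],   # cat/dog/horse probabilities from above
                   [0.10, 0.80, 0.10]])
# One-hot true labels: instance 1 is class 0, instance 2 is class 1.
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]])

print(cross_entropy(y_onehot, p_pred))   # average of -log(0.71) and -log(0.80)
```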
5.4 Cross-Entropy Loss vs Negative Log-Likelihood
The cross-entropy loss is often compared to the negative log-likelihood. In fact, in PyTorch, the Cross-Entropy Loss is equivalent to a (log) softmax function followed by the Negative Log-Likelihood Loss for multi-class classification problems. So how are these two concepts really connected?
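As a quick numerical sketch of that PyTorch equivalence (random logits and labels, purely for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)            # raw scores for 4 instances, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # true class indices

# Option 1: CrossEntropyLoss applied directly to raw logits.
ce = nn.CrossEntropyLoss()(logits, targets)

# Option 2: explicit LogSoftmax followed by NLLLoss.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

print(torch.allclose(ce, nll))   # True: the two formulations match
```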
Before we dive into it, we have to understand the difference between probability and likelihood. In short:
- Probability: the chance of observing some outcome, given a fixed distribution (or fixed parameters) of the data
- Likelihood: how plausible a distribution (or set of parameters) is, given the observed sample data
So we are essentially modeling the same problem with different expressions, but they are equivalent:

L(θ | x_1, …, x_n) = P(x_1, …, x_n | θ) = Π_{i=1}^{n} p(x_i | θ)
Above is the definition of the likelihood of parameters θ given the data (from x_1 to x_n), which is equivalent to the probability of getting these data (x_1 to x_n) given the parameters θ, and it can be expressed as the product of each individual probability.
Knowing that p is the true probability distribution, we can further rewrite the product using the estimated probability distribution q, grouping repeated values together:

L(θ | x_1, …, x_n) = Π_{i=1}^{n} q(x_i) = Π_i q_i^(n_i)

where the last product runs over the distinct values i that appear in the data.
where q_i (the estimated probability distribution) and p_i (the true, empirical probability distribution) are:

q_i = q(x = i),   p_i = n_i / n
where n_i is the number of times the value i occurs in the training data. Then, by taking the negative log of the likelihood, we get:

−log L(θ) = −Σ_i n_i·log(q_i) = −n·Σ_i p_i·log(q_i) = n·H(p, q)
We can easily get the equation above because the log of a product becomes the sum of logs. Magically, the negative log-likelihood becomes (up to the constant factor n) the cross-entropy introduced in the sections above.
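To make that last step concrete, here is a tiny numeric check with a made-up sample of n = 5 observations over two possible values:

```python
import numpy as np

# Made-up sample: value "a" appears 3 times, value "b" appears 2 times (n = 5).
counts = np.array([3, 2])          # n_i
n = counts.sum()
p = counts / n                     # true (empirical) distribution p_i
q = np.array([0.7, 0.3])           # some estimated distribution q_i

neg_log_likelihood = -np.sum(counts * np.log(q))
cross_entropy = -np.sum(p * np.log(q))

print(neg_log_likelihood)          # ~3.478
print(n * cross_entropy)           # same value: n * H(p, q)
```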
6. Conclusions
To summarize the concepts introduced in this article so far:
- Sigmoid Function: A general mathematical function that has an S-shaped curve, or sigmoid curve, which is bounded, differentiable, and real.
- Logistic Function: A particular sigmoid function that is widely used in binary classification problems with logistic regression. It maps inputs from (−infinity, +infinity) to (0, 1), which makes it suitable for modeling the probability of binary events.
- Softmax Function: A generalized form of the logistic function to be used in multi-class classification problems.
- Log Loss (Binary Cross-Entropy Loss): A loss function that represents how much the predicted probabilities deviate from the true ones. It is used in binary cases.
- Cross-Entropy Loss: A generalized form of the log loss, which is used for multi-class classification problems.
- Negative Log-Likelihood: Another interpretation of the cross-entropy loss using the concepts of maximum likelihood estimation. It is equivalent to cross-entropy loss.
Thank you for reading! If you like this article, please follow my channel and/or become my referred member today (really appreciate it 🙏 ). I will keep writing to share my ideas and projects about Data Science. Feel free to contact me if you have any questions.
About Me
I am a data scientist at Sanofi. I embrace technology and learn new skills every day. You are welcome to reach me from Medium Blog, LinkedIn, or GitHub. My opinions are my own and not the views of my employer.
Please see my other articles:
- Time Series Pattern Recognition with Air Quality Sensor Data
- Real-Time Typeahead Search with Elasticsearch (AWS OpenSearch)
- Build REST API for Machine Learning Models using Python and Flask-RESTful
- Loan Default Prediction for Profit Maximization
- Loan Default Prediction with Berka Dataset
References
- Sigmoid Function (Wikipedia): https://en.wikipedia.org/wiki/Sigmoid_function
- Linear Regression (Wikipedia): https://en.wikipedia.org/wiki/Linear_regression
- Practical Statistics for Data Scientists: https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/
- Princeton ML Basics Lecture: https://www.cs.princeton.edu/courses/archive/spring16/cos495/slides/ML_basics_lecture7_multiclass.pdf
- Softmax Classifier: https://datascience.stackexchange.com/a/24112
- Understanding binary cross-entropy / log loss: a visual explanation: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a
- Cross-Entropy (Wikipedia): https://en.wikipedia.org/wiki/Cross_entropy
- Kullback-Leibler Divergence (Wikipedia): https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
- Cross-Entropy Loss: https://stats.stackexchange.com/a/262746
- Negative Log Likelihood vs Cross Entropy: https://stats.stackexchange.com/questions/468818/machine-learning-negative-log-likelihood-vs-cross-entropy/468822#468822