Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to multi-class problems.
It can be used to predict the probabilities of different possible outcomes of some event, such as the probability that a patient has each of several possible diseases, based on their characteristics (gender, age, blood pressure, outcomes of various tests, etc.).
In this article we will derive the softmax regression model from first principles and show how to use it to solve an image classification task.
Before reading this article, I strongly recommend that you read my previous article on Logistic Regression.
Background: Multi-Class Classification Problems
Recall that in supervised machine learning problems, we are given a training set of n labeled samples: D = {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}, where xᵢ is an m-dimensional vector that contains the features of sample i, and yᵢ represents the label of that sample. Our goal is to build a model whose predictions are as close as possible to the true labels.
In multi-class classification problems, the target label can take any one of k classes, i.e., y ∈ {1, …, k}. For example, in a handwritten digit recognition task, k = 10, since there are 10 possible digits (0–9).
As in binary classification problems, we distinguish between two types of classifiers:
- Deterministic classifiers output a hard label for each sample, without providing probability estimates for the classes. Examples of such classifiers include K-nearest neighbors, decision trees, and SVMs.
- Probabilistic classifiers output probability estimates for the k classes and then assign a label based on these probabilities (typically the label of the class with the highest probability). Examples of such classifiers include softmax regression, Naive Bayes classifiers, and neural networks that use softmax in the output layer.
The Softmax Regression Model
Given a sample (x, y), the softmax regression model outputs a vector of probabilities p = (p₁, …, pₖ)ᵗ, where pᵢ represents the probability that the sample belongs to class i:
pᵢ = P(y = i|x),  for i = 1, …, k
The probabilities in the vector must sum to 1:
p₁ + p₂ + … + pₖ = 1
In (binary) logistic regression, our assumption was that the log odds (the logarithm of the ratio between p and 1 − p) was a linear combination of the input features (the vector x).
In softmax regression, we choose one of the probabilities as a reference (let’s say pₖ), and assume that the log-odds ratio between each probability pᵢ and pₖ is a linear combination of the input features.
In other words, we can write the log-odds between pᵢ and pₖ as a dot product of some weight vector wᵢ and the feature vector x:
log(pᵢ/pₖ) = wᵢᵗx,  for i = 1, …, k − 1
Note that in softmax regression we have a separate vector of parameters wᵢ for each class i. The set of all the parameters of the model is typically stored in a matrix W of size (m + 1) × k, obtained by concatenating w₁, …, wₖ as columns (the extra row corresponds to the intercept term, which is handled by appending a constant feature of 1 to x):
W = (w₁ | w₂ | … | wₖ)
By taking the exponent of both sides of the log-odds equation we get:
pᵢ/pₖ = exp(wᵢᵗx)  ⟹  pᵢ = pₖ · exp(wᵢᵗx)
Since all the k probabilities sum to one, we can write:
pₖ · (1 + Σⱼ₌₁ᵏ⁻¹ exp(wⱼᵗx)) = 1  ⟹  pₖ = 1 / (1 + Σⱼ₌₁ᵏ⁻¹ exp(wⱼᵗx))
We can now use the expression for pₖ to find all the other probabilities:
pᵢ = exp(wᵢᵗx) / (1 + Σⱼ₌₁ᵏ⁻¹ exp(wⱼᵗx)),  for i = 1, …, k − 1
Since all the k probabilities must sum to 1, the probability pₖ is completely determined once all the rest are known. Therefore, we have only k − 1 separately identifiable vectors of coefficients. This means that we can arbitrarily choose wₖ = 0 to make sure that exp(wₖᵗx) = 1, which in turn allows us to write the equations above more compactly as follows:
pᵢ = exp(wᵢᵗx) / Σⱼ₌₁ᵏ exp(wⱼᵗx),  for i = 1, …, k
The function that converts the linear scores wᵢᵗx into probabilities is called softmax and is described next.
The Softmax Function
The softmax function, σ(z): ℝᵏ → ℝᵏ, converts a vector of k real numbers z = (z₁, …, zₖ)ᵗ into a probability vector (σ(z)₁, …, σ(z)ₖ)ᵗ:
σ(z)ᵢ = exp(zᵢ) / Σⱼ₌₁ᵏ exp(zⱼ),  for i = 1, …, k
It can be easily verified that all the components of σ(z) are in the range (0,1), and that their sum is 1.
The name "softmax" derives from the fact that the function is a smooth approximation of the argmax function. For example, the softmax of the vector (1, 2, 6) is approximately (0.007, 0.018, 0.976), which puts almost all of the unit weight on the maximum element of the vector.
The softmax function is an extension of the sigmoid (logistic) function to the multi-class case: it can be shown that when there are only two classes, softmax reduces to the sigmoid function (left as an exercise to the reader).
The softmax function is also commonly used in neural networks to convert the outputs of the final layer of the network into probabilities.
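To make this concrete, here is a minimal NumPy sketch of the softmax function (the helper below is my own illustration, not library code; the shift by max(z) is a common trick for numerical stability and does not change the result):
import numpy as np

def softmax(z):
    # convert a vector of k real numbers into a probability vector
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract max(z) to avoid numerical overflow
    return e / e.sum()

print(np.round(softmax([1, 2, 6]), 3))  # [0.007 0.018 0.976]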
The following diagram summarizes the computational process of the softmax regression model:

Mathematically, this process can be written as follows:
z = Wᵗx = (w₁ᵗx, …, wₖᵗx)ᵗ,  p = σ(z)
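As a rough NumPy sketch of this forward pass (the weight matrix here is random, purely to illustrate the shapes; in the real model it is learned from the training data):
import numpy as np

rng = np.random.default_rng(0)
m, k = 4, 3                          # m features, k classes
W = rng.normal(size=(m + 1, k))      # one column of weights per class (plus a bias row)

x = rng.normal(size=m)               # a single sample
x_ext = np.append(x, 1.0)            # append a constant 1 to account for the bias term

z = W.T @ x_ext                      # k linear scores, z_i = w_i^t x
p = np.exp(z - z.max())
p /= p.sum()                         # softmax turns the scores into probabilities

print(p.sum())                       # the probabilities sum to 1
print('predicted class:', p.argmax() + 1)   # +1 since classes are numbered 1 to k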
Cross-Entropy Loss
Our goal is to find the set of parameters W that will make the model’s predictions p = σ(w₁ᵗx, …, wₖᵗx) as close as possible to the true labels y.
Note that the output of our model is a vector of probabilities, while the true label is a scalar. In order to make them comparable, we encode the labels using one-hot encoding, i.e., we convert each label y into a binary vector y = (y₁, …, yₖ)ᵗ, where yᵢ = 1 if i is the true class and 0 otherwise.
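For example, here is a quick NumPy way to one-hot encode integer labels (a small illustration that follows the article's convention of numbering the classes 1 to k):
import numpy as np

y = np.array([2, 1, 3])        # labels of three samples, classes numbered 1 to k
k = 3
Y = np.eye(k)[y - 1]           # row i is the one-hot encoding of label y_i
print(Y)
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]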
A loss function is used to measure how far our model’s prediction is from the true label. The loss function used in softmax regression is called cross-entropy loss, which is an extension of log loss to the multi-class case. It is defined as follows:
L(y, p) = −Σᵢ₌₁ᵏ yᵢ log(pᵢ)
For example, assume that in a three class problem (k = 3), we are given a sample whose true class is class no. 2 (i.e., y = (0, 1, 0)ᵗ), and the prediction of our model for this sample is p = (0.3, 0.6, 0.1)ᵗ. Then the cross-entropy loss induced by this sample is:
L = −(0 · log 0.3 + 1 · log 0.6 + 0 · log 0.1) = −log 0.6 ≈ 0.511
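A quick NumPy check of this number:
import numpy as np

y = np.array([0, 1, 0])         # one-hot encoded true label (class no. 2)
p = np.array([0.3, 0.6, 0.1])   # predicted probabilities

loss = -np.sum(y * np.log(p))   # cross-entropy loss
print(round(loss, 3))           # 0.511, i.e. -log(0.6)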
Similar to log loss, we can show that the cross-entropy loss is the negative of the log-likelihood of the model, under the assumption that the labels are sampled from a categorical distribution (a generalization of the Bernoulli distribution to k possible outcomes).
Proof:
Given a model of the data (the labels) as a categorical distribution with probabilities p = (p₁, …, pₖ)ᵗ, the probability that a given sample belongs to class i is pᵢ:
P(y = i|p) = pᵢ
Therefore, the probability that the true label of the sample is y is:
P(y|p) = p₁^(y₁) · p₂^(y₂) · … · pₖ^(yₖ)
Explanation: if the correct class of the given sample is i, then yᵢ = 1, and for all j ≠ i, yⱼ = 0. Hence, P(y|p) = pᵢ, which is the probability that the sample belongs to class i.
Therefore, the log likelihood of our model is:
log P(y|p) = Σᵢ₌₁ᵏ yᵢ log(pᵢ)
The cross-entropy loss is exactly the negative of this function. Therefore, minimizing the cross-entropy loss is equivalent to maximizing the log likelihood of the model.
Gradient Descent
As was the case in logistic regression, there is no closed-form solution for the optimal W that minimizes the cross-entropy loss. Therefore, we need to use an iterative optimization method such as gradient descent in order to find the minimum loss.
Fortunately, the cross-entropy loss has a simple gradient (although its derivation is not so simple…). The gradient of the cross-entropy loss with respect to each of the parameter vectors wⱼ is:
∂L/∂wⱼ = (pⱼ − yⱼ)x
Proof:
We first write the cross-entropy loss as an explicit function of the weights:
L(W) = −Σᵢ₌₁ᵏ yᵢ log(pᵢ),  where  pᵢ = exp(zᵢ) / Σₗ₌₁ᵏ exp(zₗ)  and  zᵢ = wᵢᵗx
Using the chain rule, we have that:
∂L/∂wⱼ = (∂L/∂zⱼ) · (∂zⱼ/∂wⱼ)
We start by computing the first partial derivative. Using the rule for the derivative of a sum, the derivative of the logarithm, and the chain rule, we have:
∂L/∂zⱼ = −Σᵢ₌₁ᵏ ∂(yᵢ log pᵢ)/∂zⱼ = −Σᵢ₌₁ᵏ (yᵢ/pᵢ) · ∂pᵢ/∂zⱼ
Next, we compute the partial derivative ∂pᵢ/∂zⱼ (i.e., the derivative of the softmax function) using the quotient rule. We distinguish between two cases:
- If i = j, then:
∂pᵢ/∂zⱼ = [exp(zᵢ) · Σₗ exp(zₗ) − exp(zᵢ) · exp(zⱼ)] / (Σₗ exp(zₗ))² = pᵢ − pᵢpⱼ = pᵢ(1 − pⱼ)
- If i ≠ j, then:
∂pᵢ/∂zⱼ = [0 · Σₗ exp(zₗ) − exp(zᵢ) · exp(zⱼ)] / (Σₗ exp(zₗ))² = −pᵢpⱼ
Together, these two equations give us the derivative of the softmax function:
∂pᵢ/∂zⱼ = pᵢ(δᵢⱼ − pⱼ),  where δᵢⱼ is the Kronecker delta (δᵢⱼ = 1 if i = j and 0 otherwise)
Using this result, we can finish the computation of the derivative of L:
∂L/∂zⱼ = −Σᵢ₌₁ᵏ (yᵢ/pᵢ) · pᵢ(δᵢⱼ − pⱼ) = −Σᵢ₌₁ᵏ yᵢ(δᵢⱼ − pⱼ) = −yⱼ + pⱼ · Σᵢ₌₁ᵏ yᵢ
Since exactly one of the yᵢ is 1 and the others are 0, we can further simplify this derivative to:
∂L/∂zⱼ = pⱼ − yⱼ
The partial derivative of zⱼ with respect to wⱼ is simply the input vector x:
∂zⱼ/∂wⱼ = ∂(wⱼᵗx)/∂wⱼ = x
(see this article for basic rules of matrix calculus).
Therefore, the derivative of the cross-entropy loss with respect to each of the weight vectors is:
∂L/∂wⱼ = (∂L/∂zⱼ) · (∂zⱼ/∂wⱼ) = (pⱼ − yⱼ)x
With these gradients, we can use (stochastic) gradient descent to minimize the loss function on the given training set.
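To illustrate, here is a minimal NumPy sketch of a single stochastic gradient descent step based on this gradient (a toy example with made-up data, not the implementation we will use below, which relies on Scikit-Learn):
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sgd_step(W, x, y_onehot, lr=0.1):
    # one gradient descent update of the (m + 1) x k weight matrix W on a single sample
    x_ext = np.append(x, 1.0)               # constant 1 for the bias term
    p = softmax(W.T @ x_ext)                # predicted probabilities
    grad = np.outer(x_ext, p - y_onehot)    # column j of the gradient is (p_j - y_j) * x
    return W - lr * grad

# toy usage
rng = np.random.default_rng(42)
m, k = 4, 3
W = np.zeros((m + 1, k))
x, y_onehot = rng.normal(size=m), np.eye(k)[1]   # a sample whose true class is class no. 2
W = sgd_step(W, x, y_onehot)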
Practice Question
You are given a set of images and you need to classify them into dogs/cats and outdoor/indoor. Should you implement two logistic regression classifiers or one softmax regression classifier?
The solution can be found at the end of the article.
Softmax Regression in Scikit-Learn
The class LogisticRegression can handle both binary and multi-class classification problems. It has a parameter called multi_class, which by default is set to 'auto'. With this setting, Scikit-Learn automatically applies softmax regression whenever it detects that the problem is multi-class and the chosen solver supports optimization of the multinomial loss (all solvers support it except 'liblinear').
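For example, you can make this choice explicit (assuming a Scikit-Learn version in which the multi_class parameter is still accepted; in newer versions the multinomial loss is simply the default for multi-class problems):
from sklearn.linear_model import LogisticRegression

# explicitly request softmax (multinomial) regression with a compatible solver
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')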
Example: Classifying Handwritten Digits
For example, let’s train a softmax regression model on the MNIST data set, which is a widely used data set for image classification tasks.
The data set contains 60,000 training images and 10,000 testing images of handwritten digits. Each image is 28 × 28 pixels in size, and is typically represented by a vector of 784 numbers in the range [0, 255]. The task is to classify these images into one of the ten digits (0–9).
Loading the Data Set
We first fetch the MNIST data set using the fetch_openml() function:
from sklearn.datasets import fetch_openml
X, y = fetch_openml('mnist_784', return_X_y=True, as_frame=False)
Let’s examine the shape of X:
print(X.shape)
(70000, 784)
X consists of 70,000 vectors, each with 784 pixels.
Let’s display the first 50 digits in the data set:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(5, 10, figsize=(10, 5))
i = 0
for ax in axes.flat:
    ax.imshow(X[i].reshape(28, 28), cmap='binary')
    ax.axis('off')
    i += 1

Next, we scale the inputs to be within the range [0, 1] instead of [0, 255]:
X = X / 255
Feature scaling is important whenever you use an iterative optimization method such as gradient descent to train your model.
We now split the data into training and test sets. Note that the first 60,000 images in MNIST are already designated for training, so we can just use simple slicing for the split:
train_size = 60000
X_train, y_train = X[:train_size], y[:train_size]
X_test, y_test = X[train_size:], y[train_size:]
Building the Model
We now create a LogisticRegression classifier with its default settings and fit it to the training set:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)
We get a warning message that the maximum number of iterations has been reached:
ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
Let’s increase max_iter to 1000 (instead of the default 100):
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
This time the learning converged before reaching the maximum number of iterations. We can check how many iterations were needed for convergence by inspecting the n_iter_ attribute:
print(clf.n_iter_)
[795]
It took 795 iterations for the learning to converge.
Evaluating the Model
The accuracy of the model on the training and the test sets is:
import numpy as np

print('Training set accuracy: ', np.round(clf.score(X_train, y_train), 4))
print('Test set accuracy: ', np.round(clf.score(X_test, y_test), 4))
Training set accuracy: 0.9393
Test set accuracy: 0.9256
These results are good, but recent deep neural networks can achieve much better results on this data set (up to 99.91% accuracy on the test set!). The softmax regression model is roughly equivalent to a neural network with a single layer of perceptrons that use the softmax activation function. Therefore, it is not surprising that a deep network can achieve better results than our model.
To better understand the errors of our model, let’s display its confusion matrix on the test set:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
y_test_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_test_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
disp.plot(cmap='Blues')

We can see that the main confusions of the model are between the digits 5⇔8 and 4⇔9. This makes sense, since these digits often resemble each other when written by hand. To help our model distinguish between them, we can add more examples of these digits (e.g., by using data augmentation) or extract additional features from the images (e.g., the number of closed loops in the digit).
We can also print the classification report to get the precision, recall and F1 score for each class:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))
              precision    recall  f1-score   support

           0       0.95      0.97      0.96       980
           1       0.96      0.98      0.97      1135
           2       0.93      0.90      0.91      1032
           3       0.90      0.92      0.91      1010
           4       0.94      0.94      0.94       982
           5       0.90      0.87      0.88       892
           6       0.94      0.95      0.95       958
           7       0.93      0.92      0.93      1028
           8       0.88      0.88      0.88       974
           9       0.91      0.92      0.91      1009

    accuracy                           0.93     10000
   macro avg       0.92      0.92      0.92     10000
weighted avg       0.93      0.93      0.93     10000
As expected, the digits that the model gets the lowest scores on are 5 and 8.
Visualizing the Weights
One of the advantages of softmax regression is that it is highly interpretable (unlike "black box" models such as neural networks). The weight associated with each feature represents the importance of that feature.
For example, we can plot the weights associated with each pixel in each one of the digit classes (wⱼ for each j ∈ {1, …, 10}). This will show us the important segments in the images that are used to detect each digit.
The weights matrix of the model is stored in an attribute called coef_:
print(clf.coef_.shape)
(10, 784)
Row i of this matrix contains the learned weights of the model for class i. We can display each row as a 28 × 28 pixels image in order to examine the weights associated with each pixel in each one of the classes:
fig, axes = plt.subplots(2, 5, figsize=(15, 5))
digit = 0
for coef, ax in zip(clf.coef_, axes.flat):
    im = ax.imshow(coef.reshape(28, 28), cmap='gray')
    ax.axis('off')
    ax.set_title(str(digit))
    digit += 1
fig.colorbar(im, ax=axes.flat)

Pixels with bright shades have a positive impact on the prediction while pixels with dark shades have a negative impact. Pixels with a gray level around 0 have no influence on the prediction (such as the pixels close to the border of the image).
Summary
The pros and cons of softmax regression as compared to other multi-class classification models are:
Pros:
- Provides class probability estimates
- Highly scalable, requiring a number of parameters linear in the number of features
- Highly interpretable (the weight associated with each feature represents its importance)
- Can handle redundant features (by assigning them weights close to 0)
Cons:
- Can find only linear decision boundaries between the classes
- Usually outperformed by more complex models
- Cannot deal with missing values
Solution to the Practice Question
Since the classes are not mutually exclusive (all four combinations of dog/cat with outdoor/indoor are possible), you should train two logistic regression classifiers. This is a multi-label problem, and a single softmax classifier will not be able to handle it.
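A hypothetical sketch of this setup in Scikit-Learn (the feature matrix and the two binary label columns here are random placeholders):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 784))                 # placeholder image features
Y = rng.integers(0, 2, size=(100, 2))      # two binary labels per image: dog/cat, outdoor/indoor

# one logistic regression classifier per label, wrapped in a multi-output classifier
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)
print(clf.predict(X[:3]))                  # two predicted labels per image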
Final Notes
You can find the code examples of this article on my github: https://github.com/roiyeho/medium/tree/main/softmax_regression
All images unless otherwise noted are by the author.
MNIST Dataset Info:
- Citation: Deng, L., 2012. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6), pp. 141–142.
- License: Yann LeCun and Corinna Cortes hold the copyright of the MNIST dataset which is available under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA).
Thanks for reading!