Getting Started
A beginner’s guide to multiclass classification

In my previous article, I talked about binary classification with logistic regression.
We had a list of students’ exam scores and GPAs, along with whether they were admitted to their town’s magnet school.

From this data, we wanted to predict whether Sarah would be admitted. We used logistic regression to find the probability that Sarah would be admitted, which turned out to be 0.665. Thus, we categorized her as "admitted."
Now, what if we introduce a third category: waitlisted?
Here’s our new dataset:

Given Sarah’s GPA of 4.3 and exam score of 79, can we predict whether she’ll be admitted, rejected, or waitlisted?
This is a multiclass classification problem because we’re trying to categorize a data point into one of three categories (rather than one of two).
One algorithm for solving multiclass classification is softmax regression.
This article assumes familiarity with logistic regression and gradient descent. Need a refresher? Read this first.
Logic behind softmax regression
Ultimately, the algorithm is going to find a boundary line for each class. Something like the image below (but not actually the image below):

Note: we as humans can easily eyeball the chart and categorize Sarah as waitlisted, but let’s let the machine figure it out via machine learning, yeah?
Just like in linear and logistic regression, we want the output of the model to be as close as possible to the actual label. Any difference between the label and the output contributes to the "loss" of the model. The model learns by minimizing this loss.
There are 3 classes in this example, so the label of each data point, along with the model’s output, is going to be a vector of 3 values, each associated with an admission status.
If the label is such that:
admitted = [1, 0, 0]
waitlisted = [0, 1, 0]
rejected = [0, 0, 1]
then the output vector will mean:
[probability of being admitted,
probability of being waitlisted,
probability of being rejected]
Thus, in softmax regression, we want to find a probability distribution over all the classes for each datapoint.

We use the softmax function to find this probability distribution:

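With θ_k as the parameter vector for class k, x as the feature vector [1, GPA, exam score], and K classes (3 in our case), the softmax probability for class k is commonly written as:

P(y = k \mid x) = \frac{e^{\theta_k \cdot x}}{\sum_{j=1}^{K} e^{\theta_j \cdot x}}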
Why the softmax function? I think this function is best explained through an example. Let’s look at this data point:
GPA = 4.5, exam score = 90, and status = admitted.
When we train a model, we initialize the model with a guessed set of parameters – theta. Through gradient descent, we optimize those parameters. Because we have 3 classes (admitted, rejected, and waitlisted), we’ll need three sets of parameters. Each class will have its own set of parameters.
Let’s have theta be of the shape:
[bias, weight of GPA, weight of exam score]
Let’s initialize thetas to be:
theta_admitted = [-250, 40, 1]
theta_waitlisted = [-220, 40, 1]
theta_rejected = [-220, 40, 1]
Why those values?
Remember that a line is y = mx + b? With x as GPA and y as exam score, the lines given by the initial thetas (where each theta’s dot product with the data point equals zero) would be:
admitted:
-250 + 40x + y = 0
y = -40x + 250
waitlisted:
-220 + 40x + y = 0
y = -40x + 220
rejected:
-220 + 40x + y = 0
y = -40x + 220
If I just eyeball the data, I can see that the line that separates "admitted" from the rest has a y-intercept around 250 and a slope around -40.
Note: It’s a start, but these parameters are actually never going to work. First, the parameters for waitlisted and rejected are the same, so they will always return the same probability for waitlisted and rejected regardless of the input. Second, only the biases differ, and waitlisted and rejected have a bigger bias than admitted (-220 > -250). Therefore, regardless of the input, these parameters will return essentially 0 for admitted and 0.5 for each of the other two.
But it’s okay to start with bad parameters; gradient descent will fix them!
Let’s visualize what the softmax function is doing.
What happens when we run our datapoint through the softmax equation? Again, our datapoint is: GPA = 4.5, exam score = 90.
First, we find the dot product of the parameters and datapoint:

Then, we exponentiate those values to get rid of any potential negative dot products (exponentiation maps every value to a positive number):

Lastly, we normalize it to get a probability distribution:

Because our initial set of parameters is not good, the model outputs 0.5 for rejected and 0.5 for waitlisted even though the label is admitted.
Essentially, the softmax function normalizes an input vector into a probability distribution. In the example we just walked through, the input vector consists of the dot product of each class’s parameters with the training data point (i.e. [20, 50, 50]). The output is the probability distribution [0, 0.5, 0.5].
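Here’s a minimal NumPy sketch of that three-step computation (the names here are just for illustration, not taken from the implementation linked below):

import numpy as np

def softmax(z):
    """Normalize a vector of scores into a probability distribution."""
    exps = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return exps / np.sum(exps)

# Initial parameters: [bias, weight of GPA, weight of exam score]
thetas = np.array([
    [-250.0, 40.0, 1.0],   # admitted
    [-220.0, 40.0, 1.0],   # waitlisted
    [-220.0, 40.0, 1.0],   # rejected
])

# Data point: GPA = 4.5, exam score = 90 (the leading 1 pairs with the bias)
x = np.array([1.0, 4.5, 90.0])

scores = thetas @ x          # dot products: [20, 50, 50]
print(softmax(scores))       # roughly [0, 0.5, 0.5]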
The machine learning algorithm will adjust the bias, the weight of GPA, and the weight of exam score so that the input vector produces an output distribution that closely matches the label.
What we really want is our model to output something like:

So, let’s change the parameters for all three classes to get better accuracy.
One way to do this is by gradient descent.
Gradient descent works by minimizing the loss function. In linear regression, that loss is the sum of squared errors. In softmax regression, that loss is the sum of distances between the labels and the output probability distributions.
This loss is called the cross entropy. The formula for one data point’s cross entropy is:

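A common way to write it, using the same softmax probability as above, is:

H^{i} = -\sum_{k=1}^{K} 1\{y^{i} = k\} \, \log P\big(y^{i} = k \mid x^{i}\big)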
The inner term 1{y = k} evaluates to 1 if the data point x^i belongs to class k, and to 0 if it does not.
Essentially, this function is measuring how similar the label and output vectors are. Here’s a good blog post that goes into detail about this equation.
The total cross entropy, or loss, will be the sum of all the cross entropies.

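With m data points, that total loss is commonly written as:

J(\theta) = -\sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y^{i} = k\} \, \log P\big(y^{i} = k \mid x^{i}\big)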
We take the derivative of this loss with respect to theta in order to do gradient descent.
The new parameters for class k after each iteration are:

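A standard form of this update, assuming a learning rate α, is:

\theta_k := \theta_k + \alpha \sum_{i=1}^{m} x^{i} \Big( 1\{y^{i} = k\} - P\big(y^{i} = k \mid x^{i}\big) \Big)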
Again, 1{y = k} will be 1 if x^i belongs to class k, and 0 if it does not.
We use this formula to calculate new thetas for each class.
Now, let’s implement the algorithm to arrive at optimal parameters theta.
Softmax implementation
I implemented the softmax regression for my example here:
The way you’d test it in a terminal is:
>>> dataset = [...] # copy it from the gist
>>> from softmax_regression import SoftmaxRegression
>>> s = SoftmaxRegression(dataset)
>>> s.iterate()
Each iteration calculates the total cross entropy and gets new parameters for each class.
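As a rough sketch of what one such iteration could look like (this is not the gist’s code; the names and the learning rate are assumptions, and the data is assumed to be a feature matrix X with a leading column of 1s plus one-hot labels Y):

import numpy as np

ALPHA = 0.0001  # learning rate (an assumption; tune as needed)

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def iterate(thetas, X, Y):
    """One gradient-descent step over the whole dataset.

    thetas: (K, n) parameters, one row per class
    X:      (m, n) features, first column all 1s for the bias
    Y:      (m, K) one-hot labels
    """
    probs = np.apply_along_axis(softmax, 1, X @ thetas.T)  # (m, K) predicted distributions
    loss = -np.sum(Y * np.log(probs + 1e-12))              # total cross entropy
    grad = (Y - probs).T @ X                               # matches the update formula above
    return thetas + ALPHA * grad, loss

Calling a step like this over and over should drive the total cross entropy down and push the parameters toward values like the ones below.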
After many, many, MANY iterations, and some tweaking of the initial parameters, I was able to arrive at these parameters:
theta_admitted = [-392.56407961, 56.75483745, 2.01880429]
theta_waitlisted = [-200.59246564, 33.92260307, 0.89946962]
theta_rejected = [-157.84345476, 26.32255948, 0.70172608]
Let’s test these parameters with the aforementioned datapoint: GPA = 4.5, exam score = 90, and status = admitted. The model should output a value close to 1 for admitted and 0 for the other two statuses.
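Reusing the softmax function and import from the earlier sketch, that check might look like this (again, purely illustrative):

thetas = np.array([
    [-392.56407961, 56.75483745, 2.01880429],   # admitted
    [-200.59246564, 33.92260307, 0.89946962],   # waitlisted
    [-157.84345476, 26.32255948, 0.70172608],   # rejected
])
x = np.array([1.0, 4.5, 90.0])
print(softmax(thetas @ x))   # should be very close to [1, 0, 0]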

Ah! So much better.
Now, let’s predict whether Sarah will be admitted!
Here’s the probability distribution for GPA = 4.3 and exam score = 79:

Sarah is waitlisted. Sad. But we already knew that was the case.
Here’s the plot with the boundary lines defined by the parameters.

Wait a second… these do not look like clean boundaries. Additionally, Sarah (in gray) looks to be grouped with the green dots (admitted students). Yet the math indeed gave Sarah a 99.15% probability of being waitlisted. 🤔
Honestly, this caught me by surprise. I’ve been trying to find a good explanation for how to interpret the parameters geometrically, but so far, not too much luck.
If you have a good explanation for why softmax regression doesn’t produce clean boundaries, please comment below.
Softmax regression, along with logistic regression, isn’t the only way of solving classification problems. These models are great when the data is more or less linearly separable. When the data is not linearly separable, however, we turn to other methods such as support vector machines, decision trees, and k-nearest neighbors.
In a later article, I will compare different learning algorithms for solving classification problems and talk about the pros and cons of each. Stay tuned!